Choosing basecaller version

LynnLy commented 3 years ago

Hi! I am trying to do methylation binning - I see in your FAQ that my native DNA and WGA datasets should use the same basecaller and version. My datasets were generated at different times and basecalled with different versions - I can rebasecall them with the same version, but I only see a place to specify the fast5 files and not basecalled fastq files.

Is there a way to specify which basecalled fastqs I want to use, and to only rely on the fast5 for signal information? Or, a way to make sure the fastqs used from the fast5s are the correct ones, because the same fast5 files may be basecalled twice and hold two sets of fastq data?

Thank you!

touala commented 3 years ago

Hello @LynnLy,

Unfortunately, the current implementation does not provide a way to select which basecalling version to use. nanodisco interacts with fast5 files at two steps: nanodisco preprocess (extracts read sequences) and nanodisco difference (aligns signal with nanopolish). I've confirmed with Jared that nanopolish only needs the reads from the desired version. I'll try to add the feature to nanodisco preprocess today, but if you are in a hurry you can basecall the fast5 files again so that the same version is found in Basecall_1D_000 for both datasets.
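To check which basecalls a fast5 file already holds, its analysis groups can be listed. The sketch below is not part of nanodisco. It assumes the `h5ls` tool from the HDF5 command-line utilities is installed, and `list_basecall_groups` is a hypothetical helper name. It only filters an `h5ls -r` listing for `Basecall_1D_NNN` group names:

```shell
# Hypothetical helper (not part of nanodisco): read an `h5ls -r` listing on
# stdin and print each distinct Basecall_1D_NNN group name once.
list_basecall_groups() {
  sed -n 's/.*\(Basecall_1D_[0-9][0-9]*\).*/\1/p' | sort -u
}

# Usage (assumes h5ls is installed; the path is a placeholder):
#   h5ls -r path/to/read.fast5 | list_basecall_groups
```

If more than one group is printed, the file holds more than one set of basecalls.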

Alan

touala commented 3 years ago

I have implemented the new feature, which I think addresses the issue you raised. Two files need to be replaced: extract.R and preprocess.sh. You can find them in the .zip archive downloaded below and integrate them into the container as follows:

wget https://github.com/fanglab/nanodisco/files/5011419/nanodisco_feature.zip # Download .zip mentioned above
unzip nanodisco_feature.zip

# Create a writable temporary container (directory) named nd_tmp, ~5 min
singularity build --sandbox nd_tmp nanodisco.sif

mv extract.R preprocess.sh nd_tmp/home/nanodisco/code # Replace the scripts with the versions that add the new feature
chmod 755 nd_tmp/home/nanodisco/code/* # Set proper permissions

# Create a new container with the additional feature
singularity build nd_env nd_tmp

You can now provide the --basecall_version option (<basecaller:version>) to specify which basecalling version you want to use (e.g. Guppy:4.0.14 or Albacore:2.3.4). In your case, nanodisco preprocess can be executed as follows:

nanodisco preprocess -p <nb_threads> -f <path_fast5> -s <name_sample> -o <path_output> -r <path_reference_genome> --basecall_version <basecaller:version>
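Since the native and WGA datasets should use the same basecaller and version, it may help to build both commands from a single version string. The following is a minimal sketch, not part of nanodisco: the helper names `is_valid_basecall_version` and `preprocess_cmd` are hypothetical, the format check (one colon with non-empty text on both sides) is an assumption based only on the examples above, and the paths in the usage comment are placeholders. It prints the command rather than running it:

```shell
# Hypothetical check: accept strings shaped like <basecaller:version>,
# e.g. Guppy:4.0.14. The exact rules nanodisco applies are not assumed here.
is_valid_basecall_version() {
  case "$1" in
    *:*:*) return 1 ;;  # more than one colon
    ?*:?*) return 0 ;;  # non-empty name, colon, non-empty version
    *)     return 1 ;;
  esac
}

# Print (do not run) the preprocess command for one sample.
# Arguments: threads, fast5 dir, sample name, output dir, reference, version.
preprocess_cmd() {
  if ! is_valid_basecall_version "$6"; then
    echo "expected <basecaller:version>, got: $6" >&2
    return 1
  fi
  printf 'nanodisco preprocess -p %s -f %s -s %s -o %s -r %s --basecall_version %s\n' \
    "$1" "$2" "$3" "$4" "$5" "$6"
}

# Usage with placeholder paths: same version for both datasets.
#   version="Guppy:4.0.14"
#   preprocess_cmd 4 fast5_native native out reference.fasta "$version"
#   preprocess_cmd 4 fast5_wga    wga    out reference.fasta "$version"
```

Printing the commands first makes it easy to confirm that both samples carry an identical --basecall_version value before launching the runs.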

Please let me know if this solves your issue or if you have any additional questions.

Alan

LynnLy commented 3 years ago

Great! I just tested this out with a rebasecalled Guppy dataset with the --fast5_out option, and it seems to be working. Thanks!
