fanglab / nanodisco

nanodisco: a toolbox for discovering and exploiting multiple types of DNA methylation from individual bacteria and microbiomes using nanopore sequencing.

Providing basecalled reads in a FASTQ file #56

Open hasindu2008 opened 2 years ago

hasindu2008 commented 2 years ago

Is it possible to make nanodisco accept a FASTQ file that contains basecalled reads, rather than extracting them from FAST5 files? That way, rebasecalling with --fast5-out would no longer be necessary, I believe?

touala commented 2 years ago

Hello @hasindu2008,

Unfortunately, this is not readily implementable, but it should be doable "by hand". I made this design choice a while ago because I found it to be the least error-prone: it ensures that the fast5, fastq, and bam match. But it is indeed less efficient. Please let me know if you want a high-level alternative solution.

Best,

Alan

hasindu2008 commented 2 years ago

Do you only need the basecalled read in the FAST5 file generated with --fast5-out, or do you rely on the move table as well?

touala commented 2 years ago

Basically, we need to be able to execute nanopolish eventalign to align events to the reference. The fastq reads are extracted with the path to the fast5 in each read's header, which I found, at the time, to be efficient for indexing. I don't know if this is still the case.
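
As a rough illustration of the header-based indexing described above, the sketch below pulls a read-ID-to-FAST5 lookup table out of a FASTQ file. The header layout (`@<read_id> <fast5_path>`) and the function name `fastq_to_readdb` are assumptions made for illustration and may not match nanodisco's actual format.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: build a "read_id<TAB>fast5_path" table from a FASTQ
# whose header lines are ASSUMED to look like "@<read_id> <fast5_path>".
# Assumes strict 4-line FASTQ records and paths without spaces.
fastq_to_readdb() {
    # Header lines are every 4th line, starting at line 1.
    awk 'NR % 4 == 1 { sub(/^@/, "", $1); print $1 "\t" $2 }' "$1"
}
```

For example, `fastq_to_readdb reads.fastq > reads.tsv` would write one line per read, where `reads.fastq` and `reads.tsv` are placeholder file names.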

hasindu2008 commented 2 years ago

Ohh, I suggest trying f5c in place of nanopolish: both indexing (no need to have the path in the header) and event alignment will be much faster (~3-5X), with near-identical results.

f5c index -d fast5_dir in_fasta -t num_threads --iop num_threads
f5c eventalign -t num_threads --iop num_threads --scale-events -n -r in_fasta -b in_bam -g tmp_genome

You can make it 10X faster if you switch to the BLOW5 format, with added advantages such as fewer backward-compatibility headaches and a lot less unnecessary dev time. slow5tools can be used to streamline many signal merge/split/get operations, and both nanopolish and f5c are compatible with the BLOW5 format.

f5c index -t num_threads in_fasta --slow5 signals.blow5
f5c eventalign -t num_threads --iop num_threads --scale-events -n -r in_fasta -b in_bam -g tmp_genome --slow5 signals.blow5
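
The `signals.blow5` file used above has to be produced from the FAST5 files first. A minimal sketch of that conversion with slow5tools follows. The directory names, the `fast5_to_blow5` function, and the `DRY_RUN` switch are placeholders introduced here, and the exact options should be checked against the installed slow5tools version.

```shell
#!/usr/bin/env bash
# Sketch: convert a directory of FAST5 files into one merged BLOW5 file.
# The function names and the DRY_RUN switch are illustrative, not part of
# slow5tools or nanodisco.
run() {
    # Print the command instead of executing it when DRY_RUN=1.
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

fast5_to_blow5() {
    local fast5_dir=$1 blow5_dir=$2 out_blow5=$3
    # Step 1: convert each FAST5 file into a BLOW5 file in blow5_dir.
    run slow5tools f2s "$fast5_dir" -d "$blow5_dir" || return 1
    # Step 2: merge the per-file BLOW5 files into a single output file.
    run slow5tools merge "$blow5_dir" -o "$out_blow5"
}
```

Running `DRY_RUN=1 fast5_to_blow5 fast5_dir blow5_tmp signals.blow5` prints the two commands without executing them, which is a cheap way to check the paths before starting a long conversion.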

In your previous response, can you please explain what you meant by the fast5, fastq, and bam matching? Is each multi-fast5 file run separately with nanopolish in your script, or do you concatenate all the FASTQ files and then run one nanopolish instance?

ecpierce commented 2 years ago

Hi @touala,

I am having trouble generating files with --fast5-out and so have a similar question. Can you clarify what you mean by "but should be doable by hand"? I may need to go this route. I have basecalled fastq files. I agree with hasindu that this solution may become important since it seems like nanopore is planning to remove the fast5-out option.

Thanks! Emily

jflopezfernandez commented 1 year ago

@ecpierce Hi, Emily, we ran into the --fast5-out option deprecation problem ourselves, and we opted to just download an older version of Guppy rather than working out how to use *.fastq files. As of this writing, version 6.4.2 looks like the most recent release, but version 6.2.1 is the last one prior to the deprecation of the --fast5-out option in version 6.3+.
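
For scripted setups, the version cutoff mentioned above could be checked along these lines. This is only a sketch: the 6.3 cutoff is taken from this comment rather than from ONT release notes, the function name `supports_fast5_out` is made up for illustration, and `sort -V` assumes a sort implementation with version ordering (for example GNU coreutils).

```shell
#!/usr/bin/env bash
# Sketch: does a given Guppy version predate the 6.3 cutoff quoted above?
# The cutoff value is an assumption taken from this thread.
supports_fast5_out() {
    local version=$1 cutoff="6.3"
    # The version qualifies if it differs from the cutoff and sorts
    # before it in version order.
    [ "$version" != "$cutoff" ] &&
        [ "$(printf '%s\n%s\n' "$version" "$cutoff" | sort -V | head -n 1)" = "$version" ]
}
```

With this, `supports_fast5_out 6.2.1` succeeds and `supports_fast5_out 6.4.2` fails, so it can guard a basecalling step in a wrapper script.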

ecpierce commented 1 year ago

@jflopezfernandez thank you for your response! That is the solution I ended up using. It would be useful, though, if the nanodisco developers would consider working on a long-term solution so that it will be compatible with even newer Guppy versions in the future. It seems like Dorado uses the pod5 format. I'm not sure how that would impact things, but I guess it is something else to consider if nanodisco is going to be actively maintained. Really appreciate this awesome program!

fanggang commented 1 year ago

Thank you very much for sharing your experience and solutions with other users, Jose!

For the question from Emily: we are very much encouraged by the broad interest in Nanodisco, and yes, we are committed to maintaining it in the long term. That said, because Nanopore software and kits are constantly evolving, our strategy (given the finite resources we have) is to 1) use Singularity to ensure that the current package versions are compatible and the entire workflow works reliably, and 2) release major upgrades: these will not be frequent (given the nature of nanopore software/kit evolution explained above), but we will do them for major milestones!

Best, Gang

ecpierce commented 1 year ago

@fanggang that makes sense. I appreciate your work and am glad to hear you are committed to maintaining it!