fanglab / nanodisco

nanodisco: a toolbox for discovering and exploiting multiple types of DNA methylation from individual bacteria and microbiomes using nanopore sequencing.

Preprocessing not possible with extracted fast5 files #46

Closed SiegCoppens closed 2 years ago

SiegCoppens commented 2 years ago

Dear

We are running nanodisco on a dataset whose native fast5 files are a subset of the original fast5 dataset (which held multiple barcodes within a single file). This subset was obtained using the ONT fast5_subset command. However, whenever we run these datasets, we get the following error in the preprocessing step:

[2022-05-25 16:41:28] Extract sequences from fast5.
Warning message:
In extract.sequence(path_input, base_name, path_output, nb_threads,  :
  1 reads weren't basecalled.
No reads were extracted. Please check that -f/--path_fast5 is correct.

In contrast, the amplified dataset (which was not subset from a larger fast5 file but was generated as a single-barcode fast5 after sequencing) is processed properly. Could the subset fast5 format differ in a way that the nanodisco preprocess command cannot handle?

Thank you in advance. Regards Nick

touala commented 2 years ago

Hi @SiegCoppens,

Thank you for using nanodisco, we will try to sort this out.

When the original basecalled file is multiplexed (multiple barcodes), I usually run the demux_fast5 command instead of fast5_subset. I have a short explanation of this situation in the nanodisco FAQ (Q14). You can also run the command with the additional -c gzip option if the fast5 files were compressed with vbz.

Please let me know if this does not fix the issue.

Regards,

Alan

SiegCoppens commented 2 years ago

Hi @touala !

Thank you for the quick response. We also tried to run nanodisco on the full original fast5 dataset, and this gave the same error:

Warning message:                                                              
In extract.sequence(path_input, base_name, path_output, nb_threads,  :
  601 reads weren't basecalled.
No reads were extracted. Please check that -f/--path_fast5 is correct.

This tells us the issue is unrelated to the demultiplexing or subsetting of the original fast5.

We tried to perform the basecalling manually with the guppy_basecaller command and this runs without any issues.

This native fast5 dataset was generated in January 2020, so it is 2 years old. Could this be the problem? Our amplified dataset, which was generated more recently, runs without any issues.

Thank you! Regards, Sieglinde

touala commented 2 years ago

Hi @SiegCoppens,

I don't think the age of the fast5 files is the problem. Was their basecalling done with the --fast5_out option? If that is the case, could you consider sharing one fast5 file by email so I can take a look?

Best,

Alan
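
One way to check whether a fast5 file carries basecalls before running the preprocessing step is to look for a basecall analysis group under each read. The following is a minimal sketch, not part of nanodisco. It assumes the multi-read fast5 layout (top-level groups named read_<id>, each with an optional Analyses group containing entries such as Basecall_1D_000) and the third-party h5py package; the function names count_basecalled and list_analyses are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple


def count_basecalled(analyses_by_read: Dict[str, Iterable[str]]) -> Tuple[int, int]:
    """Return (basecalled, total) given each read's analysis group names."""
    total = 0
    basecalled = 0
    for groups in analyses_by_read.values():
        total += 1
        # Assumption: basecalls written by the basecaller live in a group
        # whose name starts with "Basecall_1D" (for example Basecall_1D_000).
        if any(name.startswith("Basecall_1D") for name in groups):
            basecalled += 1
    return basecalled, total


def list_analyses(fast5_path: str) -> Dict[str, List[str]]:
    """Collect analysis group names per read from a multi-read fast5 file."""
    import h5py  # third-party; imported lazily so count_basecalled works without it

    result: Dict[str, List[str]] = {}
    with h5py.File(fast5_path, "r") as handle:
        for read_name, read_group in handle.items():
            # Assumption: multi-read layout, one top-level group per read.
            if not read_name.startswith("read_"):
                continue
            analyses = read_group.get("Analyses")
            result[read_name] = list(analyses.keys()) if analyses is not None else []
    return result


# Example with hand-written group names instead of a real file.
example = {"read_a": ["Basecall_1D_000", "Segmentation_000"], "read_b": []}
print(count_basecalled(example))  # (1, 2)
```

For a real file, count_basecalled(list_analyses("some_file.fast5")) would report how many reads contain basecalls; a result of zero basecalled reads would be consistent with the "reads weren't basecalled" warning above.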

SiegCoppens commented 2 years ago

Hi @touala ,

We have now basecalled our fast5 files using guppy_basecaller with the --fast5_out option and used the resulting fast5 files as input for the nanodisco preprocessing. This solved our issue.

Thank you very much for your help!

Best, Sieglinde
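
The fix described above can be sketched as a basecalling command that includes --fast5_out. This is an illustrative sketch only: the directory names and the configuration file name are placeholders, not values from this thread, and the helper guppy_fast5_out_command is hypothetical. The right config depends on the flow cell and kit, and the flags should be checked against the installed Guppy version.

```python
import shlex
from typing import List


def guppy_fast5_out_command(input_dir: str, save_dir: str, config: str) -> List[str]:
    """Build a guppy_basecaller command line that also writes basecalled fast5 files."""
    return [
        "guppy_basecaller",
        "--input_path", input_dir,  # directory with the raw fast5 files
        "--save_path", save_dir,    # output directory for the basecalling results
        "--config", config,         # placeholder; choose the config matching the run
        "--fast5_out",              # write basecalls back into fast5 files
    ]


# Placeholder paths and config name, for illustration only.
command = guppy_fast5_out_command("raw_fast5", "basecalled", "dna_r9.4.1_450bps_hac.cfg")
print(shlex.join(command))
```

The printed command can be run in a shell or passed to subprocess.run(command, check=True). The basecalled fast5 files written by Guppy are then the ones to pass to nanodisco preprocess through -f/--path_fast5.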