fanglab / nanodisco

nanodisco: a toolbox for discovering and exploiting multiple types of DNA methylation from individual bacteria and microbiomes using nanopore sequencing.

Guppy compatibility #17

Closed nathanjohns closed 2 years ago

nathanjohns commented 3 years ago

Hello,

I'm planning on running nanodisco on isolate genomes for which I recently generated native and WGA data. My lab has a fast Guppy GPU Docker image for demultiplexing and basecalling the raw multi-read fast5 files, and I would prefer not to have to figure out how to work with Albacore. I got nanodisco working on the provided E. coli test data, and I just wanted to make sure it will work for my data, at least for the de novo motif discovery. I understand that the methylation type fine mapping for Guppy is still being trained.

Nathan

touala commented 3 years ago

Hello Nathan,

For the de novo motif detection, we expect the best results from the more accurate basecaller (we tested Albacore v1.1.0, Albacore v2.3.4, and Guppy 3.2.4). Therefore, using the latest Guppy (high-accuracy model) is likely the better approach. Regarding the methylation typing and fine mapping, while the current model is indeed trained on Albacore datasets, we obtained excellent results using Guppy v4.2.2 basecalled data in one of our recent analyses.

Please feel free to reach back if you face any issue.

Regards,

Alan

nathanjohns commented 3 years ago

Thanks for the clarification. I've gone ahead and started to implement this for a strain of interest as a pilot. The preprocess function makes .fasta, .sorted.bam, and .sorted.bam.bai files for both the _NAT and _WGA samples. They appear to have 288x and 211x coverage. However, when I run the difference function, it starts to create the relevant temporary files in /difference_subset for my WGA sample but not for my native library. The command

nanodisco difference -nj 2 -nc 1 -p 2 -f 100 -l 115 -i analysis/preprocessed_subset/ -o analysis/difference_subset -w BT_WGA -n BT_NAT -r raw_data/ref_genomes/BT.fasta

gives the following error output, which repeats for each chunk and does not result in any .rds files:

Computer:jobs running/jobs completed/%of started jobs/Average seconds to complete
local:2/0/100%/0.0s
Error in { : task 1 failed - "no applicable method for 'droplevels' applied to an object of class "c('integer', 'numeric')""
Calls: %do% ->
Execution halted
local:2/1/100%/359.0s
Error in { : task 1 failed - "no applicable method for 'droplevels' applied to an object of class "c('integer', 'numeric')""
Calls: %do% ->
Execution halted

I've uploaded some of the files here: https://www.dropbox.com/sh/qj7rhok95ik6xd7/AAD-B7tEs8q4v5Zo0VvI_ozSa?dl=0

This error does not happen when I run the provided E. coli test data. Please let me know if you have any idea what's going wrong here.

Nathan

touala commented 3 years ago

Hi Nathan,

Thank you for providing the temporary files, this was really helpful.

I think I've pinpointed the issue. In the nanodisco implementation, I assumed that contigs would be named using character strings, and issues arise when contig names consist of integers only. I will include a failsafe in the next version. Meanwhile, I think the simplest solution for you would be to rename the contigs in the reference .fasta from <int> to contig_<int> and restart the preprocessing step.
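For anyone hitting the same error, the renaming described above could be done with a short script along these lines. This is only a minimal sketch and not part of nanodisco: the function name `rename_integer_contigs` and the `prefix` parameter are made up for illustration, and it assumes a plain-text FASTA where the contig name is the first whitespace-delimited token after `>`.

```python
import re

# Hypothetical helper (not a nanodisco function): prefix contig names that
# consist only of digits, and copy every other line unchanged.
def rename_integer_contigs(fasta_in, fasta_out, prefix="contig_"):
    header = re.compile(r">([0-9]+)(\s.*)?$")
    renamed = 0
    with open(fasta_in) as src, open(fasta_out, "w") as dst:
        for line in src:
            if line.startswith(">"):
                match = header.match(line.rstrip("\r\n"))
                if match:
                    # Keep any description that follows the integer name.
                    description = match.group(2) or ""
                    line = f">{prefix}{match.group(1)}{description}\n"
                    renamed += 1
            dst.write(line)
    # Number of headers that were rewritten.
    return renamed
```

For example, `rename_integer_contigs("raw_data/ref_genomes/BT.fasta", "raw_data/ref_genomes/BT.renamed.fasta")` would write a renamed copy (the output file name here is arbitrary). The preprocessing step would then need to be rerun against the renamed reference so that the .bam files use the new contig names.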

Please let me know if this fixes the issue or if you have any other question.

Regards,

Alan