fanglab / nanodisco

nanodisco: a toolbox for discovering and exploiting multiple types of DNA methylation from individual bacteria and microbiomes using nanopore sequencing.

Changes in coverage before and after Nanodisco preprocessing #29

Closed GeorgiaBreckell closed 1 year ago

GeorgiaBreckell commented 2 years ago

Hi Alan,

We normalized our fast5 coverage across multiple samples to around 120x prior to running Nanodisco. We checked coverage with BWA, using the same commands Nanodisco runs. When we looked at the coverage reported by the preprocessing BAM outputs, we observed that coverage dropped by 20-40x depending on the sample.
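For context, a coverage check of this kind might be sketched as follows (assuming samtools is available; mapped.sam and the thread count are placeholders, not necessarily the exact commands used):

# sort and index the bwa mem output (placeholder file names)
samtools sort -@ 4 -o mapped.sorted.bam mapped.sam
samtools index mapped.sorted.bam
# mean depth over all reference positions, including zero-coverage sites (-a)
samtools depth -a mapped.sorted.bam | awk '{sum+=$3} END {print "mean coverage:", sum/NR}'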

I'm assuming some of the fast5 reads are not being converted to fasta, or are being filtered out, but we aren't sure why or what we can change to avoid this.

Do you have any insight into why this might have occurred and how we can avoid this?

Regards

touala commented 2 years ago

Hi @GeorgiaBreckell,

I'm surprised that it doesn't give you similar coverage. I often downsample high-coverage datasets and I don't remember ever seeing a >10% discrepancy. Also, the data is not filtered during the nanodisco preprocessing command; all alignments are kept. Am I right that you used the following command to perform the original mapping? Which bwa version was used (nanodisco v1.0.3 uses 0.7.15)?

bwa mem -t $nb_threads -x ont2d $path_reference_genome $path_fasta 

How did you proceed to downsample the dataset? Sometimes I use the fast5_subset command from the ONT API (ont_fast5_api) to generate a new set of fast5 files. Otherwise, you can directly take a random sample of the .fasta file output by nanodisco preprocessing and remap that subset of reads. Both approaches should give you similar results. Do you have a way to check whether all the expected reads are found after downsampling + preprocessing?
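For example, a rough sketch of that check (read_ids.txt and the paths are placeholders; fast5_subset is the tool shipped with ont_fast5_api):

# subset the fast5 files to the read IDs kept when downsampling (one ID per line)
fast5_subset --input fast5_all/ --save_path fast5_downsampled/ --read_id_list read_ids.txt
# compare the number of reads requested with the number that reach the preprocessed fasta
wc -l read_ids.txt
grep -c "^>" downsampled.fasta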

Maybe the discrepancy comes from how the initial coverage was computed? How do you compute it? Is it using only the reads in the pass folder, for example?
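For instance, comparing pass-only against pass + fail read counts would show whether that alone explains the gap (a sketch with placeholder paths, assuming the basecaller's usual fastq output folders):

# fastq records are 4 lines each; NR accumulates across all files given to awk
awk 'END {print NR/4, "pass reads"}' pass/*.fastq
awk 'END {print NR/4, "pass + fail reads"}' pass/*.fastq fail/*.fastq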

Sorry for the barrage of questions; I'm not sure where to look. I suppose you could, by default, add back a 20% margin when downsampling, but that could hide an important issue.

Alan