broadinstitute / viral-classify

viral-ngs: taxonomic identification, classification, and filtering steps
Other
1 stars 0 forks source link

deplete_bwa_bam removes all single-ended reads #5

Closed dpark01 closed 4 years ago

dpark01 commented 4 years ago

BWA-based depletion is implemented a bit differently than the other depletion tools. Instead of finding hits and using Picard FilterSamReads to subset the original bam, this depletion implementation aligns with bwa to directly produce bam output, filters out the mapped reads (in proper pairs), keeps the unmapped ones, and runs Picard RevertSam for cleanup.

From what I can tell, it appears that single-end reads do not survive this step, regardless of how they map to the depletion database.

Example log file: https://platform.dnanexus.com/projects/Fk2gKXQ0GY2Z20FQ6x0p98JV/monitor/job/FkyPBBQ0GY2fpg2f5GXvvBfj

tomkinsc commented 4 years ago

I took a look at this; the bwa step allows single-end reads through just fine (based on the intermediate outputs when run using the input from the linked example); the samtools -F0x2 filter returns all reads that are not members of fully-mapped pairs. That means some pairs may be allowed through where with only one mate maps to the depletion database. Fairly permissive at preserving data for downstream processes (maybe more permissive than we should be), but it does not exclude single-end reads, whether alone or as part of a pair where only one read maps.

It seems the problem was actually with the mvicuna step, and specifically how read IDs were processed. Code from long ago expects read IDs to have a /1 mate suffix, and only includes those that do, which single-end reads do not have (nor interleaved fastqs, but that's a separate issue and we're generating the fastqs for this internal function ourselves so interleaving should not be an issue). So no single-end IDs were preserved.

A separate issue is that mvicuna does not de-duplicate single-end reads at all if the code comments are correct, which should be addressed in the future when we merge https://github.com/broadinstitute/viral-ngs/pull/970.

I'll have a PR to close this issue soon.