artic-network / fieldbioinformatics

The ARTIC field bioinformatics pipeline
MIT License

Fix align_trim to deal with long deletion containing left primer #111

Open hsnguyen opened 2 years ago

hsnguyen commented 2 years ago

Recently we had an issue with artic generating consensus sequences with a significant number of abnormal SNPs called in the ORF8 region.

Double-checking the alignment, we found an extremely long deletion of 197 nt, spanning 27984-28180 inclusive, that fully contains the SARS-CoV-2_94_LEFT primer from v4.1 (27996-28021). The top track in the Figure below presents the true alignment from minimap2 (sorted.bam).

align_trim doesn't seem to handle this situation well, resulting in a wrong CIGAR after the soft masking. The misalignment then creates the false SNPs in this region, as shown in the middle track of the Figure (_false.trimmed.rg.sorted.bam).

The fix is simple because it only covers this bug; it does not attempt to call the long deletion correctly (I also understand that D can't appear right after S in a valid CIGAR string). After trimming in between the primer 94 pair (highlighted in red) from the start using the fixed version, the deletion is gone too (bottom track). Please double-check, as I'm not fully aware of other situations where this change might affect the trimming in an unexpected way.
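To make the situation concrete, here is a minimal Python sketch of left-end soft masking when a deletion swallows the primer. It is not the patch in this PR and not align_trim's actual code: `parse_cigar`, `soft_mask_start`, the 0-based coordinates and the example read are all hypothetical, chosen only to mirror the numbers above. The idea is to drop a deletion that the trim point falls inside (or that would end up directly after the soft clip) and move the alignment start past it.

```python
import re

# CIGAR operations that consume query / reference bases (per the SAM spec).
CONSUMES_QUERY = set("MIS=X")
CONSUMES_REF = set("MDN=X")


def parse_cigar(cigar):
    """Split a CIGAR string such as '10M197D50M' into [(length, op), ...]."""
    return [(int(n), op) for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)]


def soft_mask_start(pos, cigar, trim_to):
    """Soft-clip the left end of an alignment up to reference position trim_to.

    pos     -- 0-based reference start of the alignment
    cigar   -- CIGAR string (hard clips / padding are not supported here)
    trim_to -- first 0-based reference position to keep
    Returns (new_pos, new_cigar).
    """
    ops = parse_cigar(cigar)
    clipped = 0  # query bases converted to soft clip so far
    ref = pos    # reference coordinate of the next unconsumed operation

    while ops and ref < trim_to:
        length, op = ops.pop(0)
        if op in CONSUMES_REF and op in CONSUMES_QUERY:
            # Match-like op: clip only the part left of trim_to.
            take = min(length, trim_to - ref)
            clipped += take
            ref += take
            if take < length:
                ops.insert(0, (length - take, op))
        elif op in CONSUMES_REF:
            # Deletion/skip: no query bases to clip. Swallow it whole,
            # even if it extends past trim_to, because D cannot follow S.
            ref += length
        elif op in CONSUMES_QUERY:
            # Insertion or existing soft clip: query bases only.
            clipped += length
        else:
            raise ValueError(f"unsupported CIGAR operation: {op}")

    # Anything that is not an aligned base directly after the clip is folded in.
    while ops and ops[0][1] in "DNI":
        length, op = ops.pop(0)
        if op == "I":
            clipped += length
        else:
            ref += length

    new_ops = ([(clipped, "S")] if clipped else []) + ops
    return ref, "".join(f"{n}{op}" for n, op in new_ops)


# Hypothetical read: 10 matched bases, then the 197 nt deletion
# (1-based 27984-28180), then 50 matched bases. The trim point is just
# past the primer end (1-based 28021), which lies inside the deletion.
print(soft_mask_start(27973, "10M197D50M", 28021))
```

With these inputs the read ends up starting after the deletion, so the deletion disappears from the trimmed alignment, which matches the behaviour described for the bottom track; the deletion itself is not called.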

[Figure: artic-PR]

I'm using the artic-1.3.0-dev branch, but the same issue is also present on the master branch, so you might need to apply the fix there as well, if applicable. Of course, if there's a way to capture the deletion in the consensus sequence, that would be much better. @mjsull @nickloman @BioWilko @will-rowe

Thanks,