artic-network / fieldbioinformatics

The ARTIC field bioinformatics pipeline
MIT License

coverage-based instead of counter-based normalisation #71

Open MarkusHaak opened 3 years ago

MarkusHaak commented 3 years ago

This pull request addresses normalisation problems we encountered while experimenting with sequencing SARS-CoV-2 using long amplicons (https://www.biorxiv.org/content/10.1101/2020.05.28.122648v3) and rapid sequencing kits. In these cases, the amplicon coverage essentially follows a normal distribution, and counter-based normalisation often leads to low-coverage terminal regions close to the overlaps of two amplicons.

Instead of simply counting the number of reads for each primer pair, the coverage of both strands is tracked in terms of the start and end points of alignments. A read is dropped only if the strand-specific coverage at every position in its aligned region is already equal to or above the requested normalisation threshold. In most cases, this should only marginally change the behaviour of the align_trim script: the normalisation threshold becomes a lower bound instead of an upper bound.
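To illustrate the drop rule described above, here is a minimal sketch. It is not the code in this pull request: the class `CoverageNormaliser` and the method `keep_read` are hypothetical names, and it stores a plain per-position depth list per strand rather than tracking alignment start and end points, which is an assumption made to keep the example short.

```python
from typing import Dict, List


class CoverageNormaliser:
    """Hypothetical helper: per-strand, per-position depth tracking."""

    def __init__(self, ref_length: int, threshold: int) -> None:
        self.threshold = threshold
        # One depth list per strand; key is True for reverse-strand reads.
        self.coverage: Dict[bool, List[int]] = {
            False: [0] * ref_length,
            True: [0] * ref_length,
        }

    def keep_read(self, start: int, end: int, is_reverse: bool) -> bool:
        """Return True if the read covering [start, end) should be kept."""
        cov = self.coverage[is_reverse]
        # Drop only if EVERY aligned position already meets the threshold.
        # (An empty interval is treated as fully covered and is dropped.)
        if all(depth >= self.threshold for depth in cov[start:end]):
            return False
        # Otherwise keep the read and record its contribution to coverage.
        for pos in range(start, end):
            cov[pos] += 1
        return True


# Toy data (start, end, is_reverse); the coordinates are made up.
norm = CoverageNormaliser(ref_length=100, threshold=2)
reads = [
    (0, 60, False),
    (0, 60, False),
    (0, 60, False),   # every position already at depth 2: dropped
    (40, 100, False), # positions 60-99 still below threshold: kept
    (0, 60, True),    # reverse strand is tracked separately: kept
]
kept = [norm.keep_read(s, e, r) for s, e, r in reads]
print(kept)  # [True, True, False, True, True]
```

After the fourth read, forward-strand depth at positions 40 to 59 is 3, which is above the threshold of 2. This is what is meant by the threshold acting as a lower bound: a read is kept whenever it still contributes to an under-covered position, even if it also pushes other positions past the threshold.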

While the coverage is tracked for each strand individually, it is currently not tracked individually for each amplicon in overlap regions. I cannot think of a scenario where this would be problematic, but I wanted to mention it in case it matters for some use case.