cfe-lab / MiCall

Pipeline for processing FASTQ data from an Illumina MiSeq to genotype human RNA viruses like HIV and hepatitis C
https://cfe-lab.github.io/MiCall
GNU Affero General Public License v3.0

Decide what to do with coverage reporting in presence of large deletions. #1193

Open Donaim opened 3 weeks ago

Donaim commented 3 weeks ago

Decide what to do with coverage reporting in presence of large deletions.

Currently, two cases are possible:
1) The query aligns as 100M600D100M somewhere in the reference. Coverage values for the big deletion in the middle are then missing (the reference region is treated as not covered by the query).
2) The query aligns as 100M599D100M somewhere in the reference. Coverage values for the big deletion in the middle are then present (the reference region is treated as covered by the query).

The threshold of 600 deleted positions is somewhat arbitrary.
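
To make the two cases concrete, here is a minimal sketch of the cutoff behaviour. It is not the code in `consensus_aligner.py`: the names `MAX_GAP_SIZE` and `covered_positions` are hypothetical, and it assumes that a deletion of at least the threshold length is reported as uncovered while a shorter one is reported as covered.

```python
import re

# Matches one CIGAR operation, e.g. "100M" or "600D".
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

# Hypothetical constant standing in for the threshold discussed above.
MAX_GAP_SIZE = 600


def covered_positions(cigar, ref_start=0, max_gap=MAX_GAP_SIZE):
    """Return the set of 0-based reference positions reported as covered.

    Assumption: deletions shorter than max_gap count as covered,
    deletions of max_gap or longer are left out of the coverage.
    """
    covered = set()
    pos = ref_start
    for length, op in CIGAR_OP.findall(cigar):
        length = int(length)
        if op in "M=X":
            # Aligned bases always count as covered.
            covered.update(range(pos, pos + length))
            pos += length
        elif op in "DN":
            # Deleted reference positions count only below the threshold.
            if length < max_gap:
                covered.update(range(pos, pos + length))
            pos += length
        # I, S, H and P consume no reference positions.
    return covered


print(len(covered_positions("100M600D100M")))
print(len(covered_positions("100M599D100M")))
```

A single base of difference in the deletion length changes the reported coverage from 200 positions to 799, which is the discontinuity this issue is about.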

We would like to develop a better decision procedure for what to report as "coverage", possibly one that looks at the individual reads (from the FASTQ files) to see whether reads actually spanned the big deletion, or whether the query is two separate consensus sequences "stitched" together.
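
One possible shape for such a read-level check, as a rough sketch only. It assumes the reads have already been mapped to the reference and are available as `(ref_start, cigar)` pairs. The function names and the `min_flank` parameter are hypothetical and not part of MiCall.

```python
import re

# Matches one CIGAR operation, e.g. "20M" or "600D".
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def reference_end(ref_start, cigar):
    """Return the 0-based exclusive end of an alignment on the reference."""
    pos = ref_start
    for length, op in CIGAR_OP.findall(cigar):
        if op in "MDN=X":  # operations that consume reference positions
            pos += int(length)
    return pos


def count_spanning_reads(reads, del_start, del_end, min_flank=10):
    """Count reads whose alignment extends past both ends of the deletion.

    reads: iterable of (ref_start, cigar) pairs (assumed input format).
    del_start, del_end: 0-based half-open interval of the deletion.
    min_flank: hypothetical minimum number of reference positions a read
        must extend beyond each side of the deletion.
    """
    count = 0
    for ref_start, cigar in reads:
        ref_end = reference_end(ref_start, cigar)
        if ref_start <= del_start - min_flank and ref_end >= del_end + min_flank:
            count += 1
    return count


reads = [
    (80, "20M600D20M"),  # bridges the deletion at 100..700
    (0, "100M"),         # ends where the deletion starts
    (700, "100M"),       # starts where the deletion ends
]
print(count_spanning_reads(reads, del_start=100, del_end=700))
```

If no read bridges the deletion, the query is more likely two separately assembled pieces joined together, and reporting the gap as uncovered would be the safer choice. How many spanning reads should be required is an open question.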

Donaim commented 3 weeks ago

The current threshold is defined here: https://github.com/cfe-lab/MiCall/blob/5d90205006c9299434d4f555fbcc62fb48c65ba4/micall/utils/consensus_aligner.py#L34