Closed smsaladi closed 10 months ago
Hi @smsaladi -- yes, the 5' and 3' untranslated regions (UTRs) of the sequences are masked so that only the protein-coding middle portion is considered when assigning lineages. @aineniamh and @rmcolq can correct me if I'm wrong but I think there were two reasons for this:
1. Sequencing coverage is generally lower at the beginning and end of the genome, and sequencing errors are more common there. However, this doesn't apply to the full UTRs. For example, the Problematic Sites set masks only the first 55 and last 100 bases.
2. Changes to the protein-coding regions are considered more important because they are more likely to have a functional effect.
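In case it helps to see what this kind of end masking amounts to, here is a minimal sketch in Python. It is not pangolin's actual code: the function names `mask_ends` and `count_masked_ends` are hypothetical, and the default lengths of 265 and 230 are simply the numbers reported in the question.

```python
def mask_ends(seq: str, n_start: int = 265, n_end: int = 230,
              mask_char: str = "N") -> str:
    """Replace the first n_start and last n_end characters with mask_char.

    Hypothetical helper for illustration only. The defaults are the counts
    reported in the question, not values read from pangolin itself.
    """
    if n_start < 0 or n_end < 0:
        raise ValueError("mask lengths must be non-negative")
    # If the two masks meet or overlap, the whole sequence is masked.
    if n_start + n_end >= len(seq):
        return mask_char * len(seq)
    middle = seq[n_start:len(seq) - n_end]
    return mask_char * n_start + middle + mask_char * n_end


def count_masked_ends(seq: str, mask_char: str = "N") -> tuple[int, int]:
    """Return the lengths of the leading and trailing runs of mask_char."""
    upper = seq.upper()
    leading = len(upper) - len(upper.lstrip(mask_char))
    trailing = len(upper) - len(upper.rstrip(mask_char))
    return leading, trailing


if __name__ == "__main__":
    toy_genome = "ACGT" * 250  # 1000-base toy sequence, not real data
    masked = mask_ends(toy_genome)
    print(count_masked_ends(masked))
```

Running `count_masked_ends` on a record from an alignment file is one way to check how many bases are masked at each end. Note that it counts any run of `N` at the ends, so a sequence that already has ambiguous bases next to the masked region will report a longer run.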
Thanks makes sense - very helpful!
Hello:
It looks like 265 bases at the beginning and 230 at the end of the reference genome are masked (along with any sequences analyzed with pangolin) when looking at `sequences.withref.fa`. Would you mind sharing why this is done - just for our understanding? https://github.com/cov-lineages/pangolin/blob/2f2756/pangolin/data/reference.fasta
If I should post this somewhere else, e.g. a listserv, please let me know.
Thanks!