cov-lineages / pangolin

Software package for assigning SARS-CoV-2 genome sequences to global lineages.
GNU General Public License v3.0

beginning and end of reference masked? #529

Closed smsaladi closed 10 months ago

smsaladi commented 10 months ago

Hello:

Looking at sequences.withref.fa, it appears that the first 265 bases and the last 230 bases of the reference genome are masked, and that the same regions are masked in any sequences analyzed with pangolin. Would you mind sharing why this is done? It's just for our understanding.

https://github.com/cov-lineages/pangolin/blob/2f2756/pangolin/data/reference.fasta

If I should post this somewhere else, e.g. a listserv, please let me know.

Thanks!

AngieHinrichs commented 10 months ago

Hi @smsaladi -- yes, the 5' and 3' untranslated regions (UTRs) of the sequences are masked so that only the protein-coding middle portion is considered when assigning lineages. @aineniamh and @rmcolq can correct me if I'm wrong but I think there were two reasons for this:

  1. Sequencing coverage is generally lower at the beginning and end of the genome, and sequencing errors are more common there. However, this doesn't apply to the full UTRs: for example, the Problematic Sites set masks only the first 55 and last 100 bases.

  2. Changes to the protein-coding regions are considered more important because they are more likely to have a functional effect.
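
For readers who want to reproduce or check the masking described above, here is a minimal Python sketch. It is not pangolin's actual code: the function names `mask_utrs` and `masked_run_lengths` are hypothetical, the lengths 265 and 230 are taken from the question in this thread, and it assumes that "masked" means the bases are replaced with `N` and that the input sequence is already aligned to reference coordinates.

```python
from typing import Tuple

# Illustrative only: these names are hypothetical, not part of pangolin.
UTR5_LEN = 265  # masked bases at the 5' end, as reported in this thread
UTR3_LEN = 230  # masked bases at the 3' end, as reported in this thread


def mask_utrs(aligned_seq: str, n5: int = UTR5_LEN, n3: int = UTR3_LEN) -> str:
    """Replace the first n5 and last n3 characters with 'N'.

    Assumes aligned_seq is already in reference coordinates,
    so that string positions correspond to reference positions.
    """
    if len(aligned_seq) < n5 + n3:
        raise ValueError("sequence is shorter than the two masked regions combined")
    # Keep only the middle portion and pad both ends with N.
    middle = aligned_seq[n5:len(aligned_seq) - n3]
    return "N" * n5 + middle + "N" * n3


def masked_run_lengths(seq: str) -> Tuple[int, int]:
    """Return the lengths of the leading and trailing runs of 'N'."""
    upper = seq.upper()
    leading = len(upper) - len(upper.lstrip("N"))
    trailing = len(upper) - len(upper.rstrip("N"))
    return leading, trailing
```

Calling `masked_run_lengths` on a record from sequences.withref.fa is one way to confirm how many bases are masked at each end. Note that a sequence with its own missing data next to a masked region will report a longer run than the mask alone.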

smsaladi commented 10 months ago

Thanks, that makes sense - very helpful!