cov-lineages / lineages

Resources for calling and describing the circulating lineages of SARS-CoV-2

Some coding region SNPs are not represented in anonymised.aln.fasta.treefile #2

Closed joshsinger closed 4 years ago

joshsinger commented 4 years ago

I think this is occurring at multiple locations throughout anonymised.aln.fasta.treefile, but here is one example.

In the lineage B.2.1 there are 5 representative sequences according to the CSV: EPI_ISL_419791, EPI_ISL_419792, EPI_ISL_419793, EPI_ISL_419794, EPI_ISL_419797

Excluding 5' and 3' UTRs, these 5 sequences share the following 4 SNPs relative to Wuhan-Hu-1 (EPI_ISL_402125): G26144T, G11083T, C2558T, C14805T
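For anyone wanting to reproduce this kind of SNP list, here is a minimal Python sketch of calling SNPs between two aligned sequences and writing them in the `G26144T` style used above. The function name `snps_relative_to_reference` and the toy sequences are hypothetical, for illustration only; it assumes both sequences come from the same alignment (equal length), uses 1-based positions, and skips gaps and ambiguous bases. It does not trim UTRs.

```python
def snps_relative_to_reference(ref, seq):
    """Return SNPs as strings like 'G3T' (ref base, 1-based position, alt base).

    Assumes ref and seq are rows of the same alignment, so they have equal length.
    Positions where either sequence has a gap or a non-ACGT character are skipped.
    """
    if len(ref) != len(seq):
        raise ValueError("sequences must be aligned to the same length")
    valid = set("ACGT")
    snps = []
    for pos, (r, s) in enumerate(zip(ref.upper(), seq.upper()), start=1):
        if r in valid and s in valid and r != s:
            snps.append(f"{r}{pos}{s}")
    return snps


# Toy example (not real SARS-CoV-2 data).
toy_ref = "ACGTACGT"
toy_seq = "ACTTACGA"
print(snps_relative_to_reference(toy_ref, toy_seq))
```

The SNPs shared by a set of sequences are then the intersection of their individual SNP sets.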

There are then 4 additional non-shared SNPs in these sequences:

A21137G - EPI_ISL_419792
G23984A - EPI_ISL_419794
A2480G - EPI_ISL_419797
C19763T - EPI_ISL_419797

So, we would expect to see two sequences (EPI_ISL_419791, EPI_ISL_419793) on zero-length branches from the root of this lineage, two sequences each on a branch representing a single SNP (EPI_ISL_419792, EPI_ISL_419794), and finally one sequence on a branch representing two SNPs (EPI_ISL_419797).
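The expectation can be checked with a short Python sketch. It uses only the SNPs and sequence IDs listed in this comment. The function names are hypothetical, and it assumes each non-shared SNP occurs in exactly one sequence, so every sequence hangs directly off the lineage root and its expected terminal branch length (in SNPs) is the number of SNPs it does not share with all the others.

```python
from collections import defaultdict

# SNPs shared by all 5 representative sequences (from this comment).
SHARED = frozenset({"G26144T", "G11083T", "C2558T", "C14805T"})

# Additional non-shared SNPs per sequence (from this comment).
PRIVATE = {
    "EPI_ISL_419791": set(),
    "EPI_ISL_419792": {"A21137G"},
    "EPI_ISL_419793": set(),
    "EPI_ISL_419794": {"G23984A"},
    "EPI_ISL_419797": {"A2480G", "C19763T"},
}


def expected_branch_snps(shared, private):
    """Number of SNPs each sequence carries beyond those common to all sequences."""
    full = {name: set(shared) | set(snps) for name, snps in private.items()}
    common = set.intersection(*full.values())
    return {name: len(snps - common) for name, snps in full.items()}


def group_by_branch_length(counts):
    """Group sequence names by their expected terminal branch length in SNPs."""
    groups = defaultdict(list)
    for name, n in counts.items():
        groups[n].append(name)
    return {n: sorted(names) for n, names in sorted(groups.items())}


counts = expected_branch_snps(SHARED, PRIVATE)
for n, names in group_by_branch_length(counts).items():
    print(f"{n} SNP(s): {', '.join(names)}")
```

This only encodes the expectation; comparing it against the branch lengths in the treefile would additionally need a Newick parser.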

But in anonymised.aln.fasta.treefile we actually see this:

[Image: Screenshot 2020-05-03 at 10 33 35]

i.e. 3 sequences on zero length branches, and only one on a single SNP branch.

aineniamh commented 4 years ago

I'm referring this to #6, where it's already being discussed, and closing this issue.