PoonLab / covizu

Rapid analysis and visualization of coronavirus genome variation
https://filogeneti.ca/CoVizu/
MIT License

Lineage-specific alignment #494

Closed ArtPoon closed 10 months ago

ArtPoon commented 10 months ago

We align each genome to the WH1 reference genome using minimap2 and record every mutational difference from the reference as a "feature". This was a reasonable approximation in the early stages of the pandemic. However, circulating genomes can now carry more than 100 mutations away from this reference, and there is a substantial number of recurrent mutations (homoplasies).
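For anyone reading along, here is a minimal sketch of what "extracting features" means. It assumes the query has already been projected onto reference coordinates (same length, `-` for deleted bases) and ignores insertions. The function name and the tuple encoding (`'~'` for substitutions, `'-'` for deletions) are illustrative assumptions, not necessarily what the pipeline does.

```python
def extract_features(ref, query):
    """Sketch: list differences of an aligned query from the reference.

    Assumes both strings have the same length and that '-' in the query
    marks a deleted base. Insertions are not handled here.
    """
    if len(ref) != len(query):
        raise ValueError("ref and query must be aligned to the same length")

    features = []
    i = 0
    while i < len(ref):
        if query[i] == '-':
            # merge a run of gap characters into a single deletion feature
            start = i
            while i < len(ref) and query[i] == '-':
                i += 1
            features.append(('-', start, i - start))
            continue
        if query[i] != ref[i] and query[i] != 'N':
            # substitution; uncalled bases (N) are skipped
            features.append(('~', i, query[i]))
        i += 1
    return features


# toy example, 0-based positions
print(extract_features("ACGTACGT", "ACCT--GT"))
# [('~', 2, 'C'), ('-', 4, 2)]
```

The further a genome is from the reference, the longer this list gets, which is the problem described above.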

Since we are using PANGO lineage assignments to partition genomes into separate sets at an early stage of the analysis workflow, it should be possible to use a lineage-defining genome sequence (or even the consensus of multiple such genomes) as a lineage-specific reference. The rest of the analysis should stay the same, except that the feature vectors would be much shorter.
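A rough sketch of the consensus idea, assuming the genomes of a lineage are already aligned to a common coordinate system. The helper names and toy sequences are made up for illustration, and a simple per-column majority rule stands in for whatever consensus method would actually be used.

```python
from collections import Counter


def majority_consensus(aligned):
    """Sketch: per-column majority consensus of equal-length aligned sequences."""
    if len(set(len(s) for s in aligned)) != 1:
        raise ValueError("sequences must be aligned to the same length")
    consensus = []
    for column in zip(*aligned):
        # most frequent character in this alignment column
        consensus.append(Counter(column).most_common(1)[0][0])
    return ''.join(consensus)


def count_differences(ref, seq):
    """Number of positions where seq differs from ref."""
    return sum(r != s for r, s in zip(ref, seq))


# hypothetical toy data: a distant reference and three genomes of one lineage
distant_ref = "AAGTAA"
lineage = ["ACGTTA", "ACGTTA", "ACGTCA"]

lineage_ref = majority_consensus(lineage)
print(lineage_ref)                                            # ACGTTA
print([count_differences(distant_ref, s) for s in lineage])   # [2, 2, 2]
print([count_differences(lineage_ref, s) for s in lineage])   # [0, 0, 1]
```

Against the lineage consensus only the within-lineage variation remains, so each genome carries far fewer features.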

In fact, we are already selecting a representative genome for each lineage for building the time-scaled tree as a navigation tool. The tricky part is that we need to have the reference genome for any given lineage available at the initial data feed processing step.

ArtPoon commented 10 months ago

On the other hand, the shared features of genomes of a given lineage shouldn't affect the clustering (neighbor-joining) analysis, and managing thousands of reference genomes is probably more trouble than it's worth.
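To illustrate why shared features should not matter: if the distance between two genomes is taken to be the size of the symmetric difference of their feature sets (an assumption for this sketch, as are the function name and toy features), then any feature carried by both genomes cancels out.

```python
def symdiff_distance(a, b):
    """Sketch: distance as the number of features present in only one genome."""
    return len(set(a) ^ set(b))


# hypothetical features shared by every genome in a lineage
shared = {('~', 240, 'T'), ('~', 3036, 'T')}

# two genomes that each carry one private feature on top of the shared ones
genome1 = shared | {('~', 100, 'A')}
genome2 = shared | {('-', 500, 3)}

with_shared = symdiff_distance(genome1, genome2)
without_shared = symdiff_distance(genome1 - shared, genome2 - shared)

print(with_shared, without_shared)  # 2 2
```

The distance matrix passed to neighbor-joining is the same either way, so a lineage-specific reference would mostly save memory and storage rather than change the clustering.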