corneliusroemer / pango-sequences

Consensus sequences for each Pango lineage

This repository contains semi-automatically generated prototype sequences for each Pango lineage. These are not real sequences from databases; they are algorithmically constructed consensus sequences that try to represent the common ancestor sequence of each lineage. They are based on the sequences designated in the cov-lineages/pango-designation repository. Some manual curation is involved to overwrite erroneous sites. Errors happen when a lineage's designated sequences have dropout or reversions. The algorithm used to create these sequences sets a high threshold for accepting a reversion: almost all of a lineage's designated sequences must show the reversion, otherwise it is assumed to be an artefact.
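The asymmetric treatment of reversions described above can be illustrated with a minimal sketch. This is not the code used to build the sequences in this repository (that lives in nextclade_data_workflows/sars-cov-2). The function name `consensus_mutations`, the representation of sequences as sets of mutation strings, the threshold values 0.5 and 0.95, and the toy input are all assumptions made for illustration.

```python
from collections import Counter


def consensus_mutations(parent, designated,
                        gain_threshold=0.5, reversion_threshold=0.95):
    """Toy consensus call for one lineage (hypothetical helper).

    parent: set of mutations (relative to the reference) of the parent lineage.
    designated: list of mutation sets, one per designated sequence.
    Both thresholds are illustrative assumptions, not the real values.
    """
    if not designated:
        # Nothing to vote with: fall back to the parent's mutations.
        return set(parent)

    n = len(designated)
    counts = Counter(m for seq in designated for m in set(seq))
    consensus = set()

    # Inherited mutations are dropped (reverted) only if almost all
    # designated sequences lack them; otherwise the apparent reversion
    # is treated as an artefact, e.g. amplicon dropout.
    for m in parent:
        missing_fraction = 1 - counts[m] / n
        if missing_fraction < reversion_threshold:
            consensus.add(m)

    # New mutations only need to be present in a majority of sequences.
    for m, c in counts.items():
        if m not in parent and c / n > gain_threshold:
            consensus.add(m)

    return consensus


# Made-up toy input: one of four sequences lacks C241T (likely dropout).
parent = {"C241T", "A23403G"}
designated = [
    {"C241T", "A23403G", "G22578A"},
    {"A23403G", "G22578A"},
    {"C241T", "A23403G", "G22578A"},
    {"C241T", "A23403G"},
]
result = consensus_mutations(parent, designated)

# If every sequence lacks C241T, the reversion is accepted.
all_reverted = consensus_mutations(parent, [{"A23403G"}] * 4)
```

In the first call the inherited C241T is kept even though one sequence lacks it, while in the second call it is dropped because every sequence lacks it. Real data would additionally need to distinguish missing coverage from a true reference base, which this sketch does not attempt.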

Please be aware that due to the semi-automatic generation of these synthetic sequences, they can contain errors.

The data in this repo is automatically updated every day.

For lineages that are derived from BA.2, BA.4 and BA.5*, there has been significant curation/overwriting, but for previous lineages the amount of curation is very limited. So do expect errors.

If you find errors, please open an issue here. The same applies if you have ideas about what other lineage data you would like to see included here.

The sequences contained here are the ones used in Nextclade reference trees and produced by code contained in the nextclade_data_workflows/sars-cov-2 repository.

Contents

The repository contains:

Caveats

If you need to be sure that a sequence is correct, e.g. when you're creating a Spike protein for an experiment, please double-check it against the pango designation issue (if such an issue exists) and the annotation on the UShER tree, which is independently curated by @AngieHinrichs. Also, see https://github.com/ucscGenomeBrowser/kent/blob/master/src/hg/utils/otto/sarscov2phylo/pango.clade-mutations.tsv for the paths extracted from the UShER tree for each lineage.