corneliusroemer / pango-sequences

Consensus sequences for each Pango lineage

Missing lineages from pango-consensus-sequences_genome-nuc.fasta.zst #7

Open peterwc-cdc opened 3 months ago

peterwc-cdc commented 3 months ago

Current Behavior
Some lineages that are present in the JSON appear to be missing from the https://github.com/corneliusroemer/pango-sequences/blob/main/data/pango-consensus-sequences_genome-nuc.fasta.zst file.

A total of 1222 lineages appear to be missing.

Expected behavior
Is there supposed to be one representative for each lineage?

How to reproduce
Steps to reproduce the current behavior:
Compare the lineages listed in the JSON summary file with the sequences in pango-consensus-sequences_genome-nuc.fasta.zst.
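For illustration, the comparison can be sketched roughly as follows. This is a minimal sketch under assumptions that are not confirmed by the repository: that the JSON summary is an object keyed by lineage name, that each FASTA header starts with the lineage name, and that the .zst file has already been decompressed to plain text. The helper names `fasta_names` and `missing_lineages` and the toy data are hypothetical.

```python
import json


def fasta_names(fasta_text):
    """Return the set of record names (first token after '>') in FASTA text."""
    names = set()
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            parts = line[1:].split()
            if parts:
                names.add(parts[0])
    return names


def missing_lineages(summary_json_text, fasta_text):
    """List lineages that are keys in the JSON but have no FASTA record.

    Assumes (hypothetically) that the JSON top level is an object whose
    keys are lineage names.
    """
    summary = json.loads(summary_json_text)
    return sorted(set(summary) - fasta_names(fasta_text))


# Toy data standing in for the real files (not taken from the repository).
example_json = '{"B.1": {}, "BA.2": {}, "XBB.1.5": {}}'
example_fasta = ">B.1\nACGT\n>XBB.1.5\nACGT\n"

print(missing_lineages(example_json, example_fasta))  # ['BA.2']
```

With the real files, the two text arguments would be the contents of the JSON summary and the decompressed FASTA; the length of the returned list is the number of missing lineages.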

Possible solution
Are these supposed to be missing? If so, we can accept that, but it would be nice if they could be added.

Your environment: if browsing Nextstrain online
Downloading and using the data file from GitHub

Let me know if you would like a complete list.

peterwc-cdc commented 2 months ago

@corneliusroemer Is this an issue that you have noticed?

corneliusroemer commented 2 months ago

Hi Peter, thanks for opening the issue (always welcome!) and sorry for my delay in replying (I am still at a workshop).

I currently only include lineages that have at least 3 genomes available in GenBank, which often means that newer lineages won't be included. I'll check whether there's an additional bug.

It would be great if you could share the full list so I can check if there's anything unexpected showing up!