cov-lineages / lineages-website


lineages.yml no longer being updated? #21

Closed joshuailevy closed 1 year ago

joshuailevy commented 1 year ago

Hi!

I've noticed that data/lineages.yml hasn't been updated in almost a month. Is this file going to continue to be updated (presumably by running grinch)?

Thanks again for this great resource! Josh

aineniamh commented 1 year ago

Ah okay, just noticing this now. Our pipelines have been running to completion every three days, so this shouldn't be the case. I wonder if the file got too big, so the GitHub portion of it is breaking. I'll look into this! Thanks for flagging!

joshuailevy commented 1 year ago

Ok great! Thanks for looking into it!

aineniamh commented 1 year ago

Okay, yes it's exactly that problem- _data/lineage_data.json is now >70MB and github's largest file size is 50MB.

aineniamh commented 1 year ago

Okay, I've set git-lfs to track that particular file and reverted the commits that prevented it from pushing to github before. This should hopefully work from now on, but I'll keep an eye on it! 🤞

joshuailevy commented 1 year ago

Awesome, thanks again for sorting this out! Hopefully the next update doesn't push any other files over the limit :)

joshuailevy commented 1 year ago

Seems the updates might still be getting stuck?

jubileepower commented 1 year ago

With the latest pangolin data v1.16 release, we'd highly appreciate the yaml file to gain access to the hierarchy information. Many thanks!

aineniamh commented 1 year ago

I think it's a permissions issue, as I configured github lfs in root, but it's @rmcolq's account that's set up to push to it, so it didn't take. If it's just for the hierarchy, you can always use the installable alias.json file in the pango-designation repo. Pangolin also has a flag that prints out that alias file too, if that helps!

joshuailevy commented 1 year ago

Looks like things kind of went through? The lineages.yml file only includes A lineages though.

aineniamh commented 1 year ago

https://media.githubusercontent.com/media/cov-lineages/lineages-website/master/_data/lineage_data.json The file exists here I believe- we're still struggling to get github pages to play with the lfs though. Rachel has been debugging!

rmcolq commented 1 year ago

The page is live again, but restricted to lineages seen in the last year. We will evaluate whether this is a problem over the coming weeks.

TheZetner commented 1 year ago

This is better but we're still seeing a number of recent missing lineages (eg. BQ.1.1 descendants past BQ.1.1.13) compared to the latest pango update. There seems to be a good amount of trouble and interest surrounding this hierarchy of PANGO lineages. Maybe not the right place to make the suggestion but this relationship information has to exist in pangolin, right? Perhaps there should be a command line option like pangolin --hierarchy to output a simple YAML or Newick like we're seeing in multiple places online.

ktmeaton commented 1 year ago

Through this discussion, I discovered @corneliusroemer's library pango_aliasor which provides a super speedy way to get the parent-child relationships with proper aliases. I'm testing this as an alternative to using the lineages.yml file for the hierarchy. So far working well!

joshuailevy commented 1 year ago

In case it's of interest, I just threw together a quick script to assemble a hierarchy from some of the existing pango-designation resources. An up-to-date version of the output is available at https://github.com/outbreak-info/outbreak.info/blob/master/curated_reports_prep/lineages.yml.

Attachments: lineages.yml.txt, lineages_hierarchy.py.txt

rmcolq commented 1 year ago

The recent lineages are missing because they have not been called by pangolin yet. Nextclade has started assigning Pango lineages using its own algorithm based on the master/tip of pango-designation, which might be the source of some of the confusion. However, the usher/pangoLEARN models for pangolin are trained on releases, and this webpage uses pangolin assignments.

Edit: I'm going to backtrack on this. This is why the filter was excluding them (when pangolin was run, no sequences were assigned to them), but they were appearing on the website because the website generation scripts were using the master of pango-designation.

corneliusroemer commented 1 year ago

All you need for the hierarchy is pango-designation/pango-designation/alias_key.json

This is the sole source of truth for hierarchies. Nothing else is needed.
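To illustrate how the alias key alone determines the hierarchy, here is a minimal Python sketch. The `ALIAS_KEY` dictionary is a small hand-copied excerpt for illustration, not the full file; in practice you would load `alias_key.json` with `json.load`. The function names `unalias` and `parent_of` are hypothetical helpers, not part of pangolin or pango-designation, and the handling of recombinant entries (which I believe are stored as lists of parents) is an assumption.

```python
# Illustrative excerpt only; the real alias_key.json has many more entries.
ALIAS_KEY = {
    "A": "",
    "B": "",
    "BA": "B.1.1.529",
    "BE": "B.1.1.529.5.3.1",
    "BQ": "B.1.1.529.5.3.1.1.1.1",
}


def unalias(lineage, alias_key):
    """Expand the leading alias (e.g. 'BA') into its full dotted name."""
    prefix, _, rest = lineage.partition(".")
    expanded = alias_key.get(prefix, "")
    # Top-level lineages (A, B) map to "" and need no expansion.
    # Assumption: recombinant entries are lists of parents; leave them as-is.
    if not isinstance(expanded, str) or expanded == "":
        return lineage
    return f"{expanded}.{rest}" if rest else expanded


def parent_of(lineage, alias_key):
    """Return the parent in full (unaliased) form, or None for a root."""
    full = unalias(lineage, alias_key)
    if "." not in full:
        return None
    return full.rsplit(".", 1)[0]


print(unalias("BQ.1.1", ALIAS_KEY))
print(parent_of("BA.5", ALIAS_KEY))
```

The parent is returned in its full form; mapping it back to the shortest alias would need the reverse lookup, which is omitted here to keep the sketch short.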

For those of you who commented above: what information do you need that you think is only available on the website? I may be able to point you to a more convenient resource for your needs.

@joshuailevy @jubileepower @TheZetner

joshuailevy commented 1 year ago

Agreed, once you have a list of all of the lineages and the alias key, the hierarchy is evident. However, some queries are much faster and more straightforward when you have an explicit hierarchy.

For example, if you want to look at all descendants of BA.5, you can of course convert all of the short-form names to complete names (e.g. B.1.1.529.5.3.1) and then do some string comparisons to pull all of the descendants, but if this is an operation you're doing a lot, there's a lot of wasted computation (especially given the number of lineages these days).

Similarly, the reverse operation (mapping from lineage to possibly distant parent) is easy, just requiring mapping the aliases and checking if a lineage descends from another lineage, but it's generally faster to do this check using a prebuilt hash/dictionary.

These are things that happen quite a bit in outbreak.info and freyja, but may not be as generally useful to other folks.
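The trade-off described above can be sketched in Python as follows. This is a rough illustration, not the code used in outbreak.info or freyja: the `FULL_NAMES` table is a tiny hypothetical sample of lineages with their already-unaliased names, and both function names are made up for the example. The first function rescans every lineage with a string comparison on each query; the second builds a dictionary once so that later lookups are a single hash access.

```python
from collections import defaultdict

# Hypothetical sample: short lineage name -> full (unaliased) name.
FULL_NAMES = {
    "B.1.1.529": "B.1.1.529",
    "BA.2": "B.1.1.529.2",
    "BA.5": "B.1.1.529.5",
    "BA.5.3": "B.1.1.529.5.3",
    "BE.1": "B.1.1.529.5.3.1.1",
    "BQ.1.1": "B.1.1.529.5.3.1.1.1.1.1.1",
}


def descendants_by_scan(target, full_names):
    """Find descendants by comparing every full name against a prefix."""
    # The trailing dot stops B.1.1.529.5 from matching e.g. B.1.1.529.50.
    prefix = full_names[target] + "."
    return sorted(s for s, f in full_names.items() if f.startswith(prefix))


def build_descendant_index(full_names):
    """Precompute ancestor -> set of descendants in one pass."""
    full_to_short = {f: s for s, f in full_names.items()}
    index = defaultdict(set)
    for short, full in full_names.items():
        parts = full.split(".")
        # Every proper prefix of the full name is an ancestor.
        for i in range(1, len(parts)):
            ancestor = full_to_short.get(".".join(parts[:i]))
            if ancestor is not None:  # skip ancestors not in the table
                index[ancestor].add(short)
    return index


index = build_descendant_index(FULL_NAMES)
print(descendants_by_scan("BA.5", FULL_NAMES))
print("BQ.1.1" in index["BA.5"])  # reverse check: does BQ.1.1 descend from BA.5?
```

With the index built once, both "all descendants of X" and "does Y descend from X" become dictionary and set lookups, at the cost of the memory needed to store every ancestor-descendant pair.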

jubileepower commented 1 year ago

@corneliusroemer Yes, and thanks for checking. I realized I can get the information I need when Áine mentioned the alias key in JSON format. I planned to script that with the list of lineages on Monday to extract what I need, as they are the upstream info from pangolin. @joshuailevy seems to have already saved me some time, thanks! I am in a similar situation to the one Josh mentioned. My tools either label parent-child and sibling relations of lineage reassignment between models, or model the weekly proportions of aggregated variants (mutually exclusive descendant lineages) for near-term predictions.

aineniamh commented 1 year ago

@jubileepower I meant to say actually I saw your lightning talk at VGE and very much enjoyed it! 🎉

TheZetner commented 1 year ago

Thanks for the feedback and resources @corneliusroemer and @joshuailevy. I have a similar use case as @jubileepower for genomic surveillance of individual lineages and their descendants. She's beaten me to the punch on it and I'll be cribbing off her work for my purposes. Thanks all!