corneliusroemer opened this issue 3 years ago
Speed is one of the main reasons. In an ideal world, I'd love to be running both approaches and cross-referencing them to give a full-picture lineage assignment. Conflicting assignments would also help us to troubleshoot.
The GISAID assignment pipeline runs the entire database through the latest model each time it's updated (as it should) and it wouldn't be feasible to run the UShER mode of pangolin as the main mode for this.
In terms of which is most up-to-date, the UShER and pangoLEARN models should be updated on the same timeframe: @AngieHinrichs updates the protobuf file whenever we run a new model training, as far as I'm aware. Models are trained on a biweekly basis, but we plan to move towards doing this once a week or in line with new designations.
The latest pangoLEARN training has actually had a big increase in accuracy too, with developments to the training pipeline cutting out a lot of the noise from designations that are missing key SNPs etc, and we're exploring other approaches to assigning lineages going forward. We don't have any major attachment to it for continuity reasons: if we (or others) find another approach that's better and can assign that many sequences in a similar timeframe, we'd be happy to incorporate it! @emilyscher has put a lot of work into trialling alternative ML models and the decision tree keeps coming out on top!
I would add that @AngieHinrichs does usually push the latest trained usher protobuf before we push the trained pangoLEARN model and release. Some of the delay with us releasing is to do with running checks - we are aware that the moment we release a pangoLEARN model, many people pull in the changes immediately. We like to be aware of any lineages which see big jumps in numbers assigned to/from them with each release before it explodes on twitter and to investigate them if unexpected.
Thank you @aineniamh and @rmcolq. @corneliusroemer, the pangolin --usher assignments use an UShER tree file in the pangoLEARN repository and I've been trying to keep that in sync with the version of pango-designation used to train pangoLEARN before each pangoLEARN release. You might see some new lineage assignments appearing in advance of pangoLEARN releases (which go through more testing as Rachel said) in the UShER web interface.
Another general observation: UShER is much better at calling more recent lineages (say from ~June 2021) than earlier lineages (say ~March 2021 and before) since more recent lineages have been defined with input from the UCSC/UShER tree, so they are monophyletic on that tree, while earlier lineages were called on COG-UK trees and some are not monophyletic on the UCSC tree. There are some lineages (20ish?) that just can't be assigned satisfactorily to a node of the UCSC/UShER tree, so can't be assigned by pangolin --usher. I've been meaning for months to ask for some lineages to be withdrawn & others updated in pango-designation but... time & priorities.
Regarding speed, it's true that --usher mode is significantly slower than pangoLEARN, although it can take better advantage of multiple CPU cores if elapsed time is the highest priority. Here are the results of a quick comparison: for a set of 10,000 randomly selected lineage representative sequences, using the --skip-designation-hash option to make sure that pangoLEARN or usher was run for each sequence, these are the approximate run times in minutes averaged across 3 runs at several thread counts (-t):
Threads (-t) | default (pangoLEARN), minutes | --usher, minutes |
---|---|---|
1 (default) | 5 | 80 |
4 | 3.5 | 30.5 |
16 | 3.1 | 12.3 |
64 | 2.8 | 6.6 |
GISAID would be running pangolin without the --skip-designation-hash option of course. The designation hash would bypass pangoLEARN/usher for the 788k representative sequences in pango-designation/lineages.csv. GISAID currently has 4,456,471 sequences, and of course that will continue to grow. For simplicity's sake, say ~3.7 million sequences need to be evaluated by pangoLEARN or usher after the next pangoLEARN release. Multiplying the per-10k run times in the table above by 370 (3.7M / 10k), we get a very back-of-the-envelope estimate of the minutes needed for 3.7 million sequences:
Threads (-t) | default (pangoLEARN), minutes | --usher, minutes |
---|---|---|
1 (default) | 1850 | 29600 |
4 | 1295 | 11285 |
16 | 1147 | 4551 |
64 | 1036 | 2442 |
So, wandering further into back-of-the-envelope land, how do those translate into $ on e.g. an Amazon m5n.16xlarge (64 CPUs, $0.6582 per hour spot instance = $0.01097 per minute), assuming for simplicity 64 1-thread jobs, 16 4-thread jobs and so on running simultaneously?
Threads (-t) | default (pangoLEARN) | --usher |
---|---|---|
1 (default) | 1850 * 0.01097 / 64 = $0.317102 | 29600 * 0.01097 / 64 = $5.073625 |
4 | 1295 * 0.01097 / 16 = $0.887884 | 11285 * 0.01097 / 16 = $7.737278 |
16 | 1147 * 0.01097 / 4 = $3.145648 | 4551 * 0.01097 / 4 = $12.481117 |
64 | 1036 * 0.01097 = $11.364920 | 2442 * 0.01097 = $26.788740 |
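For reference, the arithmetic behind both tables is just this (a quick Python sketch; the per-10k timings and spot price are copied from above, and it assumes the jobs pack perfectly onto the instance's 64 CPUs):

```python
# Back-of-the-envelope reproduction of the two tables above.
minutes_per_10k = {       # threads: (pangoLEARN minutes, --usher minutes) per 10k sequences
    1:  (5.0, 80.0),
    4:  (3.5, 30.5),
    16: (3.1, 12.3),
    64: (2.8, 6.6),
}
n_seqs = 3_700_000
price_per_minute = 0.6582 / 60   # m5n.16xlarge spot price, ~$0.01097 per minute
cpus = 64

for threads, times in minutes_per_10k.items():
    jobs_in_parallel = cpus // threads           # e.g. 64 one-thread jobs at once
    for name, mins_10k in zip(("pangoLEARN", "--usher"), times):
        total_minutes = mins_10k * n_seqs / 10_000
        cost = total_minutes * price_per_minute / jobs_in_parallel
        print(f"{threads:>2} threads, {name:<10}: {total_minutes:>7.0f} min -> ${cost:.2f}")
```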
That was very hasty and I probably made some mistakes in there (not to mention oversimplifying), but hopefully not off by too many orders of magnitude. I have no idea what kind of compute resources GISAID uses, but if they are already using a cloud, then I think it shouldn't break the bank for GISAID to use pangolin --usher -t 64, even if they use 2 or 3 times as many instances to get the same elapsed time as pangolin. (Not that I'm about to volunteer my credit card though ;)
Very interesting, thanks @AngieHinrichs for the deep dive! So you don't see a reason why GISAID couldn't use --usher mode? I'd very much appreciate it if there were fewer misclassifications like AY.4 popping up everywhere. Have you benchmarked the correctness of the assignments?
On a side note, over the weekend I've worked on an old idea of mine: a naive Bayes classifier. It works very well without much tuning; it's just a little slow because the calculations are quite brute force, reducing a 1440 x 180k matrix to a vector of 1440 rows (each Pango lineage is its own row; 180k = 30k nucleotides * 6 one-hot encodings (A, C, G, T, -, N)), and it's not optimised at all.
It should be possible to reduce the runtime by a factor of nearly 1000 by doing a sparse calculation, adding up only those columns of the matrix where the sequence deviates from the reference, so roughly 1440 x 50 additions instead of 1440 x 180k.
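Roughly, the sparse version would look something like this (a quick numpy sketch of the idea, not code from the repo; the matrix layout and names are just for illustration):

```python
import numpy as np

# log_probs: (n_lineages, n_positions, n_states) array of log P(state | lineage, position)
# ref_states: (n_positions,) integer-encoded reference base at each position
# log_prior: (n_lineages,) log prior per lineage

def reference_baseline(log_probs, ref_states, log_prior):
    """Precompute, once, the score every lineage gives to the reference sequence."""
    pos = np.arange(log_probs.shape[1])
    return log_prior + log_probs[:, pos, ref_states].sum(axis=1)

def score_query(log_probs, ref_states, baseline, query_diffs):
    """Score every lineage for one query, touching only the ~50 positions where the
    query deviates from the reference instead of all ~180k columns."""
    scores = baseline.copy()
    for pos, state in query_diffs:   # e.g. [(23403, STATE_G), ...] (illustrative encoding)
        scores += log_probs[:, pos, state] - log_probs[:, pos, ref_states[pos]]
    return scores                    # np.argmax(scores) -> most likely lineage
```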
It's quite early stage, I just started over the weekend. This project can be used for other things, like extracting defining mutations for each pango lineage automatically. This could be very useful for human review of assignments. I remember @matt-sd-watson asked for this a while ago here: https://github.com/cov-lineages/pangoLEARN/issues/29
The repo isn't properly documented at all, but if you're curious you may get the gist by going through the create_matrix.py file: https://github.com/corneliusroemer/lineage_classifier/blob/master/create_matrix.py
Actually I see that the mutations can simply be extracted from your tree, just no one seems to have published that yet. Will have a look next weekend, this sounds like a very handy thing to have around: https://github.com/cov-lineages/pangoLEARN/issues/29#issuecomment-935100782
I've experimented with Naive Bayes classifiers for pangolin in the past, but wasn't able to achieve the same speed or recall we can get with other models, even using quite stringent feature selection. However, if you're able to do so I'd be very interested in hearing about it! Keep me updated!
For info, at Sanger we found that UShER mode (which we love) with the latest mini-tree wasn't capturing AY.4.2 all that well for some reason (missing 30% of seqs). Obviously that will now be solved with Scorpio but it was a slight surprise to me anyway because I'd expected the fairly reliable 222V and the solid tree (at least on the big one) would work. Does anyone have any insights?
Edit: I've checked the public-latest.metadata.tsv.gz, which has these right, so I'm going to see if there's a problem at our end...
Edit 2: It's not as simple as problems at our end I think - gonna take this to email with @AngieHinrichs to avoid noise in this thread and can report back a summary
Edit 3: at least in part it was a problem on our end (old designation hashes) - so the strikethrough's back :)
@emilyscher Interesting, did you use the full-blown alignment, and the log of probability? Do you have some code somewhere, maybe? I would be interested in seeing what you tried! Speed, yes, that's a bit of an issue, but one could try the sparse calculation first and, if the result is ambiguous, do the full matrix multiplication on the close matches.
@theosanderson Thanks, that's very interesting! I should look into what the UShER classification tree looks like; one can probably do some diagnostics to find out what went wrong.
@all: Does anyone share results of pangoLEARN and UShER anywhere? At least for GenBank that should be allowed publicly. It'd be interesting to compare the two and see whether one could improve on either of them through bagging. Maybe throw in naive Bayes and we'd have 3 to aggregate? It may also be helpful to report assignment certainty quantitatively - I couldn't see that in the pangoLEARN output; is that something that could be added easily? Some sort of confidence score? If the confidence is below a given threshold, one could then assign the parent lineage. That could help with the AY.4 false-positive problem: it's much less harmful to wrongly call an AY.4 as B.1.617.2 than to call B.1.617.2 as AY.4 (at least in my eyes). Even better would be not to call it B.1.617.2 (which implies it is definitely not any AY) but B.1.617.2* [Delta, but not sure exactly which].
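Something like this toy sketch (purely illustrative; a real version would also need to expand Pango aliases, e.g. AY -> B.1.617.2):

```python
def fall_back_to_parent(lineage: str, confidence: float, threshold: float = 0.9) -> str:
    """If the classifier's confidence is below the threshold, report the parent lineage
    instead (alias expansion deliberately ignored here)."""
    if confidence >= threshold or "." not in lineage:
        return lineage
    return lineage.rsplit(".", 1)[0]

print(fall_back_to_parent("AY.4", confidence=0.6))   # -> "AY", i.e. Delta but unsure which sublineage
print(fall_back_to_parent("AY.4", confidence=0.95))  # -> "AY.4"
```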
Public calls:
(BTW to join the party I also have my own lineage classification project using neural nets which I think is looking promising, will post the link very soon, just training the next model)
Fab @theosanderson, looking forward to your neural net! I'll try adding it to Nextclade, which would work similarly to UShER with the advantage of being essentially free computationally (placement already happens for clade assignment anyway).
@theosanderson I would love to take a look at failed AY.4.2 calls if you could send me some example sequence names/IDs. [I wonder if requiring S:145H (as in cov-lineages/pango-designation#223) was causing trouble for ARTIC V3 sequences?]
Just to complicate things further, I think the example problem cases of AY.4 and AY.4.2 will both change dramatically with the latest versions of pangoLEARN and scorpio+constellations. We have added a constellation for both. This makes that problematic mutation site a prerequisite for AY.4 and should handle the amplicon dropout case for AY.4.2 both in usher and pangoLEARN mode.
To report back on our potential UShER issue - it was at our end. We were running in a weird configuration (which one would get from a fresh pangolin install between 15 Oct and 19 Oct) with the old lineage designation hashes but the new UShER tree. So old lineage designations would replace many of the (good) UShER results. Thanks to @AngieHinrichs for a lot of help with identifying this, and sorry for the confusion -- we did know that PangoLearn and UShER would be out of sync but didn't understand the full implications.
Yes - the hashes are generated during pangoLEARN training and make sure that sequences which are identical to designated sequences get the designated lineage. They were out of sync for longer than usual this time around because of solving the AY.4 and AY.4.2 problems. I guess you have to weigh up the pros and cons of having early access to the UShER tree by pulling the tip of master vs pulling releases.
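Conceptually the hash step is just an exact-match lookup, something like this (a rough sketch, not the actual pangolin code):

```python
import hashlib

def seq_hash(aligned_seq: str) -> str:
    return hashlib.md5(aligned_seq.encode()).hexdigest()

# Built at training time from the designated sequences in pango-designation
designated = {seq_hash("ACTG..."): "AY.4"}           # placeholder entry

def assign(aligned_seq: str):
    # An exact hash match returns the designated lineage directly;
    # None means fall through to pangoLEARN/UShER.
    return designated.get(seq_hash(aligned_seq))
```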
Definitely more cons than pros, at least when running with hash-mode on :) . Our mistake, and thanks all.
But in other news: I think the neural net is somewhat ready for inspection: https://github.com/theosanderson/armadillin (working name in tribute to pangolin, but can be changed if it's not appreciated -- and definitely keen on the integration option - let's talk!)
If anyone wants to try it out I'd be very keen to hear any experiences. Basically the idealised aim is to combine the speed of Pangolin with the accuracy of UShER. It gets through 2000 sequences per second on my machine without using too much CPU. But it's early days and I've mostly tested it on COG UK data so it's possible it's overfitting. (This model is trained on pango-designations master just after AY.4.2 designation so doesn't have more recent lineages).
Hi @theosanderson,
Thanks for posting, I'm excited to take a look! First I have a few thoughts, would be great to get some clarification
I hope that that's helpful, let me know if you'd like to discuss this further!
PS Forgot to ask how you're doing feature selection. This is something I'm constantly messing with so it would be very useful to hear any ideas you might have!
Hey @emilyscher ! Thanks for the interest! And for the helpful points!
As I said it's super early days and it's possible there are catastrophic bugs, etc. etc.
> What is the current training time?
The current training graph on a GPU looks like this:
GPUs are obviously a pain but I haven't done much to optimise this and I think it could get a lot better.
I'm not sure if I am training on a smaller training set - or at least I don't think training set size is the issue (it's arbitrarily chosen at the moment, rather than chosen to limit training time). One thing I did struggle with, decision-wise, was how to renormalise the dataset. I do something where I scale down the most common lineages a lot and resample the rare ones, but not to the extent of making a training set with totally equal numbers for each lineage. In general there's a lot more to do to validate how it does on rare lineages with few designations.
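Roughly what I mean by the rebalancing, as a toy sketch (made-up numbers, and not the actual armadillin code):

```python
import numpy as np

def resampling_weights(lineage_counts, alpha=0.3):
    """Sample lineages with probability proportional to count**alpha: alpha < 1 flattens
    the distribution without making it completely uniform."""
    names = list(lineage_counts)
    probs = np.array([lineage_counts[n] for n in names], dtype=float) ** alpha
    return dict(zip(names, probs / probs.sum()))

# Common lineages get scaled down a lot; rare ones are oversampled relative to their raw share.
print(resampling_weights({"B.1.1.7": 500_000, "AY.4": 200_000, "Q.1": 300}))
```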
> How are you validating your model, and what statistics are you generating?
Yes, so I'm mostly looking at precision, recall and F1. Right now I don't have a good train/test split for proper validation. (Just sharing early to give a sense of what things look like and have these discussions.)
> One of the reasons I've stayed away from neural nets for the moment (other than training time), is that they're very difficult to interpret.
Yes, this makes sense. I have done stuff on this in the past so do have ideas on how interpretation can work, but I agree it's a good concern, and also an interesting thing to look at.
> Overfitting
Thanks for your instincts - yes basically I just wanted to generally caveat that this is early days and there may be horrors lurking.
> Feature selection
Re feature selection, I do this after training. What I do is train a neural net on all of the one-hotted sequences. I have an initial layer that multiplies the one-hots by learnt weights (i.e. a single variable for each feature that it gets multiplied by). I have an L1 loss on those weights, and the intuition (I'm not actually sure how important the loss is) is that these weights are in general pushed down towards 0, except where that feature is important.
Then once this initial network is trained I do another round of training (which is super fast) to sparsify that multiplication layer, i.e. the weights that were close to 0 get actually set to 0 during training, so the network can adapt if needed. I set 95% of these to 0.
Then essentially the features whose weights are still non-zero are the selected features. I can then change the dense layer below to only do the calculations for those features.
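As a rough sketch of what I mean (not the actual armadillin code, and the sizes are just placeholders):

```python
import torch
import torch.nn as nn

N_FEATURES = 180_000   # one-hot encoded positions x states (placeholder size)
N_LINEAGES = 1_500     # placeholder number of lineages

class FeatureGate(nn.Module):
    """One learnt weight per input feature, multiplied elementwise onto the one-hot input."""
    def __init__(self, n_features):
        super().__init__()
        self.gate = nn.Parameter(torch.ones(n_features))

    def forward(self, x):          # x: (batch, n_features)
        return x * self.gate

gate = FeatureGate(N_FEATURES)
net = nn.Sequential(nn.Linear(N_FEATURES, 256), nn.ReLU(), nn.Linear(256, N_LINEAGES))

def loss_fn(x, y, l1_weight=1e-4):
    # Cross-entropy on lineage labels plus an L1 penalty pushing the gates towards 0.
    logits = net(gate(x))
    return nn.functional.cross_entropy(logits, y) + l1_weight * gate.gate.abs().sum()

# After the first round of training: zero out the 95% smallest gates, then do a second,
# fast round of training so the network can adapt to the pruned features.
with torch.no_grad():
    cutoff = gate.gate.abs().quantile(0.95)
    gate.gate[gate.gate.abs() < cutoff] = 0.0
# The features whose gates are still non-zero are the selected features; the first Linear
# layer can then be sliced down to just those columns for fast inference.
```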
It's kinda hard to explain but I hope that makes a little sense, and happy to discuss much more.
Thanks again :)
Edit: Oh something else I do is to randomly drop out 10% of features during each training step (i.e. simulating an N)
Hiya, I put a COG-UK -> PangoUsher (and PangoLearn, but the definitive source for that is the COG-UK dataset) mapping for Sanger data up at https://github.com/theosanderson/lineage_comparison in case it's useful for analysing trade-offs etc.
P.S. Armadillin defs still has some work to do - I got worried about rare lineages but am probably now oversampling them hence B.1.1.7 -> Q.1 artifacts - wip
Whenever I run UShER placements I notice that the UShER pango assignments are not only more up to date (they don't need to wait for a pangoLEARN release) but also more reliable (at least that's what it looks like to me).
What's the reason that Usher mode is not promoted as the primary method, to be used by GISAID etc.? Is there a downside to Usher mode? Or is it just for continuity reasons that the decision trees are still used by GISAID?