cov-lineages / pangoLEARN

Store of the trained model for pangolin to access.

Why is Usher mode not made primary mode for GISAID etc? #32

Open corneliusroemer opened 3 years ago

corneliusroemer commented 3 years ago

Whenever I run Usher placements I notice that the Usher pango assignments are not only more up to date (they don't need to wait for PangoLEARN release) but also more reliable (at least that's what it looks like to me).

What's the reason that Usher mode is not promoted as the primary method, to be used by GISAID etc.? Is there a downside to Usher mode? Or is it just for continuity reasons that the decision trees are still used by GISAID?

aineniamh commented 3 years ago

Speed is one of the main reasons. In an ideal world, I'd love to be running both approaches and cross-referencing them to give a full-picture lineage assignment. Conflicting assignments would also help us to troubleshoot.

The GISAID assignment pipeline runs the entire database through the latest model each time it's updated (as it should) and it wouldn't be feasible to run the UShER mode of pangolin as the main mode for this.

In terms of which is most up to date, the UShER and pangoLEARN models should be updated on the same timeframe: @AngieHinrichs updates the protobuf file when we run a new model training, as far as I'm aware. Models are trained on a biweekly basis, but we plan to move towards doing this once a week, or in line with new designations.

The latest pangoLEARN training has actually had a big increase in accuracy too, with developments to the training pipeline cutting out a lot of the noise from designations that are missing key SNPs etc, and we're exploring other approaches to assigning lineages going forward. We don't have any major attachment for continuity reasons; if we (or others) find another approach that's better and can assign that many sequences in a similar timeframe, we'd be happy to incorporate it! @emilyscher has put a lot of work into trialling alternative ML models and the decision tree keeps coming out on top!

rmcolq commented 3 years ago

I would add that @AngieHinrichs does usually push the latest trained UShER protobuf before we push the trained pangoLEARN model and release. Some of the delay with us releasing is to do with running checks - we are aware that the moment we release a pangoLEARN model, many people pull in the changes immediately. We like to be aware of any lineages that see big jumps in the numbers assigned to/from them with each release, and to investigate them if unexpected, before it explodes on Twitter.

AngieHinrichs commented 3 years ago

Thank you @aineniamh and @rmcolq. @corneliusroemer, the pangolin --usher assignments use an UShER tree file in the pangoLEARN repository and I've been trying to keep that in sync with the version of pango-designation used to train pangoLEARN before each pangoLEARN release. You might see some new lineage assignments appearing in advance of pangoLEARN releases (which go through more testing as Rachel said) in the UShER web interface.

Another general observation: UShER is much better at calling more recent lineages (say from ~June 2021) than earlier lineages (say ~March 2021 and before) since more recent lineages have been defined with input from the UCSC/UShER tree, so they are monophyletic on that tree, while earlier lineages were called on COG-UK trees and some are not monophyletic on the UCSC tree. There are some lineages (20ish?) that just can't be assigned satisfactorily to a node of the UCSC/UShER tree, so can't be assigned by pangolin --usher. I've been meaning for months to ask for some lineages to be withdrawn & others updated in pango-designation but... time & priorities.

Regarding speed, it's true that --usher mode is significantly slower than pangoLEARN, although it can take better advantage of multiple CPU cores if elapsed time is the highest priority. Here are the results of a quick comparison: for a set of 10,000 randomly selected lineage representative sequences, using the --skip-designation-hash option to make sure that pangoLEARN or usher was run for each sequence, these are the approximate run times in minutes averaged across 3 runs at several thread counts (-t):

| Threads (-t) | default (pangoLEARN), minutes | --usher, minutes |
| --- | --- | --- |
| 1 (default) | 5 | 80 |
| 4 | 3.5 | 30.5 |
| 16 | 3.1 | 12.3 |
| 64 | 2.8 | 6.6 |
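
For anyone who wants to reproduce a comparison along these lines, a minimal sketch might look like the following (assuming pangolin is on the PATH; `reps_10k.fasta` is a hypothetical file of the 10,000 representative sequences):

```python
# Minimal timing sketch using the flags mentioned above; reps_10k.fasta is a
# hypothetical FASTA of 10,000 designated representative sequences.
import statistics
import subprocess
import time

def time_pangolin(fasta, threads, usher=False, runs=3):
    cmd = ["pangolin", fasta, "-t", str(threads), "--skip-designation-hash"]
    if usher:
        cmd.append("--usher")
    minutes = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(cmd, check=True)
        minutes.append((time.perf_counter() - start) / 60)
    return statistics.mean(minutes)

for t in (1, 4, 16, 64):
    print(t, time_pangolin("reps_10k.fasta", t), time_pangolin("reps_10k.fasta", t, usher=True))
```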

GISAID would be running pangolin without the --skip-designation-hash option of course. The designation hash would bypass pangoLEARN/usher for the 788k representative sequences in pango-designation/lineages.csv. GISAID currently has 4,456,471 sequences, and of course that will continue to grow. For simplicity's sake, say ~3.7 million sequences need to be evaluated by pangoLEARN or usher after the next pangoLEARN release. Multiplying the run times per 10k sequences in the table above by 370 (3.7M / 10k), we get a very back-of-the-envelope estimate of the CPU minutes for 3.7 million sequences:

| Threads (-t) | default (pangoLEARN), CPU minutes | --usher, CPU minutes |
| --- | --- | --- |
| 1 (default) | 1850 | 29600 |
| 4 | 1295 | 11285 |
| 16 | 1147 | 4551 |
| 64 | 1036 | 2442 |
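
As a quick sanity check on the scaling, this table is just the first one multiplied by 370 (3.7M / 10k):

```python
# Scale the per-10k run times (minutes) up to ~3.7 million sequences.
per_10k_minutes = {1: (5.0, 80.0), 4: (3.5, 30.5), 16: (3.1, 12.3), 64: (2.8, 6.6)}
for threads, (pangolearn, usher) in per_10k_minutes.items():
    print(threads, round(pangolearn * 370), round(usher * 370))
# -> 1850/29600, 1295/11285, 1147/4551, 1036/2442, as in the table above
```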

So, wandering further into back-of-the-envelope land, how do those translate into $ on e.g. an Amazon m5n.16xlarge (64 CPUs, $0.6582 per hour spot instance = $0.01097 per minute), assuming for simplicity 64 1-thread jobs, 16 4-thread jobs and so on running simultaneously?

| Threads (-t) | default (pangoLEARN) | --usher |
| --- | --- | --- |
| 1 (default) | 1850 * 0.01097 / 64 = $0.317102 | 29600 * 0.01097 / 64 = $5.073625 |
| 4 | 1295 * 0.01097 / 16 = $0.887884 | 11285 * 0.01097 / 16 = $7.737278 |
| 16 | 1147 * 0.01097 / 4 = $3.145648 | 4551 * 0.01097 / 4 = $12.481117 |
| 64 | 1036 * 0.01097 = $11.364920 | 2442 * 0.01097 = $26.788740 |
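
The dollar figures follow directly from the CPU-minute table (a sanity check on the arithmetic, not a statement about GISAID's actual setup):

```python
# m5n.16xlarge spot at $0.6582/hour, with 64/t single-instance jobs in parallel
# at thread count t, so each job pays 1/(64/t) of the instance cost.
RATE_PER_MINUTE = 0.6582 / 60  # ~$0.01097
cpu_minutes = {1: (1850, 29600), 4: (1295, 11285), 16: (1147, 4551), 64: (1036, 2442)}
for threads, (pangolearn, usher) in cpu_minutes.items():
    jobs_per_instance = 64 // threads
    print(threads,
          f"${pangolearn * RATE_PER_MINUTE / jobs_per_instance:.2f}",
          f"${usher * RATE_PER_MINUTE / jobs_per_instance:.2f}")
# -> roughly $0.32 vs $5.07, $0.89 vs $7.74, $3.15 vs $12.48, $11.36 vs $26.79
```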

That was very hasty and I probably made some mistakes in there (not to mention oversimplifying), but hopefully not off by too many orders of magnitude. I have no idea what kind of compute resources GISAID uses, but if they are already using a cloud, then I think it shouldn't break the bank for GISAID to use pangolin --usher -t 64, even if they use 2 or 3 times as many instances to get the same elapsed time as pangolin. (Not that I'm about to volunteer my credit card though ;)

corneliusroemer commented 3 years ago

Very interesting, thanks @AngieHinrichs for the deep dive! So you don't see a reason why GISAID couldn't use --usher mode? I'd very much appreciate it if there were fewer misclassifications like AY.4 popping up everywhere. Have you benchmarked the correctness of assignments?

On a side note, over the weekend I worked on an old idea of mine: a naive Bayes classifier. It works very well without much tuning; it's just a little bit slow since the calculations are quite brute force, reducing a 1440 x 180k matrix to a vector of 1440 rows (each Pango lineage gets its own row; 180k = 30k nucleotides * 6 one-hot encodings for A, C, G, T, -, N), and it's not optimised at all.

It may be possible to reduce the runtime by a factor of nearly 1000 by doing a sparse calculation that adds up only those columns of the matrix where the sequence deviates from the reference, so 1440 x ~50 additions instead of 1440 x 180k.
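
To make the sparse idea concrete, here is a rough NumPy sketch (an illustration, not code from the linked repo): the dense score is a full matrix-vector product, while the sparse version starts from a precomputed reference score and only swaps in the columns where the query differs from the reference.

```python
# Dense vs. sparse scoring for a naive Bayes lineage classifier, assuming
# log_probs is an (n_lineages, 30000*6) array of per-lineage log-probabilities
# over one-hot columns (A, C, G, T, -, N at each genome position).
import numpy as np

def score_dense(log_probs, one_hot_query):
    """one_hot_query: flat 0/1 vector of length 30000*6."""
    return log_probs @ one_hot_query  # (n_lineages,) log-likelihoods

def score_sparse(log_probs, ref_score, ref_cols_changed, query_cols_changed):
    """ref_score = score_dense(log_probs, one_hot_reference), precomputed once.
    ref_cols_changed / query_cols_changed: column indices for the reference
    state vs. the query state at the ~50 positions where the query differs."""
    return (ref_score
            - log_probs[:, ref_cols_changed].sum(axis=1)
            + log_probs[:, query_cols_changed].sum(axis=1))
```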

It's quite early stage, I just started over the weekend. This project can be used for other things, like extracting defining mutations for each pango lineage automatically. This could be very useful for human review of assignments. I remember @matt-sd-watson asked for this a while ago here: https://github.com/cov-lineages/pangoLEARN/issues/29

The repo isn't properly documented at all, but if you're curious you may get the gist by going through the create_matrix.py file: https://github.com/corneliusroemer/lineage_classifier/blob/master/create_matrix.py

Actually I see that the mutations can simply be extracted from your tree, just no one seems to have published that yet. Will have a look next weekend, this sounds like a very handy thing to have around: https://github.com/cov-lineages/pangoLEARN/issues/29#issuecomment-935100782

emilyscher commented 3 years ago

I've experimented with Naive Bayes classifiers for pangolin in the past, but wasn't able to achieve the same speed or recall we can get with other models, even using quite stringent feature selection. However, if you're able to do so I'd be very interested in hearing about it! Keep me updated!

theosanderson commented 3 years ago

For info, at Sanger we found that UShER mode (which we love) with the latest mini-tree wasn't capturing AY.4.2 all that well for some reason (missing 30% of seqs). Obviously that will now be solved with Scorpio but it was a slight surprise to me anyway because I'd expected the fairly reliable 222V and the solid tree (at least on the big one) would work. Does anyone have any insights?

Edit: I've checked public-latest.metadata.tsv.gz, which has these right, so I'm going to see if there's a problem at our end.
Edit 2: It's not as simple as a problem at our end, I think - I'm going to take this to email with @AngieHinrichs to avoid noise in this thread, and can report back a summary.
Edit 3: At least in part it was a problem on our end (old designation hashes) - so the strikethrough's back :)

corneliusroemer commented 3 years ago

@emilyscher Interesting, did you use the full-blown alignment and log probabilities? Do you have some code somewhere, maybe? I'd be interested in seeing what you tried! Speed, yes, that's a bit of an issue, but one could try the sparse calculation first and, if the result is ambiguous, do the full matrix multiplication on the close matches.

@theosanderson Thanks, that's very interesting! I should look into what the UShER classification tree looks like; one can probably do some diagnostics to find out what went wrong.

@all: Does anyone share results of pangoLEARN and UShER anywhere? At least for GenBank that should be allowed publicly. It'd be interesting to compare the two and see whether one can improve on either of them through bagging. Maybe throw in naive Bayes and we'd have three to aggregate? It may also be helpful to report assignment certainty quantitatively; I couldn't see that in the pangoLEARN output. Is that something that could be added easily, some sort of confidence score? If the confidence is below a given threshold, one could then assign the parent lineage instead, which could help with the AY.4 false-positive problem. It's much less harmful to wrongly call an AY.4 as B.1.617.2 than to call a B.1.617.2 as AY.4 (at least in my eyes). Even better would be to call it not as B.1.617.2 (which implies it is definitely not any AY sublineage) but as B.1.617.2* [Delta, but not sure exactly which].
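
A rough sketch of that fallback idea, with entirely hypothetical names (this is not pangolin's API):

```python
# Hypothetical confidence-based fallback: if the top call isn't confident
# enough, report the parent lineage (with a star) instead of the sublineage.
def conservative_call(lineage, confidence, parent_of, threshold=0.9):
    """parent_of: dict mapping each Pango lineage to its parent, e.g. {"AY.4": "B.1.617.2"}."""
    if confidence >= threshold:
        return lineage
    parent = parent_of.get(lineage)
    return f"{parent}*" if parent else lineage  # "B.1.617.2*" = Delta, sublineage uncertain

print(conservative_call("AY.4", 0.55, {"AY.4": "B.1.617.2"}))  # -> "B.1.617.2*"
```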

theosanderson commented 3 years ago

Public calls:

(BTW to join the party I also have my own lineage classification project using neural nets which I think is looking promising, will post the link very soon, just training the next model)

corneliusroemer commented 3 years ago

Fab @theosanderson, looking forward to your neural net! I'll try adding it to Nextclade, which would work similarly to UShER with the advantage of being essentially free (placement already happens for clade assignment anyway).

AngieHinrichs commented 3 years ago

@theosanderson I would love to take a look at the failed AY.4.2 calls if you could send me some example sequence names/IDs. [I wonder if requiring S:145H (as in cov-lineages/pango-designation#223) was causing trouble for ARTIC V3 sequences?]

rmcolq commented 3 years ago

Just to complicate things further, I think the example problem cases of AY.4 and AY.4.2 will both change dramatically with the latest versions of pangoLEARN and scorpio+constellations. We have added a constellation for both. This makes that problematic mutation site a prerequisite for AY.4 and should handle the amplicon dropout case for AY.4.2 both in usher and pangoLEARN mode.

theosanderson commented 3 years ago

To report back on our potential UShER issue - it was at our end. We were running in a weird configuration (which one would get from a fresh pangolin install between 15 Oct and 19 Oct) with the old lineage designation hashes but the new UShER tree. So old lineage designations would replace many of the (good) UShER results. Thanks to @AngieHinrichs for a lot of help with identifying this, and sorry for the confusion -- we did know that PangoLearn and UShER would be out of sync but didn't understand the full implications.

rmcolq commented 3 years ago

Yes - the hashes are generated during pangoLEARN training and make sure that sequences which are identical to designated sequences get the designated lineage. They were out of sync for longer than usual this time around because of solving the AY.4 and AY.4.2 problems. I guess you have to weigh up the pros and cons of getting early access to the UShER tree by pulling the tip of master vs pulling releases.
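
For readers unfamiliar with the hashes, the idea is roughly this (an illustration, not pangolin's actual implementation):

```python
# Illustrative only: designated sequences are hashed at training time, and any
# query identical to one of them gets that lineage by lookup, skipping the model.
import hashlib

def seq_hash(sequence: str) -> str:
    return hashlib.md5(sequence.strip().upper().encode()).hexdigest()

def assign(sequence, hash_to_lineage, run_model):
    """hash_to_lineage: dict built during training; run_model: pangoLEARN or UShER."""
    lineage = hash_to_lineage.get(seq_hash(sequence))
    return lineage if lineage is not None else run_model(sequence)
```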

theosanderson commented 3 years ago

Definitely more cons than pros, at least when running with hash-mode on :) . Our mistake, and thanks all.

theosanderson commented 3 years ago

But in other news: I think the neural net is somewhat ready for inspection: https://github.com/theosanderson/armadillin (working name in tribute to pangolin, but can be changed if it's not appreciated -- and definitely keen on the integration option - let's talk!)

If anyone wants to try it out I'd be very keen to hear any experiences. Basically the idealised aim is to combine the speed of Pangolin with the accuracy of UShER. It gets through 2000 sequences per second on my machine without using too much CPU. But it's early days and I've mostly tested it on COG UK data so it's possible it's overfitting. (This model is trained on pango-designations master just after AY.4.2 designation so doesn't have more recent lineages).

emilyscher commented 3 years ago

Hi @theosanderson,

Thanks for posting, I'm excited to take a look! First I have a few thoughts; it would be great to get some clarification:

  1. What is the current training time? I understand you're using a smaller training set than we are, so it would be good to get an idea of how often your model could be updated with the full training set
  2. How are you validating your model, and what statistics are you generating? At the moment we publish recall/precision/F1 stats for pangoLEARN, though I'm also generating more for my own validation. It would be useful to know how you've been comparing your model to the decision tree. At the moment, the decision tree has accuracy, precision, and recall all in the very high 90s, so it's quite a high threshold to beat, but I'm sure it can be done!
  3. One of the reasons I've stayed away from neural nets for the moment (other than training time), is that they're very difficult to interpret. Have you tried to interpret any of the learned parameters, and has this yielded anything interesting? I know people have different preferred ways of going about this, so it would be great to discuss this further with you in the future.
  4. Overfitting is, in this case and in my opinion, not a massive problem. The decision tree (like most) is overfit, but that in turn means it does a great job of representing exactly the underlying tree structure behind the data. The training data is so well curated that we don't have to worry about some of the negative effects of overfitting, though of course some of them can be problematic (for example, the decision tree handling unseen data in non-ideal ways). So for your model I would expect that overfitting is also not necessarily a massive problem, as long as you're able to scale it up to train on the entire data set, and retraining it is lightweight enough that it can be done for each data release. If the training time is so long that it can't be done very frequently, then it's a different story.

I hope that that's helpful, let me know if you'd like to discuss this further!

emilyscher commented 3 years ago

PS Forgot to ask how you're doing feature selection. This is something I'm constantly messing with so it would be very useful to hear any ideas you might have!

theosanderson commented 3 years ago

Hey @emilyscher ! Thanks for the interest! And for the helpful points!

As I said it's super early days and it's possible there are catastrophic bugs, etc. etc.

What is the current training time?

The current training graph on a GPU looks like this: [W&B training chart, 20/10/2021 14:32]

GPUs are obviously a pain but I haven't done much to optimise this and I think it could get a lot better.

I'm not sure if I am training on a smaller training set. Or at least I don't think training set size is the issue (it's arbitrarily chosen atm, rather than chosen to limit training time). One decision I did struggle with was how to renormalise the dataset. I do something where I scale down the most common lineages a lot and resample the rare ones, but not to the extent of making a training set with totally equal numbers for each lineage. In general there's a lot more to do to validate how it does on rare lineages with few designations.
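
One simple way to express that kind of partial rebalancing (a sketch, not armadillin's code) is to raise the lineage counts to a power between 0 and 1 when computing sampling weights:

```python
# Partial rebalancing sketch: power=1 keeps the raw distribution, power=0 would
# equalise all lineages; something in between damps common lineages and
# upweights rare ones without forcing exact equality.
import numpy as np

def per_sequence_weights(lineage_counts, power=0.5):
    lineages = list(lineage_counts)
    counts = np.array([lineage_counts[l] for l in lineages], dtype=float)
    target_mass = counts ** power
    return dict(zip(lineages, target_mass / counts))  # sampling weight per sequence

weights = per_sequence_weights({"B.1.1.7": 500_000, "AY.4.2": 2_000, "C.37": 50})
```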

How are you validating your model, and what statistics are you generating?

Yes, so I'm mostly looking at precision, recall and F1. Right now I don't have a good train/test split for proper validation. (I'm just sharing early to give a sense of what things look like and to have these discussions.)
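
For concreteness, the per-lineage precision/recall/F1 numbers can be produced with scikit-learn once there is a proper held-out split (toy labels below):

```python
# Toy example of a per-lineage precision/recall/F1 report; a real evaluation
# would use a held-out test split of designated sequences.
from sklearn.metrics import classification_report

y_true = ["B.1.1.7", "AY.4", "AY.4.2", "B.1.617.2", "AY.4.2"]
y_pred = ["B.1.1.7", "AY.4", "AY.4",   "B.1.617.2", "AY.4.2"]
print(classification_report(y_true, y_pred, zero_division=0))
```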

One of the reasons I've stayed away from neural nets for the moment (other than training time), is that they're very difficult to interpret.

Yes, this makes sense. I have done stuff on this in the past so do have ideas on how interpretation can work, but I agree it's a good concern, and also an interesting thing to look at.

Overfitting

Thanks for your instincts - yes basically I just wanted to generally caveat that this is early days and there may be horrors lurking.

Feature selection

Re feature selection, I do this after training. What I do is train a neural net on all of the one-hotted sequences. I have an initial layer that multiplies the one-hots by learnt weights (i.e. a single variable for each feature that it gets multiplied by). I have an L1 loss on those weights, and the intuition (I'm not actually sure how important the loss is) is that these weights are in general pushed down towards 0, except where that feature is important.

Then once this initial network is trained I do another round of training (which is super fast) to sparsify that multiplication layer, i.e. the weights that were close to 0 actually get set to 0 during training, so the network can adapt if needed. I set 95% of these weights to 0.

Then essentially the features where the weights are still non-zero are the selected features. I can then change the dense layer below to only do the calculations for those features.

It's kinda hard to explain but I hope that makes a little sense, and happy to discuss much more.

Thanks again :)

Edit: Oh something else I do is to randomly drop out 10% of features during each training step (i.e. simulating an N)
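
Here is one way to read that scheme, written out as a loose PyTorch sketch (a paraphrase, not armadillin's actual code): a per-feature scale vector with an L1 penalty sits in front of the dense layers, ~10% of input features are randomly zeroed each training step to mimic Ns, and after the first round only the largest 5% of scales are kept.

```python
# Loose re-sketch of the described feature-selection layer (illustrative only).
import torch
import torch.nn as nn

class FeatureGate(nn.Module):
    def __init__(self, n_features):
        super().__init__()
        self.scale = nn.Parameter(torch.ones(n_features))     # one learnt weight per feature
        self.register_buffer("keep", torch.ones(n_features))  # 0/1 mask, all ones at first

    def forward(self, x):  # x: (batch, n_features) float one-hot input
        if self.training:
            x = x * (torch.rand_like(x) > 0.10).float()  # simulate ~10% Ns per step
        return x * self.scale * self.keep

    def l1_penalty(self):
        return self.scale.abs().sum()  # add (scaled) to the training loss

    def sparsify(self, keep_fraction=0.05):
        # second training round: hard-zero all but the largest 5% of |scale|
        k = max(1, int(keep_fraction * self.scale.numel()))
        threshold = self.scale.abs().topk(k).values.min()
        self.keep.copy_((self.scale.abs() >= threshold).float())
```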

theosanderson commented 3 years ago

Hiya, I put a COG-UK -> PangoUsher (and PangoLearn, but the definitive source for that is the COG-UK dataset) mapping for Sanger data up at https://github.com/theosanderson/lineage_comparison in case it's useful for analysing trade-offs etc.

P.S. Armadillin defs still has some work to do - I got worried about rare lineages but am probably now oversampling them hence B.1.1.7 -> Q.1 artifacts - wip