Right now sequences are sampled roughly in proportion to their abundance. Maybe this is the best way to do things, but it may mean it takes a long time to see new antigenic variants arising. Would sampling sequences by Pango classification be better?
I like this idea. I've implemented subsampling by Pango lineage in 4d6d7b7500122cf24f1480403a4d1e2a158acfba. I think emphasizing diversity to start with will be helpful.
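To make the idea concrete, here is a minimal sketch of what capping the number of sequences per Pango lineage could look like. It is only an illustration and is not the code in the commit above: the function name `subsample_by_lineage`, the `(name, lineage)` record format, and the `max_per_lineage` parameter are all assumptions made for this example.

```python
import random
from collections import defaultdict


def subsample_by_lineage(records, max_per_lineage, seed=0):
    """Keep at most `max_per_lineage` sequences from each Pango lineage.

    `records` is an iterable of (sequence_name, pango_lineage) pairs.
    This record format is hypothetical, chosen only for the sketch.
    """
    rng = random.Random(seed)  # fixed seed so the subsample is reproducible

    # Group sequence names by their assigned lineage.
    by_lineage = defaultdict(list)
    for name, lineage in records:
        by_lineage[lineage].append(name)

    # Draw up to `max_per_lineage` names from each lineage, so rare
    # lineages are kept in full and abundant ones are capped.
    sampled = []
    for lineage in sorted(by_lineage):
        names = by_lineage[lineage]
        k = min(max_per_lineage, len(names))
        sampled.extend(rng.sample(names, k))
    return sampled


# Toy input: one abundant lineage and two rare ones.
records = (
    [(f"seq_a{i}", "B.1.1.7") for i in range(100)]
    + [(f"seq_b{i}", "BA.1") for i in range(3)]
    + [("seq_c0", "XBB")]
)
picked = subsample_by_lineage(records, max_per_lineage=2)
```

With abundance-proportional sampling the toy set above would be dominated by the abundant lineage, whereas the per-lineage cap keeps every lineage represented. The trade-off is that the subsample no longer reflects how common each lineage is.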