Make events genome build aware

bjhall commented 4 years ago

As you may have noticed, we made a premature decision to move to hg38, which we kind of already regret. :) But we're trying to push through; hopefully it will be slightly easier for other Scout users to make the move after our many failures.

Anyway. I think there are only a couple of major outstanding issues. This issue discusses one of them.

At the moment many "events" are genome build specific (since they are using genomic positions). Some of them need to be crossmatched between different cases. I am guessing at least global comments, acmg, mark_causatives, maybe some others.

If it should at all be allowed to have both hg19 and hg38 cases on the same scout server this needs to be addressed somehow.

I think the "easiest" solution would be to tag each event with a "genome_build" field, based on the genome_build of the case where the event was performed. And then only fetch events from that genome_build everywhere (I haven't checked but I am guessing this will require a lot of changes all over the code!).
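A minimal sketch of that tagging idea, using plain dicts instead of the real database layer. The helper names `build_event` and `events_for_subject` are hypothetical, and defaulting to "37" when a case has no `genome_build` is an assumption, not existing Scout behaviour:

```python
from typing import Dict, List


def build_event(case: Dict, subject: str, verb: str, content: str = "") -> Dict:
    """Create an event document that remembers the build of its case."""
    return {
        "case": case["_id"],
        "subject": subject,
        "verb": verb,
        "content": content,
        # Hypothetical new field, copied from the case when the event is created.
        # Falling back to "37" for cases without a build is an assumption.
        "genome_build": str(case.get("genome_build", "37")),
    }


def events_for_subject(events: List[Dict], subject: str, genome_build: str) -> List[Dict]:
    """Return only the events for this subject that were made in the same build."""
    return [
        event
        for event in events
        if event["subject"] == subject and event.get("genome_build") == genome_build
    ]


case_a = {"_id": "case_a", "genome_build": "37"}
case_b = {"_id": "case_b", "genome_build": "38"}

# The same positional string means different loci in the two builds.
events = [
    build_event(case_a, "7_140453136_A_T", "comment", "global comment made on hg19"),
    build_event(case_b, "7_140453136_A_T", "comment", "global comment made on hg38"),
]

hits = events_for_subject(events, "7_140453136_A_T", "38")
print([event["case"] for event in hits])
```

With the filter in place, a comment made at an hg19 position is never shown on an hg38 variant that happens to share the same coordinates.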

This will, however, result in essentially two parallel databases of events, and it would be up to the admin to do liftovers between them. Personally I would be fine with that since we will never run a hg19 case again after the switch, so I would only need to liftover once.

Any thoughts?

northwestwitch commented 4 years ago

Hi @bjhall, I wouldn't call your decision premature. If you start afresh, then I think it's the best choice you can make, even if it might be painful at the beginning. One day or another you have to switch to the new build, and the more work you have done on the old one, the more fixes it takes to port all these analyses to the new one. This is exactly where we in Stockholm are: thousands of cases from the hospital run in hg19, and plans to start new analyses with hg38 while still keeping the output of the two analyses compatible and usable for the users. I don't know how we will solve this; it's still debated.

I'm really sorry for the difficulties you are facing, but since this is a joint project please continue to share your ideas and suggestions.

We had a few discussions here about the switch and I don't think we ever really reached a conclusion. The best scenario of course would be to have all the cases in just one build. Options:

- Re-run all cases with the new build (this will take a looong time).
- Re-run unsolved cases with the new build (this will also take a long time, but the new variant calls might help solve some of them, so perhaps it would be worth it).
- Do a liftover of the variants and events for all cases (this feels like an approximate solution to a more complex problem).
- Keep cases in two builds, with the option for users to re-run cases of their choice, and just accept that cross-variant comparison between the two builds is not possible (and neither are loqusdb cross-checks between builds).
- Create some system of liftover that would allow you to compare old cases with the new (HARD!).
- A system that, whenever you load a case in hg38, also saves the variant coordinates in hg19 (if found) at the variant level, in a key similar to simple_id. One could have simple_id_38 and simple_id. That would be good because it would allow you to compare with the events, where you have the field named subject, which looks like simple_id.
- ..?
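A rough sketch of the simple_id_38 / simple_id option, assuming simple_id has the shape chrom_pos_ref_alt and that ref and alt are unchanged by the liftover (not always true in practice). The function names and the coordinates are illustrative only, and the hg19 position is passed in rather than computed:

```python
from typing import Dict, List, Optional


def make_simple_id(chrom: str, pos: int, ref: str, alt: str) -> str:
    """Join the variant fields into an id of the assumed chrom_pos_ref_alt shape."""
    return "_".join([chrom, str(pos), ref, alt])


def add_build_ids(variant: Dict, hg19_pos: Optional[int]) -> Dict:
    """Store the hg38 id always, and the hg19 id only when a liftover was found."""
    chrom, ref, alt = variant["chromosome"], variant["reference"], variant["alternative"]
    variant["simple_id_38"] = make_simple_id(chrom, variant["position"], ref, alt)
    if hg19_pos is not None:
        variant["simple_id"] = make_simple_id(chrom, hg19_pos, ref, alt)
    return variant


def matching_old_events(variant: Dict, events: List[Dict]) -> List[Dict]:
    """Find hg19-era events whose subject equals the variant's hg19 id."""
    hg19_id = variant.get("simple_id")
    if hg19_id is None:
        return []
    return [event for event in events if event["subject"] == hg19_id]


# Variant loaded from an hg38 case; 140453136 stands in for a lifted hg19 position.
variant = add_build_ids(
    {"chromosome": "7", "position": 140753336, "reference": "A", "alternative": "T"},
    hg19_pos=140453136,
)
old_events = [{"subject": "7_140453136_A_T", "verb": "acmg", "content": "pathogenic"}]

print(variant["simple_id_38"], variant["simple_id"])
print(matching_old_events(variant, old_events))
```

Variants with no hg19 counterpart simply get no simple_id and match no old events, which is the "if found" caveat from the list above.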

I'm not really sure that lifting over just the events would solve your problems. It feels like a more complicated issue. But of course it would be nice to see if a suspect variant has already been shown to be pathogenic, even if in the old build. To answer your question: I don't really know. Other ideas?

bjhall commented 4 years ago

Thanks for your input @northwestwitch!

I agree this is a huge issue and unfortunately we're not really starting afresh either. We have 150-200 WGS and >500 WES cases on hg19. We've decided to ignore the WES cases, but we've rerun most of the WGS cases on hg38, mostly to build up artefact databases. We will probably not load these old cases into Scout though.

We're trying to be as pragmatic as possible, but one thing that we really don't want to lose is the body of knowledge built up in the Scout database, in the form of ACMG classifications and global comments. Hence this issue... Rerunning cases won't really solve this problem. At this point some sort of liftover of the events is required. But that would require the events to be genome_build version controlled. And I agree that the solution to this problem needs to be well thought out.

In Coyote (a quick and dirty Scout clone that I created years ago for our cancer analyses) I "solved" this problem by, in classifications/global comments, storing the HGVS notation of the variant in addition to the chromosomal location. This solution is not entirely unproblematic, but mostly solves the problem. It will probably be difficult to shoehorn this solution into Scout at this point though.
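To illustrate the idea (this is not Coyote's actual code), a minimal sketch where each classification stores an HGVS string next to its positional id. The field names `hgvs`, `position_id` and `genome_build` are assumptions; lookup prefers HGVS and falls back to the position within the same build:

```python
from typing import Dict, List, Optional


def find_classifications(
    store: List[Dict], hgvs: Optional[str], genome_build: str, position_id: str
) -> List[Dict]:
    """Match on HGVS first (build independent), then on position in the same build."""
    if hgvs:
        by_hgvs = [doc for doc in store if doc.get("hgvs") == hgvs]
        if by_hgvs:
            return by_hgvs
    return [
        doc
        for doc in store
        if doc.get("genome_build") == genome_build and doc.get("position_id") == position_id
    ]


# A classification made on an hg19 case, keyed by both position and HGVS.
store = [
    {
        "position_id": "7_140453136_A_T",
        "genome_build": "37",
        "hgvs": "NM_004333.4:c.1799T>A",
        "classification": "pathogenic",
    }
]

# The same variant seen in an hg38 case: the position differs, the HGVS string does not.
found = find_classifications(store, "NM_004333.4:c.1799T>A", "38", "7_140753336_A_T")
print([doc["classification"] for doc in found])
```

One of the reasons this is "not entirely unproblematic": the match is exact on the string, so a different transcript or transcript version for the same variant will not be found.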

moonso commented 4 years ago

Hi! Like you say, we are happy that you are encountering all of these problems before us :) Could we try to list exactly which ones need to be handled? Perhaps they will need different solutions when looking closely at them.

dnil commented 4 years ago

As @northwestwitch also says, we for our part regret not having switched yet. Your process now is very convenient for us, and admittedly somewhat less stressful than having live hg38 cases waiting in line. Still, we are steadily producing hg19 ones that will have to be dealt with eventually. 😅

HGVS is quite a decent genome-build-independent key, but will not work so well off the bat for SVs. Whole exon/gene del/dups might be supported though. ClinVar would be another such key, but with quite a bit more latency, and we may not want to submit all comments to ClinVar...

The liftOver card will almost inevitably be played at some point, even if it is imperfect. The question is really whether we do that once, on something like a BED/VCF dump of the db, or create an internal live query function. I'm sorely tempted to do the latter, but I think it is the virtues of laziness and hubris speaking. Separation of genome versions is the prudent and boring way to go to reduce work and complexity in the long run.

To stop ranting, I propose to update the scout writer functions to include user added variant level content, and implement read functions for the latter. This could be done via a flag to the serialize functions (#1708). Liftover could then be done once for your own hg19 data on migrating, and possibly every time you interchange data with less advanced centers using old genome builds (cough-cough).
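A sketch of what the one-time liftover over a dump could look like, assuming dumped events carry a subject of the form chrom_pos_ref_alt. `lift_events`, the `lifted_from` field and the toy converter are hypothetical; a real run would plug in an actual liftover tool as the converter:

```python
from typing import Callable, Dict, Iterable, List, Optional, Tuple

# A converter maps (chrom, pos) in hg19 to (chrom, pos) in hg38, or None if unmapped.
Converter = Callable[[str, int], Optional[Tuple[str, int]]]


def lift_events(
    events: Iterable[Dict], convert: Converter
) -> Tuple[List[Dict], List[Dict]]:
    """Lift hg19 event subjects to hg38; return (lifted, failed) for manual review."""
    lifted: List[Dict] = []
    failed: List[Dict] = []
    for event in events:
        # rsplit keeps chromosome names that contain underscores intact.
        chrom, pos, ref, alt = event["subject"].rsplit("_", 3)
        target = convert(chrom, int(pos))
        if target is None:
            failed.append(event)
            continue
        new_chrom, new_pos = target
        lifted.append(
            {
                **event,
                "subject": "_".join([new_chrom, str(new_pos), ref, alt]),
                "genome_build": "38",
                # Keep the original id so the liftover can be audited or undone.
                "lifted_from": event["subject"],
            }
        )
    return lifted, failed


# Toy converter backed by a lookup table; the coordinates are illustrative only.
TABLE = {("7", 140453136): ("7", 140753336)}


def toy_convert(chrom: str, pos: int) -> Optional[Tuple[str, int]]:
    return TABLE.get((chrom, pos))


dump = [
    {"subject": "7_140453136_A_T", "verb": "comment", "content": "known pathogenic"},
    {"subject": "1_12345_C_G", "verb": "comment", "content": "no hg38 counterpart here"},
]
lifted, failed = lift_events(dump, toy_convert)
print(len(lifted), "lifted,", len(failed), "failed")
```

Events that cannot be lifted are returned separately rather than dropped, so nothing from the existing body of comments and classifications is lost silently.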

Clinical-Genomics / scout

Make events genome build aware #1824