tommyli commented 3 years ago

Strategies for reloading probands as trios

In our MCRI Seqr instance, we have the following scenario.

Existing project with 102 probands/individuals (i.e. 102 families, each with one individual)
All are loaded as Genome Version GRCh37
All are exomes
Many tags/notes (saved variants) already exist for this project

Researchers now have data for some parents of these individuals and would like to analyse some of them as trios, around

We would like to consult your team on the best strategy for loading these.

Re-associating the existing individuals with the new individuals and grouping them as families should just be a matter of uploading a new pedigree file using Seqr's bulk edit individual functionality. However, how should we load the new VCFs? Using which genome version? Should we create a new index just for the parents and associate it with the same project? Joint call and reload as trios? What happens to the existing saved variant details?
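To make the pedigree step concrete, here is a minimal sketch that writes trio rows to a tab-separated file. The column names, the sex and affected-status values, the assumption that both parents are unaffected, and all IDs are illustrative only; check them against the template that the bulk edit dialog provides before uploading.

```python
import csv
import io

# Assumed column names; verify against the template offered by the bulk edit dialog.
COLUMNS = ["Family ID", "Individual ID", "Paternal ID", "Maternal ID", "Sex", "Affected Status"]


def trio_rows(family_id, proband_id, father_id, mother_id, proband_sex):
    """Return pedigree rows for one trio: proband first, then father, then mother.

    Parents are written as unaffected, which is an assumption for this sketch.
    """
    return [
        [family_id, proband_id, father_id, mother_id, proband_sex, "Affected"],
        [family_id, father_id, "", "", "Male", "Unaffected"],
        [family_id, mother_id, "", "", "Female", "Unaffected"],
    ]


def write_pedigree(trios):
    """Serialise a list of trio dicts to tab-separated text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(COLUMNS)
    for trio in trios:
        writer.writerows(trio_rows(**trio))
    return buffer.getvalue()


# Hypothetical IDs for illustration only.
example = write_pedigree([
    {"family_id": "FAM001", "proband_id": "P001", "father_id": "P001_F",
     "mother_id": "P001_M", "proband_sex": "Female"},
])
print(example)
```

Keeping the proband's existing Family ID and Individual ID unchanged in the new file is probably the safest choice, since the saved variants are attached to those records.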

Here's what we have in mind.

Option 1 - keep as GRCh37

Reload parents and associated proband into new, single ES index
Ideally, the three samples are joint called into a single VCF and loaded as a new, single ES index (per trio)
Even better would be to merge all new parents and associated probands into a single VCF and load it as a new, single ES index (all trios)

Option 2 - reload trios as GRCh38

Joint call probands and parents together and reload as GRCh38 into new ES index. All trios will now be in one index whilst other probands will still be part of old index.
Use lift_project_to_hg38.py to migrate saved variants to GRCh38. May need to modify this to only include applicable probands that are forming new trios.

Option 3 - re-create and reload whole project as GRCh38

Joint call all 102 probands and parents together and reload as GRCh38 into new ES index. All individuals and trios will now be in one index.
Use lift_project_to_hg38.py to liftover all saved variants.

We were thinking of Option 2. Any comments and/or suggestions?

Questions regarding lift_project_to_hg38.py script:

The --es-index argument is expecting the new ES index in GRCh38 and NOT the old index, right?
How reliable is this admin script in practice?

hanars commented 3 years ago

seqr does not support having multiple genome versions within the same project, so option 2 is not really going to work for you. If you want to try to modify the code to allow projects to work with multiple genome versions you can, but I strongly recommend against that.

You also need all family members to exist within the same index for inheritance search to work. So in option 1, you would have to make a GRCh37 index with the parents AND their children in it.
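The same-index requirement can be checked before loading. The sketch below is not seqr code: it assumes you can export two plain mappings yourself (sample to family, and sample to the index it is loaded in), and the index names are made up.

```python
from collections import defaultdict


def families_split_across_indices(family_of, index_of):
    """Return families whose members are not all loaded in one index.

    family_of: dict mapping sample ID -> family ID
    index_of:  dict mapping sample ID -> index name (samples absent here count as not loaded)
    Result:    {family_id: {index_name_or_None: [sample IDs]}}
    """
    by_family = defaultdict(lambda: defaultdict(list))
    for sample_id, family_id in family_of.items():
        by_family[family_id][index_of.get(sample_id)].append(sample_id)
    return {
        family_id: {index: sorted(ids) for index, ids in indices.items()}
        for family_id, indices in by_family.items()
        if len(indices) > 1 or None in indices
    }


# Hypothetical data: the proband sits in the old index, the parents in a new one.
family_of = {"P001": "FAM001", "P001_F": "FAM001", "P001_M": "FAM001", "P002": "FAM002"}
index_of = {
    "P001": "old_grch37",
    "P001_F": "parents_grch37",
    "P001_M": "parents_grch37",
    "P002": "old_grch37",
}
problems = families_split_across_indices(family_of, index_of)
print(problems)
```

Here FAM001 is reported because its members are spread over two indices, which is the situation in which inheritance search would not work.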

Option 3 is the approach we used when we started getting new samples aligned to GRCh38, and in general we find there's better disk space usage with one index anyway.

The TL;DR is you are going to need to create one index with parents and children together, regardless of which genome version you use. In theory, the pipeline can take a comma-separated list of VCFs, so you could do that with the 2 existing VCFs without needing to joint call again. However, the callset AF will be more accurate if you joint call them all together, so it's up to you how much you rely on that data.
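One way the callset AF difference can arise is sketched below. It assumes that a site absent from one of the separately called VCFs ends up as a missing genotype for those samples, rather than as homozygous reference; whether the seqr pipeline behaves this way is not stated in this thread, so treat it as an illustration of the general effect only.

```python
def allele_frequency(genotypes):
    """Return alt allele count / called allele count for diploid genotypes.

    Each genotype is the number of alt alleles (0, 1 or 2), or None if missing.
    Missing genotypes contribute to neither the numerator nor the denominator.
    """
    called = [gt for gt in genotypes if gt is not None]
    allele_number = 2 * len(called)
    if allele_number == 0:
        return None
    return sum(called) / allele_number


# Hypothetical trio at one site: proband heterozygous, both parents homozygous reference.
joint_called = [1, 0, 0]             # joint calling emits explicit hom-ref calls for the parents
combined_separately = [1, None, None]  # parents' VCF has no record at this site (assumption)

af_joint = allele_frequency(joint_called)            # 1 alt allele out of 6 called
af_combined = allele_frequency(combined_separately)  # 1 alt allele out of 2 called
print(af_joint, af_combined)
```

Under that assumption the combined callset overstates the frequency, because samples that are actually reference drop out of the denominator.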

In terms of which genome version to make your new index in, that's really up to you. For us, all our new data is being delivered as GRCh38, so once that started happening we decided it was easier to switch the project genome versions over. If you don't anticipate getting any new samples for this project, or you anticipate getting them aligned to GRCh37, I don't see a really good reason to switch. But if you do anticipate getting new data on GRCh38, especially if it's data that you would want to add incrementally (i.e. some new probands/trios that are unrelated to your current data set and that you won't joint call with the existing data), then I would recommend switching your project to GRCh38 now, as lift_project_to_hg38.py requires an index containing all the samples in the project.

Notes on lift_project_to_hg38.py:

Yes, the --es-index argument is expecting the new ES index in GRCh38
We haven't used the script in a while, but we added pretty robust unit tests to it in the hope that they would keep us from breaking it, and those are all still passing. When I wrote it and used it originally, it was incredibly reliable.
The script will be unable to lift some small number of variants. This is anticipated: it has to do with the fact that calls change (e.g. we had a couple of cases where something was called as 2 mutations 1bp apart instead of a single 3bp mutation, or stuff like that). Our policy was to go through and try to do searches to find any variants that didn't automatically lift if they had an important tag (like a discovery tag), but not if they were flagged for review or excluded. But that's really up to your team.
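The triage policy described above could be sketched roughly as follows. This is not part of lift_project_to_hg38.py: the input is assumed to be a list of unlifted variants that you have collected yourself, and the variant IDs and tag names are hypothetical placeholders for whatever your project uses.

```python
def triage_unlifted(unlifted, important_tags):
    """Split unlifted saved variants into those worth a manual search and the rest.

    unlifted:       list of dicts with "variant_id" and "tags" (list of tag names)
    important_tags: set of tag names that justify searching for the variant by hand
    Returns (manual_search_ids, skipped_ids), each in input order.
    """
    manual, skipped = [], []
    for variant in unlifted:
        if set(variant["tags"]) & important_tags:
            manual.append(variant["variant_id"])
        else:
            skipped.append(variant["variant_id"])
    return manual, skipped


# Hypothetical variants and tag names for illustration only.
unlifted = [
    {"variant_id": "1-1000-A-G", "tags": ["discovery"]},
    {"variant_id": "2-2000-C-T", "tags": ["review"]},
    {"variant_id": "3-3000-G-A", "tags": ["excluded"]},
    {"variant_id": "4-4000-T-C", "tags": ["review", "discovery"]},
]
manual, skipped = triage_unlifted(unlifted, important_tags={"discovery"})
print(manual, skipped)
```

A variant carrying both an important tag and a review tag lands in the manual list here; that tie-break is a choice made for this sketch, not something the thread specifies.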

tommyli commented 3 years ago

Thanks for your elaborate and helpful response @hanars, I'm sure we'll refer to this issue all the time whenever we hit tricky loading scenarios.

I've updated Option 1 to be more clear that parents and proband must belong to one index.

Regarding callset AF, our projects are typically small, so we have our default filters configured with callset AF set to 1 (i.e. ignored). I take your point though: if we joint call everything in the project, then some projects will have somewhat useful callset data to refer to.

I will further consult our researchers and decide on what to do. We may have more questions down the track. Either way, we'll update here on what we end up doing.
