broadinstitute / seqr

web-based analysis tool for rare disease genomics
GNU Affero General Public License v3.0
176 stars 88 forks source link

Strategies for reloading probands as trios #1831

Closed tommyli closed 3 years ago

tommyli commented 3 years ago

Strategies for reloading probands as trios

In our MCRI Seqr instance, we have the following scenario.

Researchers now have data for some parents of these individuals and would like to analyse some of them as trios, around

We would like to consult your team on best strategy to load these in?

Re-associating individuals with new individuals and grouping as family should just be a matter of uploading a new pedigree file using Seqr's bulk edit individual functionality. However, how should we load the new VCFs in? Using which genome version? Creating new index just for the parents and associating to same project? Joint call and reload again as trios? What about existing saved variant details?

Here's what we have in mind.

Option 1 - keep as GRCh37

Option 2 - reload trios as GRCh38

Option 3 - re-create and reload whole project as GRCh38

We were thinking of Option 2. Any comments and/or suggestions?

Questions regarding lift_project_to_hg38.py script:

  1. The --es-index argument is expecting the new ES index in GRCh38 and NOT the old index right?
  2. How reliable is this admin script in practise?
hanars commented 3 years ago

seqr does not support having multiple genome versions within the same project, so option 2 is not really going to work for you. If you want to try to modify the code to allow projects to work with multiple genome versions you can, but I strongly recommend against that.

You also need all family members to exist within the same index for inheritance search to work. So in option 1, you would have to make a GRCh37 index with the parents AND their children in it.

Option 3 is the approach we have used when we started getting new samples in aligned to GRCh38, and in general we find theres better disk space usage with one index anyways.

The TL;DR is you are going to need to create one index with parents and children together, regardless of which genome version you use. In theory, the pipeline can take a comma separated list of VCFs, so you can theoretically do that with the 2 existing VCFs without needing to joint call again. However, the callset AF will be more accurate if you joint call them all together, so up to you how much you rely on that data.

In terms of which genome version to make your new index in, thats really up to you. For us, all our new data is being delivered as GRCh38 so once that started happening we decided it was easier to switch the project genome versions over. If you don't anticipate getting any new samples for this project, or you anticipate getting them aligned to GRCh37, I don't see a real good reason to switch. But if you do anticipate getting new data in on GRCh38, especially if its data that you would want to add incrementally (i.e. some new probands/ trios that are unrelated to your current data set that you won't joint call with the existing data) then I would recommend switching your project to GRCh38 now, as lift_project_to_hg38.py requires an index containing all the samples in the project.

Notes on lift_project_to_hg38.py:

tommyli commented 3 years ago

Thanks for your elaborate and helpful response @hanars, I'm sure we'll refer to this issue all the time whenever we hit tricky loading scenarios.

I've updated Option 1 to be more clear that parents and proband must belong to one index.

Regarding callset AF, our projects are typically small so we have our default filters configured callset AF to 1 (i.e. ignored). I take your point though, if we joint call everything in the project then some projects will have somewhat useful callset data to refer to.

I will further consult our researchers and decide on what to do. We may have more questions down the track. Either way, we'll update here on what we end up doing.