arq5x / gemini

a lightweight db framework for exploring genetic variation.
http://gemini.readthedocs.org
MIT License
318 stars 120 forks source link

expand gene_wise functionality (comp_het, etc) #576

Open brentp opened 9 years ago

brentp commented 9 years ago

Currently, gene_wise simply counts variants in families, doesn't do any phasing. If we made it handle phasing as needed for comp_hets, then we could use the gene_wise machinery to drive all of the existing inheritance tools... It would also aid in looking for cases where we have a somatic variant in one sample and a germ-line in the corresponding blood sample. This will be a large change, would be good to understand all the use-cases. See @jxchong 's comment on #529

jxchong commented 8 years ago

Add ability to simultaneously identify families that have homozygous variants in a gene AND families that are compound het for variants in the same gene.

brentp commented 8 years ago

I'm thinking the way to do the comp_het-like thing with gene-wise is to add

--phase-family $family_id

that can be specified multiple times. For each time that was specified, there'd be 2 additional columns $family-members and $family-gts which would be comma-delimited e.g.

mom,dad,kid   G/T,G/G,T|G
mom,dad,kid   A/A,A/C,A|C

from there, it should be fairly straight-forward to add a column for each row (and for each family) that indicates when an affected got the alternate allele from different parents. e.g. assuming we have variant_ids 1,2,3 for the 3 rows respectively.

mom,dad,kid   G/T,G/G,T|G   3
mom,dad,kid   C/C,C/C,C/C
mom,dad,kid   A/A,A/C,A|C   1

So, it will output the same rows, but the user may decide to limit only to rows/genes for which there is a phase-pair.

This could get noisy if they request, e.g. 4 families but only 1 of them actually has a HET for the proband.

... Brainstorming...

So maybe phase-family should be $family_id:$filter so that $family will be phased if $filter number passes, e.g:

gemini gene_wise  \
    --columns "gene, chrom, start, end, ref, alt, impact, impact_severity" \
    --gt-filter "(gt_types.1_dad == HOM_REF and gt_types.1_kid == HET and gt_types.1_mom == HOM_REF)" \
    --gt-filter "(gt_types.1_dad == HET and gt_types.1_kid == HET and gt_types.1_mom == HET)" \
   --phase-family family_1:1,2

So in this case, it will phase family 1 if either filter passes?

This is getting complicated to understand and document. Probably should just leave --phase-family $family_id and let the user deal with the extra columns when they have a lot of families they want phased.

jxchong commented 8 years ago

This isn't a super helpful comment because I don't currently have a better solution but:

I prefer the first one. The problem is that (if I understand it correctly), the first one doesn't really complete the comphet analysis in that the user still has to write extra, non-trivial (not just grep) code to identify genes with 2+ variants. The user's script would have to iterate through the whole list of variants, store the list of genes and phase-state of all the variants in each gene and return the genes in which there are 2+ variants with desired phase state. If you have to write such a script, it's not clear to me if you really save any time vs just doing separate autosomal_recessive and comp_het analyses and writing a script to combine the results from the two to identify overlapping candidate genes.

brentp commented 8 years ago

So, instead user specifies --phase-family $family_id as many times as they like and a single extra column indicates all the family:variant combinations where the alternate allele is from a different parent than the current. e.g.

family_1:33|family_2:21,33

then the user just has to grep for that column being non-empty.

The problem remains that, we want to be able to write something like:

--gt-filter "(gt_types.dad == HOM_REF and gt_types.mom == HET and gt_types.kid == HET ) \
           | (gt_types.dad == HET and gt_types.mom == HOM_REF and gt_types.kid == HET ) "
--gt-filter "gt_types.kid == HOM_ALT"

and have that apply across all families without having to type it out for each family.

This would be simple for a trio, but would require some kind of templating to work for more complex families.