Open brentp opened 9 years ago
Add ability to simultaneously identify families that have homozygous variants in a gene AND families that are compound het for variants in the same gene.
I'm thinking the way to do the comp_het-like thing with gene-wise is to add
--phase-family $family_id
that can be specified multiple times. For each time that was specified, there'd be 2 additional columns
$family-members
and $family-gts
which would be comma-delimited e.g.
mom,dad,kid G/T,G/G,T|G
mom,dad,kid A/A,A/C,A|C
from there, it should be fairly straight-forward to add a column for each row (and for each family) that indicates when an affected got the alternate allele from different parents. e.g. assuming we have variant_ids 1,2,3 for the 3 rows respectively.
mom,dad,kid G/T,G/G,T|G 3
mom,dad,kid C/C,C/C,C/C
mom,dad,kid A/A,A/C,A|C 1
So, it will output the same rows, but the user may decide to limit only to rows/genes for which there is a phase-pair.
This could get noisy if they request, e.g. 4 families but only 1 of them actually has a HET for the proband.
... Brainstorming...
So maybe phase-family should be $family_id:$filter
so that $family will be phased if $filter number passes, e.g:
gemini gene_wise \
--columns "gene, chrom, start, end, ref, alt, impact, impact_severity" \
--gt-filter "(gt_types.1_dad == HOM_REF and gt_types.1_kid == HET and gt_types.1_mom == HOM_REF)" \
--gt-filter "(gt_types.1_dad == HET and gt_types.1_kid == HET and gt_types.1_mom == HET)" \
--phase-family family_1:1,2
So in this case, it will phase family 1 if either filter passes?
This is getting complicated to understand and document. Probably should just leave --phase-family $family_id and let the user deal with the extra columns when they have a lot of families they want phased.
This isn't a super helpful comment because I don't currently have a better solution but:
I prefer the first one. The problem is that (if I understand it correctly), the first one doesn't really complete the comphet analysis in that the user still has to write extra, non-trivial (not just grep) code to identify genes with 2+ variants. The user's script would have to iterate through the whole list of variants, store the list of genes and phase-state of all the variants in each gene and return the genes in which there are 2+ variants with desired phase state. If you have to write such a script, it's not clear to me if you really save any time vs just doing separate autosomal_recessive and comp_het analyses and writing a script to combine the results from the two to identify overlapping candidate genes.
So, instead user specifies --phase-family $family_id as many times as they like and a single extra column indicates all the family:variant combinations where the alternate allele is from a different parent than the current. e.g.
family_1:33|family_2:21,33
then the user just has to grep for that column being non-empty.
The problem remains that, we want to be able to write something like:
--gt-filter "(gt_types.dad == HOM_REF and gt_types.mom == HET and gt_types.kid == HET ) \
| (gt_types.dad == HET and gt_types.mom == HOM_REF and gt_types.kid == HET ) "
--gt-filter "gt_types.kid == HOM_ALT"
and have that apply across all families without having to type it out for each family.
This would be simple for a trio, but would require some kind of templating to work for more complex families.
Currently, gene_wise simply counts variants in families, doesn't do any phasing. If we made it handle phasing as needed for comp_hets, then we could use the gene_wise machinery to drive all of the existing inheritance tools... It would also aid in looking for cases where we have a somatic variant in one sample and a germ-line in the corresponding blood sample. This will be a large change, would be good to understand all the use-cases. See @jxchong 's comment on #529