Open dgaston opened 11 years ago
Hi Dan,
These are good suggestions, thanks. Certainly we can try to conform the output of the compoundhet tool to be more like the auto* tools. That said, I am currently revamping those tools to, by default, report all columns in the variants table. One can use a --columns option to list just the columns one wishes. Moreover, in response to your second point, the new versions will allow --filter and --gt-filter to enable one to filter variants based on the equivalent of a WHERE clause and genotype filters, respectively. In effect, this will be the same as applying a query to restrict candidate variants in the way you describe.
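As a rough illustration of what "the equivalent of a WHERE clause" means here, the sketch below composes a SELECT statement from a column list and a filter string. It is not GEMINI's code: build_query, the toy table, and its rows (including GENE_X) are made up for the example, and it assumes the variants table lives in a SQLite database.

```python
import sqlite3


def build_query(columns=None, filter_expr=None, table="variants"):
    """Compose a SELECT from an optional column list and WHERE-style filter.

    Hypothetical helper for illustration only. The filter text is pasted
    into the SQL as-is, which is acceptable for a local command-line tool
    but unsafe for untrusted input.
    """
    select = columns if columns else "*"  # default: report all columns
    query = f"SELECT {select} FROM {table}"
    if filter_expr:
        query += f" WHERE {filter_expr}"
    return query


# Toy in-memory table standing in for the real variants table (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE variants "
    "(chrom TEXT, start INTEGER, gene TEXT, impact_severity TEXT)"
)
conn.executemany(
    "INSERT INTO variants VALUES (?, ?, ?, ?)",
    [
        ("chr1", 153302826, "PGLYRP4", "LOW"),
        ("chr1", 153314152, "PGLYRP4", "MED"),
        ("chr1", 1000, "GENE_X", "HIGH"),  # made-up row
    ],
)

query = build_query("chrom, start, gene", "impact_severity = 'HIGH'")
rows = conn.execute(query).fetchall()
print(query)
print(rows)
```

With no arguments, build_query() falls back to SELECT * FROM variants, mirroring the "report all columns by default" behaviour described above.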
As for the -q option you mention, what were you envisioning for the custom queries beyond changing the columns or limiting the rows that were output?
Hi Aaron,
No, I was just thinking of a -q option being a quick and easy way of allowing SQL queries in the same way as the existing query function. But your idea of a --filter flag and --columns seems better in terms of making the tool more friendly for non database people. I added -q to my own version of the code on my local machine as a quick fix for my own personal needs.
I haven't written formal docs or tests for this yet, but last night I pushed changes that refine the output of comp_hets and allow one to select specific columns to report with --columns (in addition to some core, required fields). Also, like the new changes to de_novo, auto_rec and auto_dom, these changes include a --filter option to impose additional criteria on the reported variants.
The most notable change is that the two variants from each comp_het are reported on separate lines for consistency with the other tools. The comp_het_id can be used to track which two variants represent a given comp_het.
$ gemini comp_hets sms.100000.vcf.db --ignore-phasing --columns "chrom, start, end, ref, alt, gene, impact, impact_severity, in_dbsnp" | head
family sample comp_het_id chrom start end ref alt gene impact impact_severity in_dbsnp num_het
1 SMS173 1 chr1 153302826 153302827 T G PGLYRP4 UTR_3_prime LOW 1 1
1 SMS173 1 chr1 153314152 153314153 C A PGLYRP4 non_syn_coding MED 1 1
1 SMS173 2 chr1 153302826 153302827 T G PGLYRP4 UTR_3_prime LOW 1 1
1 SMS173 2 chr1 153315578 153315579 A G PGLYRP4 synonymous_coding LOW 1 1
1 SMS173 3 chr1 153302826 153302827 T G PGLYRP4 UTR_3_prime LOW 1 1
1 SMS173 3 chr1 153320371 153320372 T G PGLYRP4 non_syn_coding MED 1 1
1 SMS173 4 chr1 153314152 153314153 C A PGLYRP4 non_syn_coding MED 1 1
1 SMS173 4 chr1 153302826 153302827 T G PGLYRP4 UTR_3_prime LOW 1 1
1 SMS173 5 chr1 153314152 153314153 C A PGLYRP4 non_syn_coding MED 1 1
$ gemini comp_hets sms.100000.vcf.db --ignore-phasing --columns "chrom, start, end, ref, alt, gene, impact, impact_severity, in_dbsnp" | wc -l
2213
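To make the pairing concrete, here is a minimal Python sketch (not part of GEMINI) that regroups the per-line output by comp_het_id. It assumes the output is tab-delimited with a header row, and that a comp_het_id is only meaningful within a given family and sample; pair_comp_hets is a hypothetical helper name. The sample rows are the first four data lines shown above.

```python
import csv
from collections import defaultdict


def pair_comp_hets(lines):
    """Group comp_hets rows by (family, sample, comp_het_id).

    Assumes tab-delimited text with a header row (an assumption about
    the output format, not something confirmed by the tool's docs).
    """
    reader = csv.DictReader(lines, delimiter="\t")
    pairs = defaultdict(list)
    for row in reader:
        key = (row["family"], row["sample"], row["comp_het_id"])
        pairs[key].append(row)
    return pairs


RAW = """\
family sample comp_het_id chrom start end ref alt gene impact impact_severity in_dbsnp num_het
1 SMS173 1 chr1 153302826 153302827 T G PGLYRP4 UTR_3_prime LOW 1 1
1 SMS173 1 chr1 153314152 153314153 C A PGLYRP4 non_syn_coding MED 1 1
1 SMS173 2 chr1 153302826 153302827 T G PGLYRP4 UTR_3_prime LOW 1 1
1 SMS173 2 chr1 153315578 153315579 A G PGLYRP4 synonymous_coding LOW 1 1
"""

# Re-create tab-delimited lines from the whitespace-aligned sample above.
lines = ["\t".join(line.split()) for line in RAW.splitlines()]

pairs = pair_comp_hets(lines)
for key, rows in pairs.items():
    print(key, [(r["chrom"], r["start"], r["impact"]) for r in rows])
```

In practice the lines would come from the command's standard output (for example, by piping it into a script that reads sys.stdin) rather than from an embedded string.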
Now, restrict to only those candidates impacting affected individuals.
$ gemini comp_hets sms.100000.vcf.db --ignore-phasing \
--columns "chrom, start, end, ref, alt, gene, impact, impact_severity, in_dbsnp" \
--only-affected | wc -l
1941
Now, further restrict to only the high impact variants among those candidates impacting affected individuals.
$ gemini comp_hets sms.100000.vcf.db --ignore-phasing \
--columns "chrom, start, end, ref, alt, gene, impact, impact_severity, in_dbsnp" \
--only-affected \
--filter "impact_severity = 'HIGH'" | wc -l
6
Is there currently a plan (or reason against) matching the output style of the compound het, pathway, and interaction tools to be more like the mendelian disease, de novo, and query outputs? Particularly for the compound het tool this seems to make sense.
Also would it be useful to allow a -q argument for the mendelian disease and de novo tools to override the default query with a custom query for custom reporting?
At least with the second option I could work up a quick solution. The first may take a bit more work. I think I am becoming familiar enough with the codebase and data structures to do it.