arq5x / gemini

a lightweight db framework for exploring genetic variation.
http://gemini.readthedocs.org
MIT License

Output of Compound Het, Pathway, and other tools, custom queries #173

Open dgaston opened 11 years ago

dgaston commented 11 years ago

Is there currently a plan (or reason against) matching the output style of the compound het, pathway, and interaction tools to be more like the mendelian disease, de novo, and query outputs? Particularly for the compound het tool this seems to make sense.

Also would it be useful to allow a -q argument for the mendelian disease and de novo tools to override the default query with a custom query for custom reporting?

At least with the second option I could work up a quick solution. The first may take a bit more work. I think I am becoming familiar enough with the codebase and data structures to do it.

arq5x commented 11 years ago

Hi Dan,

These are good suggestions, thanks. Certainly we can try to conform the output of the compoundhet tool to be more like the auto* tools. That said, I am currently revamping those tools to, by default, report all columns in the variants table. One can use a --columns option to list just the columns one wishes. Moreover, in response to your second point, the new versions will allow --filter and --gt-filter to enable one to filter variants based on the equivalent of a WHERE clause and genotype filters, respectively. In effect, this will be the same as applying a query to restrict candidate variants in the way you describe.

As for the -q option you mention, what were you envisioning for the custom queries beyond changing the columns or limiting the rows that were output?
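To make the correspondence concrete: the --columns and --filter options described above behave like the column list and the WHERE clause of a SELECT on the variants table. The following is only a sketch of that idea, not GEMINI's implementation. The `select_variants` helper, the reduced toy table, and the `GENE_X` row are all hypothetical and exist only for illustration.

```python
import sqlite3


def select_variants(conn, columns, where=None):
    """Hypothetical helper (NOT part of GEMINI): build a SELECT from a
    column list and an optional WHERE-style filter string."""
    # Plain string formatting is fine for a local sketch, but it is not
    # safe against SQL injection if the inputs come from untrusted users.
    sql = "SELECT {} FROM variants".format(columns)
    if where:
        sql += " WHERE " + where
    return conn.execute(sql).fetchall()


# Toy stand-in for the variants table, reduced to four columns.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE variants "
    "(chrom TEXT, start INTEGER, gene TEXT, impact_severity TEXT)"
)
conn.executemany(
    "INSERT INTO variants VALUES (?, ?, ?, ?)",
    [
        ("chr1", 153302826, "PGLYRP4", "LOW"),
        ("chr1", 153314152, "PGLYRP4", "MED"),
        ("chr1", 1000, "GENE_X", "HIGH"),  # made-up row for illustration
    ],
)

# Analogous to: --columns "chrom, start, gene" --filter "impact_severity = 'HIGH'"
rows = select_variants(conn, "chrom, start, gene", "impact_severity = 'HIGH'")
print(rows)
```

Only the rows that satisfy the filter survive, and only the requested columns are returned, which is the same effect a custom query limited to column selection and row restriction would have.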

dgaston commented 11 years ago

Hi Aaron,

No, I was just thinking of a -q option being a quick and easy way of allowing SQL queries in the same way as the existing query function. But your idea of a --filter flag and --columns seems better in terms of making the tool more friendly for non database people. I added -q to my own version of the code on my local machine as a quick fix for my own personal needs.

arq5x commented 11 years ago

I haven't written formal docs or tests for this yet, but last night I pushed changes that refine the output of comp_hets and let one select specific columns to report with --columns (in addition to some core, required fields). Also, like the new changes to de_novo, auto_rec, and auto_dom, these changes include a --filter option to impose additional criteria on the reported variants.

The most notable change is that the two variants from each comp_het are reported on separate lines for consistency with the other tools. The comp_het_id can be used to track which two variants represent a given comp_het.

$ gemini comp_hets sms.100000.vcf.db --ignore-phasing --columns "chrom, start, end, ref, alt, gene, impact, impact_severity, in_dbsnp" | head
family  sample  comp_het_id  chrom  start      end        ref  alt  gene     impact             impact_severity  in_dbsnp  num_het
1       SMS173  1            chr1   153302826  153302827  T    G    PGLYRP4  UTR_3_prime        LOW              1         1
1       SMS173  1            chr1   153314152  153314153  C    A    PGLYRP4  non_syn_coding     MED              1         1
1       SMS173  2            chr1   153302826  153302827  T    G    PGLYRP4  UTR_3_prime        LOW              1         1
1       SMS173  2            chr1   153315578  153315579  A    G    PGLYRP4  synonymous_coding  LOW              1         1
1       SMS173  3            chr1   153302826  153302827  T    G    PGLYRP4  UTR_3_prime        LOW              1         1
1       SMS173  3            chr1   153320371  153320372  T    G    PGLYRP4  non_syn_coding     MED              1         1
1       SMS173  4            chr1   153314152  153314153  C    A    PGLYRP4  non_syn_coding     MED              1         1
1       SMS173  4            chr1   153302826  153302827  T    G    PGLYRP4  UTR_3_prime        LOW              1         1
1       SMS173  5            chr1   153314152  153314153  C    A    PGLYRP4  non_syn_coding     MED              1         1
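Since each comp_het now spans two lines, downstream scripts need to regroup rows by comp_het_id. Here is a minimal sketch of that regrouping. It assumes the output is tab-delimited and that comp_het_id is only guaranteed to be meaningful within a family and sample, so it keys on all three fields. Neither assumption is confirmed in this thread, and `group_comp_hets` is a hypothetical helper, not part of GEMINI.

```python
import csv
import io
from collections import defaultdict


def group_comp_hets(tsv_text):
    """Hypothetical helper: collect the variants that share a comp_het_id.

    Assumes tab-delimited text with a header line, as in the sample above.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    groups = defaultdict(list)
    for row in reader:
        # Key on family and sample too, in case ids repeat across samples.
        key = (row["family"], row["sample"], row["comp_het_id"])
        groups[key].append(
            (row["chrom"], int(row["start"]), int(row["end"]), row["gene"])
        )
    return dict(groups)


# The first four data rows from the output above, reduced to a few columns.
header = ["family", "sample", "comp_het_id", "chrom", "start", "end", "gene"]
rows = [
    ["1", "SMS173", "1", "chr1", "153302826", "153302827", "PGLYRP4"],
    ["1", "SMS173", "1", "chr1", "153314152", "153314153", "PGLYRP4"],
    ["1", "SMS173", "2", "chr1", "153302826", "153302827", "PGLYRP4"],
    ["1", "SMS173", "2", "chr1", "153315578", "153315579", "PGLYRP4"],
]
sample_tsv = "\n".join("\t".join(fields) for fields in [header] + rows)

pairs = group_comp_hets(sample_tsv)
for key, variants in sorted(pairs.items()):
    print(key, variants)
```

Note that piping through head can cut a pair in half (comp_het_id 5 above has only one row), so a real script should check that each group contains two variants before treating it as a complete comp_het.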

$ gemini comp_hets sms.100000.vcf.db --ignore-phasing --columns "chrom, start, end, ref, alt, gene, impact, impact_severity, in_dbsnp" | wc -l

2213

Now, restrict to only those candidates observed in affected individuals.

$ gemini comp_hets sms.100000.vcf.db --ignore-phasing \
              --columns "chrom, start, end, ref, alt, gene, impact, impact_severity, in_dbsnp" \
              --only-affected | wc -l

1941

Now, further restrict those candidates in affected individuals to high-impact variants only.

$ gemini comp_hets sms.100000.vcf.db --ignore-phasing \
              --columns "chrom, start, end, ref, alt, gene, impact, impact_severity, in_dbsnp" \
              --only-affected \
              --filter "impact_severity = 'HIGH'" | wc -l

6