brentp commented 8 years ago

via @jxchong cc @arq5x A user may find a variant that is not itself associated with disease, but is in a gene known to be associated with the disease. There is no way to store or query this info in gemini at this time.

clinvar is an obvious choice and omim seems to be available to download for academic (http://omim.org/downloads).

Design

need a way to do this in a relatively generic fashion as we think about switching to support any organism.
need a simple query to do this even though we'd probably be storing in a table that's many-to-1 relative to gene_detailed (?)
could provide an example where user can use gemini annotate and smash all associations into a single "|" delimited string for every variant in that gene and it would be a single column in the variants table.
likely better to have a "|" delimited string associated with each gene and require a join on gene_name.

jxchong commented 8 years ago

For humans, OMIM is certainly where I check first.

For extending to other organisms, I suspect Monarch Initiative is the place to start (e.g. Exomiser uses model organism phenotype data curated by Monarch Initiative). I would guess you would want to be able to limit the "associated phenotype" to only those phenotypes that are actually known to be caused by mutations in the gene (vs GWAS signals, etc -- or at least a gene associated with a phenotype by GWAS is a different level of evidence/information than a gene in which mutations cause a Mendelian phenotype).

brentp commented 8 years ago

One way to do this currently would be to create a bed file with a pipe delimited disease list for each gene and then use gemini annotate. We do want a more direct means, but that would be a general solution that's available in current gemini.

jxchong commented 8 years ago

Yeah I have an old file I made from OMIM that I'll use for now. Actually it occurs to me that this is the more general solution for this is also applicable to integrating pathway info -- I know pathways is currently a separate tool within Gemini, but ideally when doing variant filtration, you'd want to be able to output the pathway/KEGG/etc information for the gene that the variant is affecting.

jxchong commented 8 years ago

Oh hey! I think this appears to be implemented in VEP v82 with the VEP option --gene_phenotype Indicates if the overlapped gene is associated with a phenotype, disease or trait. See list of phenotype sources. Not used by default

This appears to be multi-species. List of phenotype sources: http://uswest.ensembl.org/info/genome/variation/sources_phenotype_documentation.html

I have not had time to test yet though.

jxchong commented 8 years ago

According to the VEP devs, --gene_phenotype combined with --fields PHENO,GENE_PHENO should result in two columns (PHENO and GENE_PHENO) which ==1 if the variant or gene, respectively, are linked to a phenotype or ==0 if not linked to a phenotype. However there's some sort of issue in the cache files for VEPv82 that's making these lookup values always = 0 so I still can't test it yet. ;)

brentp commented 8 years ago

is there an additional column that indicates the phenotype? if so, this wont require any change to gemini.

jxchong commented 8 years ago

Apparently not... they said:

The phenotype name does not get stored in the output. Firstly, many phenotype names are long and contain odd characters that can break file format encoding. Secondly, many genes and/or variants have several phenotypes associated with them, so to list all of them (and multiple times in the case where you are reporting e.g. the same gene multiple times) would cause the output file size to increase hugely.

I'm debating whether to argue that it should be an option to output the phenotype name (though the above response makes me think they're reluctant to add it)

brentp commented 8 years ago

hrm. without that, it's not very helpful and with it, it'd be very helpful... and the more disease annotations they get, the bigger that divide in helpfulness becomes.

jxchong commented 8 years ago

I emailied the VEP dev mailing list to ask. You have a good point, right now a "1" tells me almost nothing -- for humans I'd already have to search nearly 20 databases to see which DB is the source of the "1" and it might have been wasted effort if the phenotype isn't even relevant.

jxchong commented 8 years ago

Will McLaren on the ENSEMBL dev list replied:

Hi Jessica,

Here's a way you can get the phenotype names in VEP without me writing any new code!

1) Create a BED file from Ensembl's database of phenotype annotations. This will take a few minutes:

mysql -hensembldb.ensembl.org -uanonymous -Dhomo_sapiens_variation_82_38 -e'select sr.name, pf.seq_region_start-1,pf.seq_region_end, concat("\"", concat_ws(":", pf.type, s.name, pf.object_id, p.description), "\"") from seq_region sr, source s, phenotype p, phenotype_feature pf where pf.seq_region_id = sr.seq_region_id and pf.source_id = s.source_id and pf.phenotype_id = p.phenotype_id' | grep -v concat_ws | sort -k1,1 -k2,2n -k3,3n | bgzip -c phenotypes.bed.gz

2) Index it with tabix:

tabix -p bed phenotypes.bed.gz

3) Use it as a custom annotation in VEP (see http://www.ensembl.org/info/docs/tools/vep/script/vep_custom.html):

echo "rs533747784" | perl variant_effect_predictor.pl -force -cache -custom phenotypes.bed.gz,phenotypes,bed,overlap

Then have fun parsing the output :-)

You could change the MySQL query above to limit it to one source of data, or remove the structural variants for example (there are a lot of them and they tend to overlap a significant portion of the genome).

dgaston commented 8 years ago

This would also be a good method to use the resulting BED file either as a custom annotation source with snpEff/snpSift, or just with GEMINI itself using gemini annotate. This is very useful thanks!

On Tue, Nov 10, 2015 at 1:43 PM, Jessica Chong notifications@github.com wrote:

Will McLaren on the ENSEMBL dev list replied:

Hi Jessica,

Here's a way you can get the phenotype names in VEP without me writing any new code!

1) Create a BED file from Ensembl's database of phenotype annotations. This will take a few minutes:

mysql -hensembldb.ensembl.org -uanonymous -Dhomo_sapiens_variation_82_38 -e'select sr.name, pf.seq_region_start-1,pf.seq_region_end, concat("\"", concat_ws(":", pf.type, s.name, pf.object_id, p.description), "\"") from seq_region sr, source s, phenotype p, phenotype_feature pf where pf.seq_region_id = sr.seq_region_id and pf.source_id = s.source_id and pf.phenotype_id = p.phenotype_id' | grep -v concat_ws | sort -k1,1 -k2,2n -k3,3n | bgzip -c phenotypes.bed.gz

2) Index it with tabix:

tabix -p bed phenotypes.bed.gz

3) Use it as a custom annotation in VEP (see http://www.ensembl.org/info/docs/tools/vep/script/vep_custom.html):

echo "rs533747784" | perl variant_effect_predictor.pl -force -cache -custom phenotypes.bed.gz,phenotypes,bed,overlap

Then have fun parsing the output :-)

You could change the MySQL query above to limit it to one source of data, or remove the structural variants for example (there are a lot of them and they tend to overlap a significant portion of the genome).

— Reply to this email directly or view it on GitHub https://github.com/arq5x/gemini/issues/571#issuecomment-155506728.

brentp commented 8 years ago

That is useful. I couldn't figure out the invocation for 82_37 ...

arq5x commented 8 years ago

Current plan is to use VEP "annotation dump" to support this.

brentp commented 8 years ago

another couple of sources are: http://ctdbase.org/downloads/ and: http://www.disgenet.org/web/DisGeNET/menu/dbinfo#stats

both of which seem to have licenses that would allow us to use them in gemini

jxchong commented 8 years ago

FYI when using VEP (v83+) with --gene_phenotype to add information on whether a variant is in a gene that has a phenotype associated with it (resulting GEMINI column is vep_gene_pheno), or --check-existing to add whether a specific position has been observed as having a variant (resulting GEMINI column is vep_pheno), VEP leaves those fields blank (so the column is currently imported into GEMINI as NULL).

here are some examples from running VEP v83 on some known pathogenic variants in CFTR:

VEP options were:
--sift b --polyphen b --symbol --numbers --biotype --total_length --canonical --ccds --hgvs --shift_hgvs 1 --gene_phenotype --regulatory --fields Consequence,Codons,Amino_acids,Gene,SYMBOL,Feature,EXON,PolyPhen,SIFT,Protein_position,BIOTYPE,CANONICAL,CCDS,HGVSc,HGVSp,PHENO,GENE_PHENO,CELLL_TYPE

7   117199533   rs213950    G   A   137788  PASS    AC=10;AF=0.584;AN=24;BaseQRankSum=-2.368;ClippingRankSum=-0.081;DB;DP=7276;FS=1.206;InbreedingCoeff=0.0956;MLEAC=125;MLEAF=0.584;MQ0=0;MQ=60;MQRankSum=-0.032;POSITIVE_TRAIN_SITE;QD=22.5;ReadPosRankSum=0.197;SOR=0.817;VQSLOD=8.12;culprit=MQ;CSQ=downstream_gene_variant|||ENSG00000232661|AC000111.3|ENST00000441019|||||antisense|YES||||1&1&1||,missense_variant|Gtg/Atg|V/M|ENSG00000001626|CFTR|ENST00000426809|10/26|benign(0.002)|tolerated(1)|440/1438|protein_coding|||ENST00000426809.1:c.1318G>A|ENSP00000389119.1:p.Val440Met|1&1&1|1|,missense_variant|Gtg/Atg|V/M|ENSG00000001626|CFTR|ENST00000454343|10/26|benign(0.001)|tolerated(0.88)|409/1419|protein_coding|||ENST00000454343.1:c.1225G>A|ENSP00000403677.1:p.Val409Met|1&1&1|1|,missense_variant|Gtg/Atg|V/M|ENSG00000001626|CFTR|ENST00000003084|11/27|benign(0.003)|tolerated(1)|470/1480|protein_coding|YES|CCDS5773.1|ENST00000003084.6:c.1408G>A|ENSP00000003084.6:p.Val470Met|1&1&1|1|,upstream_gene_variant|||ENSG00000001626|CFTR|ENST00000472848|||||processed_transcript|||||1&1&1|1|    GT:AD:DP:GQ:PL  0/0:7,0:7:21:0,21,242   0/1:9,11:20:99:292,0,229    0/0:7,0:7:21:0,21,248   1/1:0,26:26:78:865,78,0 0/1:35,24:59:99:589,0,941   0/1:23,25:48:99:688,0,675   0/0:22,0:22:66:0,66,714 0/1:21,14:35:99:367,0,590   0/0:22,0:22:62:0,62,736 1/1:0,22:22:66:732,66,0 0/1:13,15:28:99:430,0,341   0/1:15,22:37:99:642,0,395
7   117199644   rs199826652 ATCT    A   3996.41 PASS    AC=4;AF=0.023;AN=24;BaseQRankSum=-0.941;ClippingRankSum=-0.033;DB;DP=7203;FS=2.556;InbreedingCoeff=-0.0257;MLEAC=5;MLEAF=0.023;MQ0=0;MQ=60.36;MQRankSum=0.129;QD=21.37;ReadPosRankSum=0.673;SOR=0.921;VQSLOD=1.51;culprit=SOR;CSQ=downstream_gene_variant|||ENSG00000232661|AC000111.3|ENST00000441019|||||antisense|YES||||1||,inframe_deletion|aTCTtt/att|IF/I|ENSG00000001626|CFTR|ENST00000426809|10/26|||477-478/1438|protein_coding|||ENST00000426809.1:c.1431_1433delCTT|ENSP00000389119.1:p.Phe478del|1|1|,inframe_deletion|aTCTtt/att|IF/I|ENSG00000001626|CFTR|ENST00000454343|10/26|||446-447/1419|protein_coding|||ENST00000454343.1:c.1338_1340delCTT|ENSP00000403677.1:p.Phe447del|1|1|,inframe_deletion|aTCTtt/att|IF/I|ENSG00000001626|CFTR|ENST00000003084|11/27|||507-508/1480|protein_coding|YES|CCDS5773.1|ENST00000003084.6:c.1521_1523delCTT|ENSP00000003084.6:p.Phe508del|1|1|,upstream_gene_variant|||ENSG00000001626|CFTR|ENST00000472848|||||processed_transcript|||||1|1|,regulatory_region_variant|||||ENSR00000633245|||||open_chromatin_region|||||1||    GT:AD:DP:GQ:PL  0/0:4,0:4:9:0,9,135 0/0:10,0:10:24:0,24,360 0/0:3,0:3:5:0,5,86  0/1:10,12:22:99:441,0,474   0/1:9,18:27:99:729,0,424    0/0:23,0:23:63:0,63,945 0/0:22,0:22:66:0,66,714 0/0:33,0:33:84:0,84,1096    0/0:22,0:22:62:0,62,736 0/1:17,14:31:99:537,0,881   0/1:13,16:29:99:620,0,632   0/0:24,0:24:60:0,60,791

There's a little weirdness with them using 1&1 or 1&1&1 and so on for the PHENO field -- will update when I figure it out (I'm assuming it's indicating a positive answer for each of their data sources? but not sure).

jxchong commented 8 years ago

If PHENO==1&1&1 each ofthose correspond to an ID in VEP's Existing_variation field

Example:

Existing_variation == rs2691305&rs75062661&COSM4144171
PHENO == 0&0&1

This means that rs2691305 and rs75062661 are not associated with a phenotype, disease, or trait, but COSM4144171 is.

This is sort of like a super-set that includes clinvar_sig (since VEP includes ClinVar)

brentp commented 8 years ago

@jxchong any ideas on how to handle this gene_phenotype? I don't want to have to special-case the Existing_variation and PHENO keys. I guess if it gets added to the db as-is, a user could query as:

... AND pheno like '%1%'

but not sure I grok enough to understand how useful that is.

jxchong commented 8 years ago

I think both of these keys can be added as-is to the GEMINI database using GEMINI's current standard handling of extra fields. (except for minor issue that blank values are not being output by GEMINI as "None" mentioned https://github.com/arq5x/gemini/issues/639)

--gene_phenotype is straightforward, I think. GEMINI users can just query for .. AND vep_gene_pheno == 1

Personally I don't envision filtering on the results of VEP's PHENO field (vep_existing_variation or vep_pheno) because it's just too hard to determine automatically whether the variant being present in a database is even meaningful. It would be as bad as an idea (for gene discovery purposes) as filtering on in_dbsnp :) Instead I think I will just include those columns as part of the output for later manual/custom parsing (because having it there is still helpful and faster than searching each variant to see if it's present in any of the databases parsed by VEP).

arq5x / gemini

gene :: disease lookup. #571

Design