Closed naumenko-sa closed 4 years ago
Hi Sergey,
I brought this up in the past too. I've been a long time proponent of dumping duplicate annotations from vcfanno in favor of Ensembl annotations, since these seem to be better curated, up to date and require less work on the user end (as a whole, not specifically for GnomAD The argument used to be that we'd like to keep the annotations as close to the original GEMINI annotation set as possible. Since then, gemini as a default has been deprecated, so I suppose this is a good moment to look at the annotation strategy as a whole.
As it stands now, the users can 'pick and choose' which annotations they'd like to have with the vcfanno configs. I do suppose it makes sense to deprecate the gemini
vcfanno config in favor of separate configs for each part of the gemini
config.
I've also mentioned in the past that there's a lot of superflous annotating going on during the pipeline. For example, VEP is run twice on the same vcf file for some reason, which causes a whole lot of runtime overhead that can easily be avoided in my opinion.
Thanks for opening the discussion, I'm looking forward to see what the others have to say M
Hello Sergey, Matthias, I experienced similar issues as Sergey is pointing out. Apart of rethinking of decoupling gemini vcfanno from the custom annotations I would also suggest that very useful feature is making two separate columns in gemini variants table - one for ensembl_gene_id and another for entrez_id, because now VEP annotations uploaded to gemini - the ensembl and refseq identifiers are mixed under the same column named ensembl_gene_id , below is example:
sqlite> select variant_id, chrom, start, ref, alt, gene,transcript,ensembl_gene_id from variants where instr("HIGH",impact_severity) limit 50;
299|1|998581|G|C|RP11-465B22.3|ENST00000394517|ENSG00000217801 364|1|1246255|CGGCTCTGGGTCACAGGT|C|PUSL1|NM_001346116.1|126789 718|1|1669843|CCA|C|SLC35E2|ENST00000246421|ENSG00000215790 804|1|1887018|A|G|CFAP74|NM_001080484.1|85452 805|1|1887090|CG|C|CFAP74|NM_001080484.1|85452 806|1|1887110|GC|G|CFAP74|NM_001080484.1|85452 2106|1|5728198|A|C|RP11-154H17.1|ENST00000413887|ENSG00000236948 2107|1|5728199|C|A|RP11-154H17.1|ENST00000413887|ENSG00000236948 2278|1|5935161|A|T|NPHP4|ENST00000378156|ENSG00000131697
Also , along the same lines - it would be super-useful if we had identifier mappings pertaining to the specific VEP version ( for example for VEP GRCh37 v97 ) shipped with bcbio in flat file with columns symbol, ensg, entrez so that we could use it to link external annotations to genes. I am facing real difficulty now in matching these identifiers between hugo, ensembl biomart GRCh37 v97 , gene_info, gene2ensembl, gencode 19 and NCBI GRCh37.p13 interim genome GFF3 annotations. It would make so much easier if symbol, entrez and ensg part of VEP cache would be explicitly available as a flat file. Or maybe you could refer me where to start looking!
Hi Erinija,
The fact that the refseq and ensembl gene id's are mixed in the same field is because bcbio used the "merged" VEP cache. (see here)
If you like, you can parse the Ensembl cache version from the vcf header (also in the db).
Also keep in mind to check the variant_impacts
table for different transcripts per variant as annotated by VEP.
The annotations in the variants
table are only those for the transcript with the most severe consequence.
As a whole, I'm more inclined to look at the id's coming from the VEP, as opposed to other sources, since there might be some discrepancies between the sources due to version differences.
As for the flat file you're looking for, I suppose it shouldn't be that hard to export the required data from the sqlite and cross reference the version numbers from the vcf header and/or the provenance/data_versions.csv
file in your final
dir.
Hope this helps Cheers M
Hi Matthias,
Many thanks for clear and detailed answer. > there might be some discrepancies between the sources due to version differences. This is exactly what I am dealing with. I need to link the latest OMIM annotations to the variants annotated by gene ENSG /Entrez identifier provided by VEP for GRCh37. Some (really not an insubstantial number ~ 1000 ) OMIM records have assigned ENSG/Entrez identifiers that do not have matches in Ensembl GRCh37 v97 Biomart that should cover the GRCh37 build.
This is normal, we would expect that. It is simple to resolve these changed gene identifiers if the full ID set from VEP cache is available. The non-matching IDs can be traced and linked to their previous versions.
Ensembl GRCh37 v97 Biomart full identifier set accounts for all ENSG assigned to variants. That is clear. But it is not the case for RefSeq Entrez . I can't clarify what resource was used by VEP to annotated variants by Entrez IDs, A RefSeq source from which the cache was created.
In my VCF header the VEP part is
##VEP="v97" time="2019-09-01 17:19:35" cache="/projects/analysis/bcbio/genomes/Hsapiens/GRCh37/vep/homo_sapiens_merged/97_GRCh37" ensembl=97.378db18 ensembl-io=97.dc917e1 ensembl-variation=97.26a059c ensembl-funcgen=97.24f4d3c 1000genomes="phase3" COSMIC="86" ClinVar="201810" ESP="20141103" HGMD-PUBLIC="20174" assembly="GRCh37.p13" dbSNP="151" gencode="GENCODE 19" genebuild="2011-04" gnomAD="r2.1" polyphen="2.2.2" refseq="01_2015" regbuild="1.0" sift="sift5.2.2"
The specifics ensembl=97.378db18 ensembl-io=97.dc917e1 ensembl-variation=97.26a059c ensembl-funcgen=97.24f4d3c
do not give a hint about RefSeq. In here the GRCh37 RefSeq
source is pointed as 2015-01. Looking for the files in NCBI RefSeq's ftp site the sources are clear for GRCh38.p12 but not for GRCh37.p13, for which the closest would be RELEASE 106 . Having GENCODE 19 GFF3 and RefSeq GRCh37.p13 GFF3 I could resolve the identifiers, but not clear which RefSeq annotation release is in Ensemble homo_sapiens_merged VEP cache and where to look for this info.
Anyway, thank you for the leads that you provided and sorry taking scope of this discussion of focus.
Kind regards, Erinija
UPDATE I don't know if that may help anybody, but I strongly believe that the RefSeq version in VEP GRCh37 v97 homo_sapiens_merged, that I was providing example of, is RELEASE.105 . I checked the deprecated or changed EntrezIDs that are in ensembl_gene_id column of annotated variants and all of them were found in ref_GRCh37.p13_top_level.gff3.gz from here . So it is a release of a rather earlier date than 2015-01 . Only few of these EntrezIDs are found in a later top level GRCh37.p13 interim annotation gff3 . So I think I answered my question.
Hello, everyone!
Not exactly a bcbio issue, but a situation which is hard to catch when calling and prioritizing germline variants with bcbio/vep/vcfanno/gemini.
When using vep annotations and vcfanno config file which uses gnomad_exome and gnomad_genome vcf sources, we have two sources of GNOMAD frequencies: from VEP and from gnomad vcf files itself. These annotations could be different for the same variant.
https://gnomad.broadinstitute.org/variant/1-120572547-T-C?dataset=gnomad_r2_1 is filtered out in gnomad2.1, so in the gnomad vcf file (exomes and genomes) it will be absent (when using only PASS variants).
But VEP annotates this variant with gnomad_AF=0.44 http://grch37.ensembl.org/Tools/VEP
So, for this variant, there will be no annotations produced by vcfanno (which means empty fields in the gemini database), and gnomad_AF=0.44 from VEP in the gemini database. The default field(column) names are the same for some fields in VEP and vcfanno configs, or very similar, so the downstream analysis could mix them. And then you have allele count = 0 (from gnomad.vcf) and at the same time gnomad_AF=0.44 from VEP.
The easiest way to mitigate this issue is to use more specific names in vcfanno.config, i.e vcfanno_gnomad_af instead of gnomad_af for fields selected from gnomad.vcf.gz.
Sergey