exomiser / Exomiser

A Tool to Annotate and Prioritize Exome Variants
https://exomiser.readthedocs.io
GNU Affero General Public License v3.0
197 stars 54 forks source link

target, candidate variants are filtered out during exomiser #386

Open hongsilv opened 3 years ago

hongsilv commented 3 years ago

Hi. This is Sungeun. Thank you for the great software!

While using exomiser, I realized that some variants are filtered out, even though I know that they pass the variant filters. For example, vcf rows with the following variants (rs146768859, rs141509078) are present in the input vcf files, and I know they are coding region, nonsynonymous SNPs, and their allele frequency is below 0.1% . (+ I didn't set any inheritance filters for exomiser analysis) But when I run the exomiser, they are not present in the output vcf.

I tried two approaches such: 1) remove all variant filters in analysis yml file (except the AF filter)-> still they are filtered out 2) subset input vcf to include regions near SNP (10kb ~ 100kb), and run exomiser on that smaller input vcf -> SNPs nearby are all together filtered out.

Do you have any idea why these variants are filtered out, and how I can solve this?

(HG 19, rs146768859) chr4 | 186456560 | . | G | A | 46780.9 | PASS | AC=2;AF=0.03;AN=2;AS_BaseQRankSum=6.671;AS_FS=1.396;AS_InbreedingCoeff=0.3761;AS_MQ=60;AS_MQRankSum=0;AS_QD=18.95;AS_ReadPosRankSum=0.857;AS_SOR=0.619;BaseQRankSum=3.22;DP=24846;ExcessHet=0.1353;FS=1.396;InbreedingCoeff=0.3761;MLEAC=5;MLEAF=0.03;MQ=60;MQRankSum=0;QD=18.95;ReadPosRankSum=0.095;SOR=0.619 | GT:AD:DP:GQ:PL | 1/1:0,862:.:99:25451,2579,0

(HG19, rs141509078) chr15 | 72191053 | . | T | G | 1002.01 | PASS | AC=2;AF=0.018;AN=2;AS_BaseQRankSum=-1.055;AS_FS=0;AS_InbreedingCoeff=0.6493;AS_MQ=60;AS_MQRankSum=0;AS_QD=24.44;AS_ReadPosRankSum=-1.068;AS_SOR=0.968;BaseQRankSum=-1.019;DP=2402;ExcessHet=0.0402;FS=0;InbreedingCoeff=0.6493;MLEAC=3;MLEAF=0.018;MQ=60;MQRankSum=0;QD=24.44;ReadPosRankSum=-0.957;SOR=0.968 | GT:AD:DP:GQ:PL | 1/1:0,27:.:81:778,81,0

(Example of my yml file)

analysis:
  analysisMode: PASS_ONLY
  frequencySources:
  - THOUSAND_GENOMES
  - TOPMED
  - UK10K
  - ESP_AFRICAN_AMERICAN
  - ESP_EUROPEAN_AMERICAN
  - ESP_ALL
  - EXAC_AFRICAN_INC_AFRICAN_AMERICAN
  - EXAC_AMERICAN
  - EXAC_SOUTH_ASIAN
  - EXAC_EAST_ASIAN
  - EXAC_FINNISH
  - EXAC_NON_FINNISH_EUROPEAN
  - EXAC_OTHER
  - GNOMAD_E_AFR
  - GNOMAD_E_AMR
  - GNOMAD_E_EAS
  - GNOMAD_E_FIN
  - GNOMAD_E_NFE
  - GNOMAD_E_OTH
  - GNOMAD_E_SAS
  - GNOMAD_G_AFR
  - GNOMAD_G_AMR
  - GNOMAD_G_EAS
  - GNOMAD_G_FIN
  - GNOMAD_G_NFE
  - GNOMAD_G_OTH
  - GNOMAD_G_SAS
  genomeAssembly: hg19
  hpoIds:
  - HP:0002515
  - HP:0030234
  - HP:0030319
  - HP:0030192
  - HP:0003693
  - HP:0003326
  - HP:0003560
  inheritanceModes: {}
  pathogenicitySources:
  - POLYPHEN
  - MUTATION_TASTER
  - SIFT
  ped: null
  proband: null
  steps:
  #- variantEffectFilter:
      #remove:
      #- FIVE_PRIME_UTR_EXON_VARIANT
      #- FIVE_PRIME_UTR_INTRON_VARIANT
      #- THREE_PRIME_UTR_EXON_VARIANT
      #- THREE_PRIME_UTR_INTRON_VARIANT
      #- NON_CODING_TRANSCRIPT_EXON_VARIANT
      #- UPSTREAM_GENE_VARIANT
      #- INTERGENIC_VARIANT
      #- REGULATORY_REGION_VARIANT
      #- CODING_TRANSCRIPT_INTRON_VARIANT
      #- NON_CODING_TRANSCRIPT_INTRON_VARIANT
      #- DOWNSTREAM_GENE_VARIANT
  - frequencyFilter:
      maxFrequency: 1.0
  - pathogenicityFilter:
      keepNonPathogenic: true
  - inheritanceFilter: {}
  - omimPrioritiser: {}
  - hiPhivePrioritiser: {}
  vcf: /Volumes/exomiser/sample.NoAnno.indelMax70.cov5.chr4.10kb.vcf
outputOptions:
  numGenes: 0
  outputContributingVariantsOnly: false
  outputFormats:
  - HTML
  - TSV_GENE
  - TSV_VARIANT
  - VCF
  outputPrefix: results/v4/analysis_sample_10kb
hongsilv commented 3 years ago

When I set a less stringent maxFrequency filter (AF<10%), I realized that they were not filtered out, and their MAX_FREQUENCY values were annotated as rs146768859 - 1.2056% rs141509078 - 1.1757% which is much higher than that in gnomAD or 1KG.

Also, upon closer inspection, both variants had the same DBSNP_FREQ value of 0.0013980001 (in .variants.tsv).

image

I think there is some bug regarding to allele frequency annotation or filtering. How can we fix this?

damiansm commented 3 years ago

Hi Sungeun,

For the first one it looks like the issue is either in the way EXAC calculated the EAS frequency or we have a parsing error when we incorporated it. @julesjacobsen should be able to confirm it is 1.2% in the EXAC legacy data we have used. The simple solution here will be to just remove the EXAC_* sources from your yml file given this is all in GnomAD nowadays.

For the second example can you see which source is generating the 1.1757% frequency. Would be interested to know but again removing it as a source may be a route forward.

hongsilv commented 3 years ago

Hi @damiansm Thank you for your prompt reply. According to the html file, the frequency source was

gnomAD_G_EAS: 1.1757%,

However when I checked it out on gnomAD browser (v3.1, Genomes), its allele frequency was 0.8478% image

(link: https://gnomad.broadinstitute.org/variant/rs141509078?dataset=gnomad_r3)

Although the difference between 1.175% and 0.8478% is not so big, I am a little curious where this difference comes from. And for this case, my sample is from East Asians, and I'm not sure if its okay to remove gnomAD_G_EAS from frequency sources.

Again, thank you for the great software, and if available I'll be waiting for your reply!

damiansm commented 3 years ago

Hi Sungeun,

Can you confirm which version of the Exomiser databases you are using so we are sure which version of GnomAD went into our build. I suspect the frequency has flipped between the 1% threshold between GnomAD releases.

Best wishes Damian

On Fri, Nov 13, 2020 at 8:28 AM Sungeun Hong notifications@github.com wrote:

Hi @damiansm https://github.com/damiansm Thank you for your prompt reply. According to the html file, the frequency source was

gnomAD_G_EAS: 1.1757%,

However when I checked it out on gnomAD browser (v3.1, Genomes), its allele frequency was 0.8478% [image: image] https://user-images.githubusercontent.com/52628609/99045183-4c664f80-25d4-11eb-805a-4dc4d2637478.png (link: https://gnomad.broadinstitute.org/variant/rs141509078?dataset=gnomad_r3)

Although the difference between 1.175% and 0.8478% is not so big, I am a little curious where this difference comes from. And for this case, my sample is from East Asians, and I'm not sure if its okay to remove gnomAD_G_EAS from frequency sources.

Again, thank you for the great software, and if available I'll be waiting for your reply!

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/exomiser/Exomiser/issues/386#issuecomment-726601056, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABHO4PGLBCIZOYJUIUJTJ63SPTU3XANCNFSM4TQHFHWQ .

hongsilv commented 3 years ago

Hi Sungeun, Can you confirm which version of the Exomiser databases you are using so we are sure which version of GnomAD went into our build. I suspect the frequency has flipped between the 1% threshold between GnomAD releases. Best wishes Damian

Hi,@damiansm the database and exomiser version I'm currently using is as following:

exomiser version: 12.1.0 database version: 2007_hg19

If this problem is due to gnomAD releases, can you recommend a way I can deal with this?

damiansm commented 3 years ago

Hi Sungeon,

We will get back to you soon with a more detailed response while we check exactly where this difference came from. For now though the only way you can deal with it in your queries if you want to keep that variant is to up your MAF filtering from 1% to 1.25%. You will obviously get some more false positives though

Best wishes Damian

On Fri, Nov 13, 2020 at 9:38 AM Sungeun Hong notifications@github.com wrote:

Hi Sungeun, Can you confirm which version of the Exomiser databases you are using so we are sure which version of GnomAD went into our build. I suspect the frequency has flipped between the 1% threshold between GnomAD releases. Best wishes Damian … <#m5056933858742594577>

Hi, the database and exomiser version I'm currently using is as following:

exomiser version: 12.1.0 database version: 2007_hg19

If this problem is due to gnomAD releases, can you recommend a way I can deal with this?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/exomiser/Exomiser/issues/386#issuecomment-726658928, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABHO4PDI3K2QBMG6RD6LNGDSPT5BDANCNFSM4TQHFHWQ .

hongsilv commented 3 years ago

Okay, thanks for your effort to check, and again for the great software!

Best regards, Sungeun

julesjacobsen commented 3 years ago

Exomiser data hg19_2007

source rs141509078 rs146768859
THOUSAND_GENOMES 0.13980001 0.13980001
TOPMED 0.077250004 0.05415
ESP_EUROPEAN_AMERICAN - 0.0116
ESP_ALL - 0.0077
EXAC_AMERICAN 0.00865651 -
EXAC_EAST_ASIAN 0.67052025 1.2056044
EXAC_OTHER 0.11013216 -
GNOMAD_E_AMR 0.014899577 -
GNOMAD_E_EAS 0.7654836 0.70774865
GNOMAD_E_OTH 0.03652301 -
GNOMAD_E_SAS 0.003248863 -
GNOMAD_G_AFR 0.011457379 -
GNOMAD_G_EAS 1.1757426 0.80147964
GNOMAD_G_OTH - 0.10183299

You need to compare with gnomAD v2 which is what's being used in the Exomiser database. gnomAD v3 is genomes only and only called against hg38. Notice if you uncheck the genomes/exomes box in the link below the frequencies change.

https://gnomad.broadinstitute.org/variant/rs141509078?dataset=gnomad_r2_1 Genomic coordinates: 15-72191053-T-G (GRCh37)

Datasource AC AN AF (%)
v2.1.1 exomes EAS 145 18390 0.7885
v2.1.1 genomes EAS 18 1556 1.157
v3 genomes EAS 44 5190 0.8478

Here you can see that, on its own the genomes EAS has a frequency of 1.157% gnomAD add and remove samples and change their filtering with each release so the sample size and hence the frequency will change between releases.

The Exomiser hg19_2007 source data was downloaded from:

http://ftp.ensembl.org/pub/data_files/homo_sapiens/GRCh37/variation_genotype/ExAC.0.3.GRCh37.vcf.gz http://ftp.ensembl.org/pub/data_files/homo_sapiens/GRCh37/variation_genotype/gnomad.exomes.r2.0.1.sites.noVEP.vcf.gz http://ftp.ensembl.org/pub/data_files/homo_sapiens/GRCh37/variation_genotype/gnomad.genomes.r2.0.1.sites.noVEP.vcf.gz

$ tabix  http://ftp.ensembl.org/pub/data_files/homo_sapiens/GRCh37/variation_genotype/ExAC.0.3.GRCh37.vcf.gz 15:72191053-72191053

15      72191053        rs141509078     T       G       182588.64       PASS    AC=60;AC_AFR=0;AC_AMR=1;AC_Adj=60;AC_EAS=58;AC_FIN=0;AC_Het=60;AC_Hom=0;AC_NFE=0;AC_OTH=1;AC_SAS=0;AF=4.942e-04;AN=121412;AN_AFR=10402;AN_AMR=11552;AN_Adj=121358;AN_EAS=8650;AN_FIN=6612;AN_NFE=66726;AN_OTH=908;AN_SAS=16508;...

$ tabix http://ftp.ensembl.org/pub/data_files/homo_sapiens/GRCh37/variation_genotype/gnomad.exomes.r2.0.1.sites.noVEP.vcf.gz 15:72191053-72191053
15      72191053        rs141509078     T       G       373313.05       PASS    AC=140;AF=5.69411e-04;AN=245868;BaseQRankSum=-3.24100e+00;ClippingRankSum=1.71000e-01;DB;DP=9317132;FS=0.00000e+00;InbreedingCoeff=6.00000e-03;MQ=5.93500e+01;MQRankSum=-1.90000e-02;QD=1.34900e+01;ReadPosRankSum=6.93000e-01;SOR=7.02000e-01;VQSLOD=3.67000e+00;VQSR_culprit=MQ;VQSR_POSITIVE_TRAIN_SITE;GQ_HIST_ALT=0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|140;DP_HIST_ALT=0|0|0|0|0|1|0|1|1|1|4|2|2|4|4|2|2|3|2|0;AB_HIST_ALT=0|0|0|0|0|0|1|2|15|53|44|16|8|0|1|0|0|0|0|0;GQ_HIST_ALL=41|26|24|49|66|38|70|79|44|87|109|83|310|107|306|142|440|129|603|120383;DP_HIST_ALL=33|126|174|201|413|608|7577|45743|32994|13527|5923|3386|2016|1535|1219|1045|794|753|693|673;AB_HIST_ALL=0|0|0|0|0|0|1|2|15|53|44|16|8|0|1|0|0|0|0|0;AC_AFR=0;AC_AMR=5;AC_ASJ=0;AC_EAS=132;AC_FIN=0;AC_NFE=0;AC_OTH=2;AC_SAS=1;AC_Male=68;AC_Female=72;AN_AFR=15300;AN_AMR=33558;AN_ASJ=9844;AN_EAS=17244;AN_FIN=22290;AN_NFE=111376;AN_OTH=5476;AN_SAS=30780;

$ tabix http://ftp.ensembl.org/pub/data_files/homo_sapiens/GRCh37/variation_genotype/gnomad.genomes.r
2.0.1.sites.noVEP.vcf.gz 15:72191053-72191053
15      72191053        rs141509078     T       G       5998.03 PASS    AC=20;AF=6.45953e-04;AN=30962;AC_AFR=1;AC_AMR=0;AC_ASJ=0;AC_EAS=19;AC_FIN=0;AC_NFE=0;AC_OTH=0;AC_Male=14;AC_Female=6;AN_AFR=8728;AN_AMR=838;AN_ASJ=302;AN_EAS=1616;AN_FIN=3494;AN_NFE=15004;AN_OTH=980;AN_Male=17112;AN_Female=13850;AF_AFR=1.14574e-04;AF_AMR=0.00000e+00;AF_ASJ=0.00000e+00;AF_EAS=1.17574e-02;AF_FIN=0.00000e+00;AF_NFE=0.00000e+00;AF_OTH=0.00000e+00;AF_Male=8.18139e-04;AF_Female=4.33213e-04;GC_AFR=4363,1,0;GC_AMR=419,0,0;GC_ASJ=151,0,0;GC_EAS=789,19,0;GC_FIN=1747,0,0;GC_NFE=7502,0,0;GC_OTH=490,0,0;GC_Male=8542,14,0;GC_Female=6919,6,0;AC_raw=20;AN_raw=30992;AF_raw=6.45328e-04;GC_raw=15476,20,0;GC=15461,20,0;AC_POPMAX=19;AN_POPMAX=1616;AF_POPMAX=1.17574e-02

Extracting the EAS data gives us these values:

Datasource AC AN AF (%) (AF = (AC/AN) *100)
ExAC EAS 58 8650 0.67
v2.0.1 exomes EAS 132 17244 0.765
v2.1.1 exomes EAS 145 18390 0.7885
v2.0.1 genomes EAS 19 1616 1.176
v2.1.1 genomes EAS 18 1556 1.157
v3 genomes EAS 44 5190 0.8478

Which matches what you get from the Exomiser hg19_2007 dataset, so I don't see any errors here.

A possible solution would be that if you're analysing an exome it might be better to only use exome sources as the number of samples used in the gnomAD genomes GRCh37 dataset was relatively small which will lead to inaccurate/ higher frequencies.