arq5x / gemini

a lightweight db framework for exploring genetic variation.
http://gemini.readthedocs.org
MIT License

Wrongly reported gnomad_num_het #895

Closed. mmoisse closed this issue 6 years ago

mmoisse commented 6 years ago

I noticed that for variants that are multiallelic in the gnomAD VCF, the number of heterozygous variants is always reported as 0. I believe this is a consequence of the GC_Male and GC_Female INFO fields being missing from the parsed gnomAD VCF files at multiallelic loci. The GC_Male and GC_Female INFO fields are still present in the original gnomAD VCF, but they are gone after vt decomposition (https://github.com/atks/vt/issues/87). Since the number of heterozygous variants is calculated from GC_Male and GC_Female, it is wrongly reported as 0 at these multiallelic positions.
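
To make the failure mode concrete, here is a minimal sketch of how a het count could be derived from the two fields. It is not GEMINI's actual code: the function name `num_het_from_gc` is hypothetical, and it assumes each field holds three genotype counts in the order hom-ref, het, hom-alt for a decomposed, biallelic record. It returns None when a field is absent, so a missing annotation cannot be mistaken for a true count of 0.

```python
from typing import Mapping, Optional, Sequence


def num_het_from_gc(info: Mapping[str, Sequence[int]]) -> Optional[int]:
    """Sum the heterozygous counts from GC_Male and GC_Female.

    Assumption: each field is [hom_ref, het, hom_alt] for a biallelic record.
    Returns None (unknown) if either field is missing or has another shape.
    """
    total = 0
    for key in ("GC_Male", "GC_Female"):
        counts = info.get(key)
        if counts is None or len(counts) != 3:
            # Field dropped (e.g. during decomposition): report unknown, not 0.
            return None
        total += counts[1]  # index 1 is the heterozygous genotype count
    return total
```

With this shape, a record that lost both fields yields None instead of silently contributing a het count of 0.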

oleraj commented 6 years ago

Any update on this issue? This is something we have run into as well. Maybe we could update the VCF using the fixed vt referenced above?

brentp commented 6 years ago

I'll update the gnomad exomes stuff and ping this issue when it's up. thanks for the reminder and thanks @mmoisse for tracking down the problem.

oleraj commented 6 years ago

@brentp

Thanks for updating the exome file.

We're getting an error when we try to run install-data.py:

wget failed with non-zero exit code 8. Retrying
--2018-08-16 14:34:31--  https://s3.amazonaws.com/gemini-annotations/gnomad.exomes.r2.0.2.sites.no-VEP.nohist.tidy.vcf.gz
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.216.104.13
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.216.104.13|:443... connected.
HTTP request sent, awaiting response... 403 Forbidden
2018-08-16 14:34:31 ERROR 403: Forbidden.

wget failed with non-zero exit code 8. Retrying
Traceback (most recent call last):
  File "install-data.py", line 171, in <module>
    install_annotation_files(args.anno_dir, args.dl_files, args.extra)
  File "install-data.py", line 106, in install_annotation_files
    to_dl, anno_dir, cur_config)
  File "install-data.py", line 124, in _download_anno_files
    cur_config.get("versions", {}).get(orig, 1))
  File "install-data.py", line 152, in _download_to_dir
    raise ValueError("Failed to download with wget")
ValueError: Failed to download with wget

Does the permission need to be updated?

brentp commented 6 years ago

sorry about that. can you try again?

oleraj commented 6 years ago

That worked, thanks. We had some other warnings/errors in our installation using the master branch, which @ponomarevsy posted here. Not sure if these are critical.

zhanhuizhang commented 5 years ago

@brentp
Thanks for updating gemini to v0.30.1. The problem has occurred again: 'GC_Male' and 'GC_Female' are not found in gnomad.exomes.r2.1.tidy.bcf, and as a consequence gnomad_num_het is always reported as 0.

Also, could you include 'popmax' and 'AF_popmax' for gnomad_v2.1?
popmax: Allele frequency information for the outbred population with the highest frequency. This excludes Finns, Ashkenazi Jewish and “Other” populations.

Thanks.

brentp commented 5 years ago

@zhanhuizhang I have pushed a fix for this to master, would you give it a try? You'll have to reload your database. thanks for reporting.

zhanhuizhang commented 5 years ago

@brentp Thanks for the quick fix. 'gnomad_num_het' and 'gnomad_num_hom_alt' are now correct, but gnomad_popmax_af is always -1. Perhaps GEMINI should read 'AF_popmax' from the gnomad_v2.1.bcf rather than 'popmax_AF'. THANKS!
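
A minimal sketch of the kind of lookup that would avoid this, assuming the INFO fields are available as a dictionary. It is not GEMINI's actual loader: the function name `popmax_af` and the `MISSING` constant are hypothetical, and the -1 sentinel simply mirrors the value reported above. The lookup tries 'AF_popmax' first and falls back to 'popmax_AF', so either key spelling resolves to the annotated frequency.

```python
from typing import Mapping, Sequence

MISSING = -1.0  # sentinel for "no value found", mirroring the -1 seen above


def popmax_af(
    info: Mapping[str, object],
    keys: Sequence[str] = ("AF_popmax", "popmax_AF"),
) -> float:
    """Return the popmax allele frequency under the first key that is present."""
    for key in keys:
        value = info.get(key)
        if value is not None:
            return float(value)
    return MISSING  # neither key spelling exists in this record
```

If only one spelling is looked up and the file uses the other, every record falls through to the sentinel, which matches the "always -1" symptom.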

brentp commented 5 years ago

sorry about that. I just pushed a fix. Thanks for noticing.

zhanhuizhang commented 5 years ago

That worked ~~~ :)