chapmanb / cloudbiolinux

CloudBioLinux: configure virtual (or real) machines with tools for biological analyses
http://cloudbiolinux.org
MIT License

GnomAD annotations #254

Closed matthdsm closed 6 years ago

matthdsm commented 7 years ago

Hi Brad,

We're very interested in adding GnomAD annotations to the pipeline. Do you know of a resource which can provide these annotations in both GRCh37 and hg38 coordinates? It's a bummer this isn't natively available from Broad.

I saw there was a new dbNSFP release yesterday which includes GnomAD annotations, but before jumping on that train I'd like to get the opinion of the community and see if there are any alternatives available. Alternatively, it'd be great if you know anyone who could point me to such resources.

Feel free to add anyone who might be able to chime in.

Thanks a lot M

matthdsm commented 7 years ago

I checked with Ensembl on the status of GnomAD annotations in the VEP cache, since they had stated their intention to incorporate them. Their answer:

Hi Matthias

You're correct that the original intention was to bring the gnomAD allele
frequencies into release 89. We rely on the data being submitted into dbSNP
before we can pull it through. We had assumed that the gnomAD data would be
fully submitted into dbSNP in time, which unfortunately it was not. For release
90, which is coming out next week, we will have gnomAD frequencies for all
pre-existing dbSNP variants in the database and in the VEP cache. However,
while the frequencies for existing variants have come through, the novel
variants haven't yet - we're still waiting on that.

All the best

Emily
Ensembl helpdesk

I'll be on vacation next week, but I'll make sure to update VEP and ggd recipes when I get back.

Cheers M

chapmanb commented 7 years ago

Matthias; Thanks much for this discussion. Incorporating through VEP makes a lot of sense. I don't know of an hg38 version of gnomad that folks are generally using. There is one in ANNOVAR that was prepared, I assume, through LiftOver from the gnomad GRCh37 release, but I don't know if it's available outside of ANNOVAR. The full gnomad release itself is quite large so having a pre-processed version makes a lot of sense. Thanks again.

matthdsm commented 7 years ago

Hi Brad,

Could you perhaps point me to the gnomad resources for hg19/hg38 from ANNOVAR? I can't find anything mentioning this in the manual. Perhaps we might be able to use the prepared files from ANNOVAR in vcfanno? I suppose GnomAD for GRCh37 is actually a non-issue, since it's included in GEMINI v0.20.0, but we're going to need an analog for hg38 if we want the annotations to be consistent across genome builds.

It really is too bad so many institutions are still stuck on GRCh37, while the 38 build is already several years old.

Cheers M

chapmanb commented 7 years ago

Matthias; I'm not sure if the ANNOVAR resources are publicly available, but they're listed in their documentation:

https://github.com/WGLab/doc-ANNOVAR/blob/master/user-guide/download.md#--for-filter-based-annotation

If you explore with that team and they make them available it would be great to coordinate this with our vcfanno/GEMINI preparations. Thanks again for looking into it.

brentp commented 6 years ago

did an hg38 version of gnomad ever make it into cloudbiolinux?

matthdsm commented 6 years ago

Hi Brent,

I suppose so; I think it only needed to be added as a datatarget to bcbio.

Cheers M

On 5 Dec 2017, at 20:11, Brent Pedersen <notifications@github.com> wrote:

did an hg38 version of gnomad ever make it into cloudbiolinux?


chapmanb commented 6 years ago

It did, sorry for leaving this issue open. We're using the version from Ensembl and mapping the chromosome names over to the chr1 style that we're trying to stick with for hg38:

https://github.com/chapmanb/cloudbiolinux/blob/master/ggd-recipes/hg38/gnomad.yaml
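
For readers unfamiliar with the chromosome mapping mentioned above, here is a minimal Python sketch of the idea: renaming Ensembl-style contigs (1, 2, X, MT) to chr-prefixed names (chr1, chr2, chrX, chrM) in VCF lines. It is only an illustration, not the code used in the gnomad.yaml recipe. The function names `to_ucsc` and `remap_vcf_line` are hypothetical, and the sketch assumes that only the primary chromosomes and MT need renaming, leaving every other contig untouched.

```python
import re

# Primary chromosomes that simply gain a "chr" prefix (assumption of this sketch).
PRIMARY = {str(i) for i in range(1, 23)} | {"X", "Y"}

# Matches the contig ID inside a "##contig=<ID=...>" VCF header line.
_CONTIG_HEADER = re.compile(r"^(##contig=<ID=)([^,>]+)")


def to_ucsc(contig):
    """Map an Ensembl-style contig name to the chr-prefixed style."""
    if contig in PRIMARY:
        return "chr" + contig
    if contig == "MT":
        return "chrM"
    # Unplaced and alternate contigs are left as-is in this sketch.
    return contig


def remap_vcf_line(line):
    """Rename the contig in one VCF line (header or record)."""
    if line.startswith("##contig="):
        return _CONTIG_HEADER.sub(
            lambda m: m.group(1) + to_ucsc(m.group(2)), line
        )
    if line.startswith("#"):
        # Other header lines carry no contig name to rewrite.
        return line
    chrom, sep, rest = line.partition("\t")
    return to_ucsc(chrom) + sep + rest
```

Applied line by line to a decompressed VCF stream, this would rewrite both the `##contig` header entries and the CHROM column. A real recipe would also need to re-sort and re-index the output against the target reference.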

matthdsm commented 6 years ago

I've also just added the vcfanno config for gnomad to the ggd recipes; this was a small piece that was still missing. ref #260

Cheers M

pfpjs commented 6 years ago

Hi all,

I've been unable to get bcbio to install the gnomAD genome annotations, using the latest dev version. There seems to be no --datatarget gnomad option, and I can't seem to trigger using the gnomad.yaml ggd recipe.

How have you been able to install?

Thanks!

chapmanb commented 6 years ago

Paulo; Thanks much for the helpful feedback and sorry about the issues. The gnomad downloads weren't linked to any download target, so they didn't have full bcbio integration. I added these to the set of files downloaded with gemini so now if you do:

--datatarget gemini

it should make them available along with ExAC, ESP and other inputs.

Please let us know if you run into any problems and thanks again.

pfpjs commented 6 years ago

Brad,

Many thanks, the download is now working, but there are still some problems.

First, I upgraded by running the following:

bcbio_nextgen.py upgrade -u development --tools --tooldir=/path/to/tools/ --genomes hg19 --genomes GRCh37 --genomes hg38 --genomes hg38-noalt --aligners bwa --aligners bowtie2 --aligners rtg --aligners twobit --data --datatarget variation --datatarget gemini --datatarget cadd --datatarget vep --datatarget dbnsfp --datatarget rnaseq --datatarget smallrna --datatarget dbscsnv --cores 24

It tried to run the hg19 gnomad GGD recipe, which is a symlink to GRCh37's gnomad.yaml, but failed because that recipe references ../seq/GRCh37.fa.fai, which could not be found:

https://github.com/chapmanb/cloudbiolinux/blob/c7c41c2634d044c60abae2c2264ff7e9b6885485/ggd-recipes/GRCh37/gnomad.yaml#L14

Output:

Running GGD recipe: hg19 gnomad 2.0.1
--2017-12-27 10:51:34--  ftp://ftp.ensembl.org/pub/data_files/homo_sapiens/GRCh37/variation_genotype/gnomad.genomes.r2.0.1.sites.noVEP.vcf.gz
           => ‘-’
Resolving ftp.ensembl.org (ftp.ensembl.org)... 193.62.193.8
Connecting to ftp.ensembl.org (ftp.ensembl.org)|193.62.193.8|:21... connected.
decompose v0.5

options:     input VCF file        -
         [s] smart decomposition   true (experimental)
         [o] output VCF file       -

Logging in as anonymous ... > gsort version 0.0.6
Logged in!
==> SYST ... done.    ==> PWD ... done.
==> TYPE I ... done.  ==> CWD (1) /pub/data_files/homo_sapiens/GRCh37/variation_genotype ... done.
==> SIZE gnomad.genomes.r2.0.1.sites.noVEP.vcf.gz ... 27547061638
==> PASV ... done.    ==> RETR gnomad.genomes.r2.0.1.sites.noVEP.vcf.gz ... done.
Length: 27547061638 (26G) (unauthoritative)

          gnomad.ge   0%[                    ]       0  --.-KB/s               2017/12/27 10:51:35 open ../seq/GRCh37.fa.fai: no such file or directory
sed: couldn't write 648 items to stdout: Broken pipe

I then removed --genomes hg19 from the command line to get it to skip directly to GRCh37, and another error popped up:

Running GGD recipe: GRCh37 gnomad 2.0.1
--2017-12-27 11:58:24--  ftp://ftp.ensembl.org/pub/data_files/homo_sapiens/GRCh37/variation_genotype/gnomad.genomes.r2.0.1.sites.noVEP.vcf.gz
           => ‘-’
Resolving ftp.ensembl.org (ftp.ensembl.org)... decompose v0.5

options:     input VCF file        -
         [s] smart decomposition   true (experimental)
         [o] output VCF file       -

> gsort version 0.0.6
193.62.193.8
Connecting to ftp.ensembl.org (ftp.ensembl.org)|193.62.193.8|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done.    ==> PWD ... done.
==> TYPE I ... done.  ==> CWD (1) /pub/data_files/homo_sapiens/GRCh37/variation_genotype ... done.
==> SIZE gnomad.genomes.r2.0.1.sites.noVEP.vcf.gz ... 27547061638
==> PASV ... done.    ==> RETR gnomad.genomes.r2.0.1.sites.noVEP.vcf.gz ... done.
Length: 27547061638 (26G) (unauthoritative)

              gnoma   0%[                    ]  58.98M  2.46MB/s    eta 2h 53m 2017/12/27 11:58:48 unknown chromosome: chr1 (known: map[21:20 GL000234.1:44 GL000240.1:47 GL000237.1:53 GL000204.1:55 GL000228.1:60 GL000205.1:72 5:4 14:13 MT:24 GL000197.1:35 GL000244.1:41 GL000238.1:42 GL000208.1:57 GL000225.1:82 2:1 10:9 19:18 20:19 GL000246.1:37 GL000191.1:58 GL000223.1:75 GL000203.1:36 GL000249.1:38 GL000199.1:68 GL000200.1:79 GL000193.1:80 15:14 GL000229.1:27 GL000196.1:39 GL000232.1:45 GL000236.1:48 GL000241.1:49 GL000218.1:64 GL000217.1:69 GL000212.1:77 18:17 X:22 GL000209.1:63 GL000195.1:76 9:8 GL000235.1:31 GL000215.1:71 3:2 Y:23 GL000248.1:40 17:16 22:21 GL000207.1:25 GL000226.1:26 GL000243.1:50 GL000233.1:54 GL000222.1:78 16:15 GL000201.1:32 GL000214.1:61 GL000219.1:73 8:7 12:11 GL000230.1:52 GL000213.1:66 7:6 GL000245.1:34 GL000198.1:56 GL000221.1:62 GL000192.1:83 11:10 13:12 GL000206.1:46 GL000227.1:59 GL000211.1:67 6:5 GL000210.1:29 GL000239.1:30 GL000247.1:33 GL000202.1:43 GL000224.1:74 GL000194.1:81 1:0 4:3 GL000231.1:28 GL000242.1:51 GL000220.1:65 GL000216.1:70])
sed: couldn't write 811 items to stdout: Broken pipe

This one failed, I think, because the recipe tries to remap chromosomes from 1 2 3 ... X Y to chr1 chr2 chr3 ... chrX chrY, which is not needed for GRCh37, only for hg19:

https://github.com/chapmanb/cloudbiolinux/blob/c7c41c2634d044c60abae2c2264ff7e9b6885485/ggd-recipes/GRCh37/gnomad.yaml#L13

Thank you for looking into this! -- Paulo
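
The two failures above both come down to contig names in the source VCF not matching the reference's .fai index. The following Python sketch shows one way to pick the right name per build by consulting the index instead of always remapping. It is a hedged illustration only, not how the GGD recipe or gsort works. The helper names `read_fai_contigs` and `choose_contig_name` are hypothetical, and it assumes the .fai is the usual tab-separated samtools index with the contig name in the first column.

```python
def read_fai_contigs(fai_text):
    """Return contig names from the text of a .fai index (first column)."""
    return [
        line.split("\t", 1)[0]
        for line in fai_text.splitlines()
        if line.strip()
    ]


def choose_contig_name(source_name, fai_contigs):
    """Pick the name the reference actually uses for a source contig.

    Keeps the name when the reference already knows it (GRCh37 style),
    otherwise tries the chr-prefixed form (hg19/hg38 style).
    """
    known = set(fai_contigs)
    if source_name in known:
        return source_name
    prefixed = "chrM" if source_name == "MT" else "chr" + source_name
    if prefixed in known:
        return prefixed
    raise KeyError("unknown chromosome: %s" % source_name)


# Usage with two tiny, made-up index excerpts (lengths are illustrative only).
grch37_fai = "1\t249250621\t52\t60\t61\nMT\t16569\t100\t70\t71\n"
hg19_fai = "chr1\t249250621\t6\t50\t51\nchrM\t16571\t100\t50\t51\n"

print(choose_contig_name("1", read_fai_contigs(grch37_fai)))
print(choose_contig_name("1", read_fai_contigs(hg19_fai)))
```

With a check like this, a recipe shared between GRCh37 and hg19 through a symlink would leave names untouched for GRCh37 and add the prefix only when the reference index calls for it.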

chapmanb commented 6 years ago

Paulo; Thanks much for testing this and for the feedback. Apologies for the issues with the recipes. I've pushed fixes that should resolve the install issues, and also swapped this to be a separate install rather than included with gemini. The data size makes it a bit impractical to include along with other GEMINI associated data. If you update bcbio to the latest development version (bcbio_nextgen.py upgrade -u development) and then re-run the data upgrade with --datatarget gnomad added, I hope it will work cleanly for you. Please let us know if you run into any other issues.