WGSA annotation for noncoding variants in WGS studies

naumenko-sa commented 5 years ago

Hello, bcbio community!

Thanks for the great framework!

Gnomad_genome frequencies help to prioritize variants in WGS studies. However, it would be nice to have functional prediction and conservation scores for noncoding variants.

For now, many scores come from dbNSFP, but, by definition, this database covers only nonsynonymous (i.e. coding + not synonymous) variants and splice-site variants (it contains 83,189,732 records). https://sites.google.com/site/jpopgen/dbNSFP

For noncoding variants, the same group proposes using WGSA: https://sites.google.com/site/jpopgen/wgsa "For SNV-centric resources, WGSA integrated 12 sets of functional prediction scores (CADD, FATHMM-MKL, FATHMM-XF, Funseq, Funseq2, RegulomeDB, DANN, fitCons x 4, GenoCanyon, Eigen & Eigen-PC, GenoSkyline-Plus x 127, LINSIGHT), 9 conservation scores (bStatistic, GERP++, PhyloP x 3, phastCons x 3, SyPhy), allele frequencies from 5 large-scale re-sequencing studies (1000G, ESP6500, ExAC, UK10K, gnomAD), variants in 4 disease related databases (ClinVar, COSMIC, GWAS_catalog, GRASP2), among others (see list of resources)."

Are there any plans to introduce WGSA to bcbio? The dataset is huge (1.4 TB, which is 2-3 times larger than most bcbio installations with human/mouse genomes), so a local installation of WGSA is probably not an option. But what about accessing it through Amazon Web Services? Does that look feasible (https://sites.google.com/site/jpopgen/wgsa/using-wgsa-via-aws)?

Thanks! Sergey

chapmanb commented 5 years ago

Sergey; Thanks for starting this discussion. This looks fairly unwieldy to deal with, and given how tricky dbNSFP has been, I'm worried about the amount of effort needed to make this happen. I'd love to have better prioritization for non-coding variants, but I'm also worried about this approach of enumerating every position. The AWS approach looks like setting up something custom within the context of a project, but maybe not the best target for bcbio to automate and support. How were you envisioning this all happening? Do you have any ideas on how we can do this in a useful way without needing to mess with these gigantic files? Thanks again.

naumenko-sa commented 5 years ago

Thanks Brad. In particular, I needed a GERP++ score. For this score there is also a small file (17 MB) with conserved elements: http://mendel.stanford.edu/SidowLab/downloads/gerp/ It could easily be recoded as a BED file and used in vcfanno. A similar approach might work for every score: transform the values into a discrete variable, i.e. bin them, and then create a BED file that splits the genome into elements, one per run of positions sharing a bin. SN
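
A minimal sketch of the binning idea above, not an existing bcbio or vcfanno feature. It assumes per-base scores for a contiguous region are already in memory; the function names and the bin edges are hypothetical and chosen only for illustration. Consecutive positions that fall into the same bin are merged into one BED interval (0-based, half-open).

```python
from itertools import groupby


def bin_score(score, edges):
    """Return the bin index of a score: the number of edges it meets or exceeds."""
    return sum(score >= edge for edge in edges)


def scores_to_bed(chrom, start, scores, edges):
    """Collapse per-base scores into BED-style intervals.

    chrom  -- chromosome name
    start  -- 0-based coordinate of the first score
    scores -- iterable of per-base scores for consecutive positions
    edges  -- ascending bin edges (hypothetical thresholds)
    Returns a list of (chrom, start, end, bin_index) tuples, half-open.
    """
    intervals = []
    pos = start
    # groupby merges runs of adjacent positions that share a bin index
    for bin_index, run in groupby(scores, key=lambda s: bin_score(s, edges)):
        length = sum(1 for _ in run)
        intervals.append((chrom, pos, pos + length, bin_index))
        pos += length
    return intervals


if __name__ == "__main__":
    # Illustrative values only, not real GERP++ scores
    for record in scores_to_bed("chr1", 100, [0.1, 0.5, 2.5, 3.0, 5.0, 0.2], [2.0, 4.0]):
        print("\t".join(str(field) for field in record))
```

The coarser the bins, the longer the runs and the smaller the resulting BED file, at the cost of score resolution.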

chapmanb commented 5 years ago

Sergey; Thanks for this. The gerp_elements files are a component of the GEMINI inputs but unfortunately are only available for build 37, so I hadn't ported them over to the generalized vcfanno support in bcbio and CWL, since I was trying to focus on shared resources also available for build 38. Are there any equivalent scores that would be useful and are also updated for the latest build that we could include? Thanks again.

naumenko-sa commented 5 years ago

Thanks Brad!

I had not noticed that the GERP conserved elements are already in the GEMINI bundle. Now I see! This works perfectly well for me, as I'm still on GRCh37 and a standalone bcbio installation. For GRCh38 I can only propose using the phastCons20way and phyloP20way scores from the UCSC browser: http://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/ http://genome.ucsc.edu/cgi-bin/hgTables They will also be small interval files, like the GERP elements, but they are updated for GRCh38.

Sergey
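
A rough sketch of turning a UCSC conserved-elements table dump into sorted BED records that could then be used as a vcfanno annotation source. It assumes tab-separated rows in the order `bin, chrom, chromStart, chromEnd, name, score`; that column layout and the function names are assumptions, so check them against the table schema on the UCSC site before relying on this.

```python
def ucsc_rows_to_bed(lines):
    """Convert UCSC-style table rows into sorted BED5 tuples.

    Assumed column order (verify against the table's .sql schema):
    bin, chrom, chromStart, chromEnd, name, score
    """
    records = []
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split("\t")
        # Drop the leading UCSC 'bin' index column; BED does not use it
        _bin, chrom, start, end, name, score = fields[:6]
        records.append((chrom, int(start), int(end), name, int(score)))
    # Sort by chromosome name, then coordinates, so the file can be bgzipped and indexed
    records.sort(key=lambda r: (r[0], r[1], r[2]))
    return records


def format_bed(records):
    """Render BED tuples as tab-separated text, one record per line."""
    return "\n".join("\t".join(str(field) for field in rec) for rec in records)


if __name__ == "__main__":
    # Made-up rows for illustration, not real UCSC data
    rows = [
        "585\tchr1\t20000\t20050\tlod=30\t300\n",
        "585\tchr1\t10000\t10020\tlod=15\t250\n",
    ]
    print(format_bed(ucsc_rows_to_bed(rows)))
```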

roryk commented 5 years ago

Thanks. @naumenko-sa, do you think this is something that would still be useful?

bcbio / bcbio-nextgen

WGSA annotation for noncoding variants in WGS studies #2587