
FunctionLab / ExPecto

predicting expression effects of human genome variants ab initio from sequence


Training data #8

Closed ChongWu-Biostat closed 5 years ago

ChongWu-Biostat commented 6 years ago

Hello,

I am very interested in your new method for predicting gene expression. In your Nature Genetics paper, you mention that the first component of ExPecto (the deep learning part) is trained on data from the ENCODE and Roadmap Epigenomics projects. Could you describe in more detail exactly which data you used? Did you run any pre-processing steps before using the data? Could you release a download link or provide a tutorial on how to get the data? Thank you in advance for your help!

Thanks, Chong

jzthree commented 6 years ago

Hi Chong,

You can get the files from:

http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeAwgDnaseUniform/
http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeAwgTfbsUniform/
https://egg2.wustl.edu/roadmap/data/byFileType/peaks/consolidated/narrowPeak/

(I downloaded from a different URL, http://www.broadinstitute.org/~anshul/projects/roadmap/peaks/stdnames30M/combrep, but that is not available now.)

The data processing is as described in our Nature Methods paper on DeepSEA.

If you just want the processed data that we used for training, you can get it here: https://www.dropbox.com/sh/h7frboas0c5ofy1/AAAx4NFxtRDDchRhteYA_jsBa?dl=0

Best, Jian

ChongWu-Biostat commented 6 years ago

Hi Jian,

Thank you for your generosity in sharing the data. We really appreciate it.

Thanks, Chong

ChongWu-Biostat commented 5 years ago

Hi Jian,

I have one more question. It seems the data in the Dropbox folder contains only phenotype data. Can you tell me how I can get the genotype data? Thank you for your help and time. We really appreciate it.

Thanks, Chong

jzthree commented 5 years ago

Hi Chong,

It does include both sequences and labels (note that there are two variables in each mat file; the one with size 2Nx4x2000 holds the sequences). The sequences are one-hot encoded with the channel order AGCT. In each file, the sequences include both the forward sequences (first half) and their reverse complements (second half).
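
For readers working with this layout, here is a minimal Python sketch of how such an array could be decoded. It uses a small synthetic array rather than the real files, since the thread does not say how the mat files should be loaded or what the variables are called. The helper names (`encode`, `decode`, `reverse_complement`) are hypothetical, and the sketch assumes that forward sequence i is paired with the reverse complement at index N+i, which the thread implies but does not state outright.

```python
import numpy as np

ALPHABET = "AGCT"  # channel order stated in the thread
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G", "N": "N"}

def encode(seq):
    """One-hot encode a DNA string into a (4, len(seq)) array in AGCT order."""
    arr = np.zeros((4, len(seq)), dtype=np.uint8)
    for pos, base in enumerate(seq):
        if base in ALPHABET:  # unknown bases stay as all-zero columns
            arr[ALPHABET.index(base), pos] = 1
    return arr

def decode(onehot):
    """Decode a (4, L) one-hot array; all-zero columns are written as 'N'."""
    idx = onehot.argmax(axis=0)
    known = onehot.sum(axis=0) > 0
    return "".join(ALPHABET[i] if k else "N" for i, k in zip(idx, known))

def reverse_complement(seq):
    """Reverse the string and swap each base for its complement."""
    return "".join(COMPLEMENT[b] for b in reversed(seq))

# Synthetic stand-in for the 2N x 4 x 2000 array (here N = 2, length 6).
forward = ["ACGTTG", "GGATCA"]
x = np.stack(
    [encode(s) for s in forward]
    + [encode(reverse_complement(s)) for s in forward]
)

n = x.shape[0] // 2  # first half: forward, second half: reverse complement
decoded_fwd = [decode(x[i]) for i in range(n)]
decoded_rc = [decode(x[n + i]) for i in range(n)]

# Assumed pairing: entry n + i is the reverse complement of entry i.
pairs_match = all(
    reverse_complement(f) == r for f, r in zip(decoded_fwd, decoded_rc)
)

# With AGCT order, complementing is the same as reversing the channel axis
# (A<->T are channels 0 and 3, G<->C are channels 1 and 2), so the reverse
# complement is a flip along both the channel and position axes.
flip_matches = np.array_equal(x[n:], x[:n, ::-1, ::-1])
```

When applying this to the real files, it is worth checking the array's shape first: depending on the mat file version and the loader used, the axes may come back in a different order than 2N x 4 x 2000.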

Best, Jian

ChongWu-Biostat commented 5 years ago

Hi Jian,

Thank you for your quick and very helpful reply. I fully understand the processed data now. Thank you for sharing the data. We really appreciate it.

Thanks, Chong