Data format

fangwuwang commented 7 years ago

Sorry I could not come to the seminar today. I have downloaded the data but could not push it to the repo because of the large file sizes. Is there any way to upload files of 100-200 MB to GitHub? For your information, the data formats are as follows: bigWig for Bisulfite-seq (hyper- and hypomethylated regions are also defined in bigBed format); processed RNA-seq with quantitation at the gene/transcript level in txt format (contig data in bigWig are also available if we don't trust their processing); processed ChIP-seq in bigBed format (bigWig data are also available but much bigger).

The data available for each sample are summarized in the Excel document uploaded earlier. I have all the data on my PC already and will find a way to pass it on to everyone later.

acavalla commented 7 years ago

Hi Fangwu,

GitHub blocks individual files larger than 100 MB, so files of 100-200 MB are too big to push. Also, in the data access summary posted, only four of the samples listed have the full complement of histone methylation data. Do we think this will be sufficient?

Annie

acavalla commented 7 years ago

We'd have to split the data into smaller files and upload them separately though, which is possible.
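A minimal sketch of what splitting could look like in R, assuming fixed-size binary chunks are acceptable. The function names `split_file` and `join_files`, the `.partNNN` naming, and the 50 MB default are all hypothetical choices for illustration, not something the group agreed on.

```r
# Split a file into fixed-size binary chunks (hypothetical helper).
split_file <- function(path, chunk_bytes = 50 * 1024^2, out_dir = dirname(path)) {
  con <- file(path, "rb")
  on.exit(close(con))
  parts <- character(0)
  i <- 0
  repeat {
    # Read up to chunk_bytes raw bytes; an empty result means end of file
    chunk <- readBin(con, what = "raw", n = chunk_bytes)
    if (length(chunk) == 0) break
    i <- i + 1
    part <- file.path(out_dir, sprintf("%s.part%03d", basename(path), i))
    writeBin(chunk, part)
    parts <- c(parts, part)
  }
  parts
}

# Concatenate the chunks, in the order given, back into a single file.
join_files <- function(parts, out_path) {
  out <- file(out_path, "wb")
  on.exit(close(out))
  for (p in parts) {
    writeBin(readBin(p, what = "raw", n = file.size(p)), out)
  }
  invisible(out_path)
}
```

Calling `split_file("some_file.bw")` would return the paths of the chunks it wrote, and `join_files()` on those paths in the same order should restore the original bytes. Everyone who clones the repo would still have to rejoin the parts, so posting download links may well be simpler.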

fangwuwang commented 7 years ago

Some individual files are over 100 MB. I am thinking of posting the links so that you can easily find the files and download them to your own PC. For the ChIP data, I think we can pool them with the DNA methylation analyses for whichever samples have them, unless we think that will be too much workload?

acavalla commented 7 years ago

1) File size. Amrit was talking about uploading data to GitHub and, if not, at least including in our repo a rundown of the data type and preprocessing already applied.

2) Trying to correlate DNA methylation with both histone methylation and gene expression might, I think, be a bit much. The effect of DNA methylation in particular depends on which specific bases are methylated rather than on the overall number of methylated bases (Pearson's correlation of mean methylation level with gene expression is r ~ -0.3 [https://academic.oup.com/bioinformatics/article/32/17/i405/2450762/Higher-order-methylation-features-for-clustering]). Taking this into account would, I think, increase the workload. Plus, if that's the bisulphite-seq column, it only allows us to use one more sample.

fangwuwang commented 7 years ago

Good comments. I just think that maybe we can analyse the DNA and histone data separately, have each analysis generate a list of TFs, and then combine the lists. The ChIP data are only available for the two mature lineages, so it might be interesting to infer lineage-specific gene expression from histone status. I will try to upload the data (compressed?), and if I cannot, I will include a file with a detailed description of the data.

rawnakhoque commented 7 years ago

@fangwuwang Hi Fangwu, would you please check the GSE87195 file? I don't see any MEP sample there. Could you also tell me what the differences are between the following samples?

RNA_D2_GMP_100 and RNA_D1_GMP_100

RNA_D1_HSCbm_100 and RNA_D1_HSC_100

RNA_D1_MLP0_100, RNA_D1_MLP1_100, RNA_D1_MLP2_100 and RNA_D1_MLP3_100

Are they replicates or different cell types? Can I treat them as replicates?

-Rawnak

fangwuwang commented 7 years ago

@rawnakhoque Sorry, I did not notice that the MEP population was missing. Let's remove that population for now. Some samples are labelled "D1" and some "D2"; these are donor numbers, so they are basically biological replicates. It is tricky, though, since some populations have only one sample and some have two (GMP, MPP). I saw that Paul and Amrit had given some useful suggestions and will look into those as well.
"HSCbm" is HSCs from bone marrow, and we don't need it since all other samples are from peripheral blood. There are four MLP populations (MLP0 to MLP3), and we decided to look at MLP1 as a representative.
The other dataset (GSE76234) is from another paper, and we can use it to test the reproducibility of our analysis results.
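The selection described above could be sketched in R roughly as follows. This assumes the sample names follow an `<assay>_<donor>_<population>_<suffix>` pattern, which is inferred only from the names quoted in this thread; the vector below lists just those quoted names, and the variable names are made up for illustration.

```r
# Sample names quoted in this thread; the real dataset contains more.
samples <- c("RNA_D1_GMP_100", "RNA_D2_GMP_100",
             "RNA_D1_HSC_100", "RNA_D1_HSCbm_100",
             "RNA_D1_MLP0_100", "RNA_D1_MLP1_100",
             "RNA_D1_MLP2_100", "RNA_D1_MLP3_100")

# Assumed naming pattern: <assay>_<donor>_<population>_<suffix>
parts <- do.call(rbind, strsplit(samples, "_", fixed = TRUE))
meta <- data.frame(sample = samples,
                   donor = parts[, 2],
                   population = parts[, 3],
                   stringsAsFactors = FALSE)

# Drop bone-marrow HSCs, and keep MLP1 as the only MLP population
drop <- meta$population == "HSCbm" |
  (grepl("^MLP", meta$population) & meta$population != "MLP1")
kept <- meta[!drop, ]

# Count donors (biological replicates) per remaining population
replicates <- table(kept$population)
print(kept)
print(replicates)
```

The replicate table makes the uneven design visible: populations with two donors sit next to populations with only one.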

rawnakhoque commented 7 years ago

@fangwuwang No problem. Thanks for your explanation. Now the data format is clearer to me. :)

rawnakhoque commented 7 years ago

@fangwuwang In the project proposal, point 3 of the aim and method section, you mentioned "convert/assign the transcript level to the gene level; ". Do you have any idea how to do that?

fangwuwang commented 7 years ago

@rawnakhoque I tried searching online and found this very useful page that lists all available tools and packages by function. Check the quantitation category; I am sure there will be good packages for this task.
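For reference, one simple way to go from transcript level to gene level is to sum the transcript values within each gene. The sketch below assumes the abundances are on an additive scale (TPM or counts, not log-transformed) and that the table carries a gene ID for every transcript. The column names and all values are made up for illustration, and this is not necessarily what a dedicated package would do.

```r
# Toy transcript-level table; IDs and values are made up for illustration.
tx <- data.frame(
  transcript_id = c("tx1", "tx2", "tx3", "tx4", "tx5"),
  gene_id       = c("geneA", "geneA", "geneB", "geneC", "geneC"),
  TPM           = c(10, 5, 20, 1, 4),
  stringsAsFactors = FALSE
)

# Sum the transcript values within each gene to get one row per gene
gene <- aggregate(TPM ~ gene_id, data = tx, FUN = sum)
print(gene)
```

With several samples stored as columns of a matrix, base R's `rowsum(mat, group = gene_ids)` performs the same per-gene summation in one call.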

santina commented 7 years ago

Hey guys, since you're using publicly available datasets, the one thing you should put on GitHub is the download script for the data, or at least a description of where and how you got the data. Ideally, members of your group or other people looking at your repo can just follow the instructions and get a copy of the data themselves.

So something like this:

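# Download data
# Note: biocLite() has been retired; current Bioconductor releases install packages through BiocManager
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
BiocManager::install("GEOquery")  # Install the R package that downloads data from GEO, if you haven't done this
library(GEOquery)  # Load the package
# Download the supplementary files; the result is a data frame whose row names are the local file paths
paths_to_files <- getGEOSuppFiles("GSE87195")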

# Code or instruction on how to unzip the files
# .... 

# And some data processing script 
# ...

By running this script in R, each person can get their own copy of the data.

@fangwuwang you can then write another script for the data processing. Ideally, anyone running these scripts will end up with exactly the same data as you, so you don't need to check back and forth with others.
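One way to confirm that everyone really has the same files is to compare checksums. Below is a minimal sketch using `tools::md5sum` from base R; the helper name `checksum_table` and the temporary demo directory are hypothetical stand-ins for the real download folder.

```r
# Record an MD5 checksum for every file under a data directory (hypothetical helper).
checksum_table <- function(data_dir) {
  files <- list.files(data_dir, recursive = TRUE, full.names = TRUE)
  data.frame(file = basename(files),
             md5 = unname(tools::md5sum(files)),
             stringsAsFactors = FALSE)
}

# Demo on a temporary directory standing in for the real download folder
demo_dir <- file.path(tempdir(), "checksum_demo")
dir.create(demo_dir, showWarnings = FALSE)
writeLines("gene_id\tTPM", file.path(demo_dir, "example.txt"))

mine <- checksum_table(demo_dir)
theirs <- checksum_table(demo_dir)  # in practice, a table produced by a teammate
print(identical(mine, theirs))
```

Committing such a table to the repo next to the download script would let each person check their own copy without passing the data around.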

fangwuwang commented 7 years ago

http://deweylab.biostat.wisc.edu/rsem/rsem-calculate-expression.html#output

fangwuwang / team_Bloodies

Data format #4