KevinMenden / scaden

Deep Learning based cell composition analysis with Scaden.
https://scaden.readthedocs.io
MIT License

Can you release the processed real bulk data for model validation? #80

Closed yfzon closed 3 years ago

yfzon commented 3 years ago

Hi Kevin, I came across your paper and found it really useful for dealing with bulk RNA-seq data, which is exactly what I need. Could you release the processed real bulk data listed in Table S2 (PBMC1, PBMC2, Xin, ROSMAP, and Ascites)? That way I could explore the existing real bulk RNA-seq data first and repeat your experiments on it.

Thanks in advance. Best, Fan

KevinMenden commented 3 years ago

Hi Fan,

unfortunately I can't just distribute that data, as some of it was given to me directly by the authors or has access restrictions. However, I did not do any processing of the bulk RNA-seq data, and I provided links to all datasets in the paper. So you can just download the count or normalized data and apply Scaden if you like.

Let me know if you have any issues or trouble downloading the data.

Best, Kevin
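
For readers following the suggestion above (download a count table and apply Scaden to it directly), a quick sanity check of the downloaded file can save a failed run. The sketch below is not part of Scaden; `check_bulk_table` is a hypothetical helper, and it assumes the table is tab-separated with genes in rows and samples in columns, with the gene identifier in the first column. Check the Scaden documentation for the exact input format your version expects.

```python
import csv
import io


def check_bulk_table(handle):
    """Validate a tab-separated genes x samples table.

    Assumed layout (an assumption, see the Scaden docs for the real spec):
    first row = sample names, first column = gene identifiers,
    remaining cells = non-negative numeric expression values.
    Returns (number_of_genes, list_of_sample_names).
    """
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader, None)
    if header is None or len(header) < 2:
        raise ValueError("expected a header with a gene column and at least one sample")

    samples = header[1:]
    if len(set(samples)) != len(samples):
        raise ValueError("duplicate sample names in header")

    genes = set()
    for line_no, row in enumerate(reader, start=2):
        if not row:  # skip blank lines
            continue
        if len(row) != len(header):
            raise ValueError(
                f"line {line_no}: expected {len(header)} fields, got {len(row)}"
            )
        gene = row[0]
        if gene in genes:
            raise ValueError(f"line {line_no}: duplicate gene {gene!r}")
        genes.add(gene)
        for value in row[1:]:
            # float() itself raises ValueError for non-numeric cells
            if float(value) < 0:
                raise ValueError(f"line {line_no}: negative value {value!r}")

    return len(genes), samples


# Tiny in-memory example; a real file would be opened with open(path, newline="")
example = "\tsample1\tsample2\nCD3E\t10\t0\nCD19\t3\t7\n"
n_genes, sample_names = check_bulk_table(io.StringIO(example))
print(n_genes, sample_names)
```

Tables downloaded from GEO often carry extra annotation columns or comment lines, so a check like this usually needs small adjustments per dataset.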

yfzon commented 3 years ago

@KevinMenden Hi Kevin, thanks for your reply. I'm afraid I have to ask you for the details of how each real bulk dataset was collected. I believe this will also help others who have the same question.

KevinMenden commented 3 years ago

Hi @yfzon,

sure, if you have a look at the Methods section of the Scaden paper you should find all the details you need (see below for the part of interest).

For instance, the GEO accession for the PBMC2 dataset is GSE107011.

Note also the part at the end: "For the RNA-seq datasets analyzed in this study, we did not apply any additional processing steps but used the obtained count or expression tables directly as downloaded for all datasets except the ROSMAP dataset"

As I said, we basically used the count or expression tables as downloaded, except for the ROSMAP dataset.

Okay - let me know where and with which dataset you are struggling and then I surely can help you out!

Cheers, Kevin

Text from the Methods section:

"Tissue datasets for benchmarking. To assess the deconvolution performance on real tissue expression data, we used datasets for which the corresponding cell fractions were measured and published. The first dataset is the PBMC1 dataset, which was obtained from Zimmermann et al. (21). The second dataset, PBMC2, was downloaded from GEO with accession code GSE107011 (10). This dataset contains both RNA-seq profiles of immune cells (S4 cohort) and from bulk individuals (S13 cohort). As we were interested in the bulk profiles, we only used 12 samples from the S13 cohort from these data. Flow cytometry fractions were collected from the Monaco et al. publication (10).

In addition to the above mentioned two PBMC datasets, we used Ascites RNA-seq data. This dataset was provided by the authors, and cell type fractions for this dataset were taken from the supplementary materials of the publication (18).

For the evaluation on pancreas data, artificial bulk RNA-seq samples created from the scRNA-seq dataset of Xin et al. (20) were used. This dataset was downloaded from the resources of the MuSiC publication (8). The artificial bulk RNA-seq samples used for evaluation were then created using the “bulk_construct” function of the MuSiC tool.

To assess how Scaden and the GEP algorithms deal with the presence of unknown cell types, we generated PBMC bulk RNA samples from the four scRNA-seq datasets (6000 each). The undefined amount of unknown cells that was generated by this approach was removed to be replaced by defined amounts of 5, 10, 20, and 30% of unknown cells, respectively. Cell fractions of all four samples were predicted with Scaden trained on the other three.

Performance on these samples was then assessed to test robustness against unseen cell types in the bulk mixture. Scaden was trained on samples from all datasets but the test dataset, while CSx and MuSiC used data8k as a reference.

The microarray dataset GSE65133 was downloaded from GEO, and cell type fractions were taken from the original CS publication (6).

Last, we wanted to get insights into neurodegenerative cell fraction changes in the brain. While it is known that neurodegenerative diseases like AD are accompanied by a gradual loss of brain neurons, stage-specific cell type shifts are still hard to come by. Here, we use the ROSMAP study cortical RNA-seq dataset along with the corresponding clinical metadata, to infer cell type composition over six clinically relevant stages of neurodegeneration (22). Furthermore, to assess deconvolution accuracy on postmortem human brain tissue, we used 41 samples from the ROSMAP, for which cell composition information from immunohistochemistry (23) was recently released and for which fractions for all cell types were reported. The ROSMAP RNA-seq data were downloaded from www.synapse.org/. The cell composition values were provided by the authors of the study (23).

RNA-seq preprocessing and analysis. For the RNA-seq datasets analyzed in this study, we did not apply any additional processing steps but used the obtained count or expression tables directly as downloaded for all datasets except the ROSMAP dataset. For the latter, we generated count tables from raw FastQ files using Salmon (33) and the GRCh38 reference genome. FastQ files from the ROSMAP study were downloaded from Synapse (www.synapse.org)."
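
To repeat a benchmark on one of these datasets, the predicted cell fractions have to be compared with the measured ones. The sketch below shows three common agreement metrics: root mean squared error, Pearson's correlation, and Lin's concordance correlation coefficient. The choice of metrics is an assumption made for illustration, since the excerpt above does not say which ones were used, and the function names and example values are hypothetical. Inputs are flat lists of fractions in matching order (for example, all samples and cell types concatenated).

```python
import math


def _moments(x, y):
    """Return n, both means, and the sums of squares and cross-products."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return n, mx, my, sxx, syy, sxy


def rmse(pred, true):
    """Root mean squared error between predicted and measured fractions."""
    if len(pred) != len(true) or not pred:
        raise ValueError("inputs must be non-empty and of equal length")
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, true)) / len(pred))


def pearson_r(x, y):
    """Pearson correlation; measures linear association, ignores offsets."""
    _, _, _, sxx, syy, sxy = _moments(x, y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / math.sqrt(sxx * syy)


def ccc(x, y):
    """Lin's concordance correlation coefficient (population moments).

    Unlike Pearson's r, it also penalises systematic offsets and scaling.
    """
    n, mx, my, sxx, syy, sxy = _moments(x, y)
    denom = sxx / n + syy / n + (mx - my) ** 2
    if denom == 0:
        raise ValueError("CCC is undefined when both inputs are the same constant")
    return (2 * sxy / n) / denom


# Made-up values, purely to show the call pattern
predicted = [0.25, 0.35, 0.40]
measured = [0.20, 0.30, 0.50]
print(rmse(predicted, measured), pearson_r(predicted, measured), ccc(predicted, measured))
```

Reporting an offset-sensitive metric alongside Pearson's r is useful because a method can rank cell types correctly while still over- or under-estimating every fraction.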