Swarbricklab-code / BrCa_cell_atlas

Data processing and analysis related code associated with the study "A single-cell and spatially resolved atlas of human breast cancers".

How did you combine the pseudobulk and TCGA data before hierarchical clustering? #6

Closed ZeroLi-Bio closed 2 years ago

ZeroLi-Bio commented 2 years ago

Hi! I am wondering how you combined the pseudobulk data and the TCGA data before hierarchical clustering, since I couldn't find this code in the repo. Could you point me to how you normalized the two datasets, which come from different platforms?

Thank you

dlroden commented 2 years ago

Hi, thanks for your query. We use both R and Cluster 3.0 for bulk RNA clustering. We don't have a single script for this, but here are the instructions for the Cluster 3.0 part, and we have added a supporting R script. We used the following methodology for both the pseudo-bulk and TCGA RNA-Seq datasets:

1) Upper-quartile normalization (see R code here). It's important to supply the input file as a matrix without gene names and sample IDs.
2) Log transformation (Cluster 3.0). Open the upper-quartile normalized data and filter the genes by selecting the options "%Present >= 80" and "At least 1 observations with abs(Val) >= 2.0". Apply the filter and accept the filtered genes. Then click the "Adjust data" tab, tick the option "Log Transform Data", and hit Apply.
3) Gene median centering. Still in the "Adjust data" tab, click "Center genes", choose the option "Median", and hit Apply. Save this file out.
4) Column standardization (see R code here).
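
For readers who want to see the four steps in one place, here is a minimal Python sketch using only the standard library. It is not the R script or the Cluster 3.0 implementation referred to above. The function names, the scale factor of 1000, the use of the 75th percentile of non-zero values, the base-2 logarithm, and the treatment of non-positive values as missing (`None`) are all assumptions made for illustration, and the gene symbols and counts are toy data.

```python
import math
import statistics


def upper_quartile_normalize(matrix, scale=1000.0):
    """Divide each sample (column) by the 75th percentile of its non-zero values.

    The scale factor is an assumption; the thread does not state one.
    """
    out = [row[:] for row in matrix]
    for j in range(len(matrix[0])):
        nonzero = [row[j] for row in matrix if row[j] > 0]
        upper_quartile = statistics.quantiles(nonzero, n=4, method="inclusive")[2]
        for row in out:
            row[j] = row[j] / upper_quartile * scale
    return out


def filter_genes(genes, matrix, min_present=0.8, min_abs=2.0):
    """Keep genes with >= 80% present values and at least one |value| >= 2.0."""
    kept_genes, kept_rows = [], []
    for gene, row in zip(genes, matrix):
        present = [v for v in row if v is not None]
        enough_present = len(present) / len(row) >= min_present
        has_large_value = any(abs(v) >= min_abs for v in present)
        if enough_present and has_large_value:
            kept_genes.append(gene)
            kept_rows.append(row)
    return kept_genes, kept_rows


def log2_transform(matrix):
    """Log2-transform; non-positive values become missing (None), an assumption."""
    return [[math.log2(v) if v > 0 else None for v in row] for row in matrix]


def median_center_genes(matrix):
    """Subtract each gene's (row's) median over its non-missing values."""
    out = []
    for row in matrix:
        present = [v for v in row if v is not None]
        med = statistics.median(present) if present else 0.0
        out.append([None if v is None else v - med for v in row])
    return out


def standardize_columns(matrix):
    """Scale each sample (column) to mean 0 and standard deviation 1."""
    out = [row[:] for row in matrix]
    for j in range(len(matrix[0])):
        col = [row[j] for row in matrix if row[j] is not None]
        mean, sd = statistics.fmean(col), statistics.stdev(col)
        for row in out:
            if row[j] is not None:
                row[j] = (row[j] - mean) / sd
    return out


def preprocess(genes, matrix):
    """Run the steps in the order described in the thread, on ONE dataset."""
    m = upper_quartile_normalize(matrix)
    genes, m = filter_genes(genes, m)
    m = log2_transform(m)
    m = median_center_genes(m)
    return genes, standardize_columns(m)


# Toy data: rows are genes, columns are samples (hypothetical values).
genes = ["ESR1", "KRT5", "ERBB2", "MKI67"]
counts = [
    [10, 20, 30],
    [0, 5, 8],
    [100, 50, 25],
    [4, 4, 40],
]
genes_out, processed = preprocess(genes, counts)
```

The same `preprocess` call would be run once on the pseudo-bulk matrix and once on the TCGA matrix, so that each dataset is normalized against its own distribution before the two are combined.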

This methodology should be applied to each dataset individually; the two datasets are then merged on the "intrinsic gene list". For clustering, we again use Cluster 3.0. Open the normalized, filtered, median-centered, standardized file, click the "Hierarchical" tab, and tick both Genes and Arrays. Use the option "Correlation (centered)" as the similarity metric and choose centroid linkage as the clustering method.
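
The merge step can be sketched as follows, again in plain Python and only as an illustration. The helper names are hypothetical, the gene symbols are placeholders rather than the actual intrinsic gene list (which is not given in this thread), and the values are toy data. The second function computes the ordinary Pearson correlation, which is what the "Correlation (centered)" option corresponds to; the centroid-linkage clustering itself is left to Cluster 3.0.

```python
import math


def merge_on_genes(genes_a, matrix_a, genes_b, matrix_b, gene_list):
    """Keep genes from gene_list found in both datasets and join their samples.

    Rows are genes, columns are samples; dataset A's samples come first.
    """
    index_a = {g: i for i, g in enumerate(genes_a)}
    index_b = {g: i for i, g in enumerate(genes_b)}
    shared = [g for g in gene_list if g in index_a and g in index_b]
    merged = [matrix_a[index_a[g]] + matrix_b[index_b[g]] for g in shared]
    return shared, merged


def centered_correlation(x, y):
    """Pearson correlation between two equal-length profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Placeholder gene list and already-preprocessed toy matrices.
gene_list = ["ESR1", "KRT5", "ERBB2", "FOXA1"]
pseudobulk_genes = ["ESR1", "KRT5", "ERBB2", "MKI67"]
pseudobulk = [[1.0, -1.0], [-0.5, 0.5], [0.2, -0.2], [0.0, 0.1]]
tcga_genes = ["KRT5", "ESR1", "GATA3", "ERBB2"]
tcga = [[-0.4, 0.6], [0.9, -1.1], [0.3, 0.3], [0.1, -0.3]]

shared, merged = merge_on_genes(
    pseudobulk_genes, pseudobulk, tcga_genes, tcga, gene_list
)

# Similarity between the first pseudo-bulk sample and the first TCGA sample.
sample_pb = [row[0] for row in merged]
sample_tcga = [row[2] for row in merged]
similarity = centered_correlation(sample_pb, sample_tcga)
```

Genes missing from either dataset (here `FOXA1`, `MKI67`, and `GATA3`) are dropped, so the merged matrix contains only genes measured on both platforms.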

The software will now create 4 files that you can visualize using Java TreeView. Both Cluster 3.0 and Java TreeView are freely available for download.

In short, for each of the two bulk datasets separately, you need to perform upper-quartile normalization, log transformation, gene median centering, and finally column standardization before clustering.

Thanks