biocore / tcga

Microbial analysis in TCGA data
BSD 3-Clause "New" or "Revised" License

data format requirement #19

Closed: leiwaaping closed this issue 4 years ago

leiwaaping commented 4 years ago

While checking the batch analysis code tcga/r_scripts/All_Tumor_batchanalysisFA.R, I noticed that the input file is not described. May I know what those *FA.RData files look like? Do they contain just read counts or normalized values, or also other data such as sample metadata or covariance information? Also, does the "Voom" data set mean that the raw counts were only log-transformed, and "Voom-SNM" that the raw counts were log-transformed and then quantile-normalized?

Thanks for taking the time to answer this.

gregpoore commented 4 years ago

@leiwaaping Thanks a bunch for the question and apologies for the delay. The batch correction script is actually hosted as a Jupyter Notebook file, tcga/jupyter_notebooks/TCGA Batch Correction -- Final Analysis.ipynb (link here). The R script that you referenced was used to create the bar plot shown in Fig 1e of the paper; it simply imported the PVCA data from the Jupyter notebook and reformatted it for plotting in R. As for the inputs, cell 5 of that Jupyter notebook states the following:

%%R
load("tcgaVbDataAndMetadataAndSNM.RData")

The .RData file contained the raw count version of the Kraken taxonomy data and the corresponding TCGA metadata. In general, the .RData file type was used for neatly encapsulating multiple R objects and sharing them easily (e.g. between a local computer and a server for running the machine learning analyses). The Kraken data and metadata are called later in the script as vbDataBarnDFReconciledQC (i.e. the quality-control filtered raw microbial data) and metadataSamplesAllQC (i.e. the corresponding quality-control filtered metadata), both of which had 17,625 samples as rows and microbial or metadata features as columns. These same data are available on the paper's FTP link:

  1. Metadata: ftp://ftp.microbio.me/pub/cancer_microbiome_analysis/TCGA/Kraken/Metadata-TCGA-Kraken-17625-Samples.csv
  2. Raw Kraken count data here: ftp://ftp.microbio.me/pub/cancer_microbiome_analysis/TCGA/Kraken/Kraken-TCGA-Raw-Data-17625-Samples.csv
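
Once downloaded, the two CSVs can be read and checked for matching sample order before any modeling. The following is only a minimal sketch: it assumes the first column of each file holds the sample ID (please inspect the header first), and `read_aligned` is a hypothetical helper that is not part of the repository. Temporary toy files stand in for the FTP downloads.

```r
# Hypothetical helper: read the metadata and count tables and confirm that
# they describe the same samples in the same order.
# Assumption: the first column of each CSV holds the sample ID.
read_aligned <- function(metadata_path, counts_path) {
  metadata <- read.csv(metadata_path, row.names = 1, check.names = FALSE,
                       stringsAsFactors = FALSE)
  counts <- read.csv(counts_path, row.names = 1, check.names = FALSE)

  # Keep only samples present in both tables, in the metadata's order.
  shared <- intersect(rownames(metadata), rownames(counts))
  metadata <- metadata[shared, , drop = FALSE]
  counts <- counts[shared, , drop = FALSE]

  stopifnot(identical(rownames(metadata), rownames(counts)))
  list(metadata = metadata, counts = as.matrix(counts))
}

# Toy demonstration with two temporary files standing in for the downloads.
meta_file <- tempfile(fileext = ".csv")
count_file <- tempfile(fileext = ".csv")
write.csv(data.frame(sample_type = c("Primary Tumor", "Blood Derived Normal"),
                     row.names = c("s1", "s2")), meta_file)
write.csv(data.frame(taxonA = c(5, 0), taxonB = c(12, 7),
                     row.names = c("s2", "s1")), count_file)

toy <- read_aligned(meta_file, count_file)
dim(toy$counts)  # samples as rows, taxa as columns
```

Reordering both tables by the shared sample IDs guards against the count rows and metadata rows silently drifting out of sync, which would otherwise go unnoticed by `model.matrix`.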

Those R objects were then used in cell 8 to build the model matrix and perform Voom:

%%R
qcMetadata <- metadataSamplesAllQC # metadataSamplesAllQCAML
qcData <- vbDataBarnDFReconciledQC # vbDataBarnDFReconciledQCAML

# Set up design matrix
covDesignNorm <- model.matrix(~0 + sample_type +
                                  data_submitting_center_label +
                                  platform +
                                  experimental_strategy +
                                  tissue_source_site_label +
                                  portion_is_ffpe,
                              data = qcMetadata)
[...]

vdge <- voom(dge, design = covDesignNorm, plot = TRUE, save.plot = TRUE, normalize.method="none")
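
For intuition about what the `E` component of the voom output holds: with `normalize.method="none"`, it is the log2 counts-per-million of each sample, with small offsets that keep zero counts finite. Below is a base-R sketch of that transform only. It assumes features as rows and samples as columns (the orientation limma expects) and library sizes equal to plain column sums, so it ignores any normalization factors stored in `dge` and does not reproduce the precision weights that voom also estimates. `log2_cpm` is a hypothetical name.

```r
# Hypothetical illustration of voom's log2-CPM transform (no weights).
# counts: numeric matrix, features as rows and samples as columns.
log2_cpm <- function(counts) {
  lib_size <- colSums(counts)
  # Add 0.5 to every count, then divide each column by (library size + 1).
  scaled <- sweep(counts + 0.5, 2, lib_size + 1, FUN = "/")
  log2(scaled * 1e6)
}

toy_counts <- matrix(c(10, 90, 0, 100), nrow = 2,
                     dimnames = list(c("taxonA", "taxonB"), c("s1", "s2")))
toy_E <- log2_cpm(toy_counts)
toy_E  # zero counts stay finite because of the 0.5 offset
```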

Note that some metadata columns triggered errors, so their entries were concatenated/cleaned up using gsub (as shown in the script). Next, the expression component of the vdge object (vdge$E) was fed into SNM in cell 11, after the model matrices were built in cell 10; the normalized microbial data matrix was then extracted and saved (NB: depending on your computer setup, the SNM algorithm may run for up to 30 minutes):

%%R
snmDataObjSampleTypeWithExpStrategyFA <- snm(raw.dat = vdge$E, 
                                            bio.var = bio.var.sample.type, 
                                            adj.var = adj.var, 
                                            rm.adj=TRUE,
                                            verbose = TRUE,
                                            diagnose = TRUE)
snmDataSampleTypeWithExpStrategyFA <- t(snmDataObjSampleTypeWithExpStrategyFA$norm.dat)
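
The idea behind `rm.adj=TRUE` can be illustrated with a much simpler stand-in: fit the biological and technical covariates jointly, then subtract only the fitted technical part. The sketch below uses ordinary least squares on simulated toy data; it is not the SNM algorithm itself, and every name in it (`toy_meta`, `bio_design`, `adj_design`, and so on) is hypothetical rather than taken from the notebook.

```r
set.seed(1)
n <- 40
toy_meta <- data.frame(
  sample_type = factor(rep(c("tumor", "normal"), times = n / 2)),
  center = factor(rep(c("centerA", "centerB"), each = n / 2))
)

# One simulated feature with a biological effect (+2) and a batch effect (+3).
y <- 2 * (toy_meta$sample_type == "tumor") +
  3 * (toy_meta$center == "centerB") + rnorm(n, sd = 0.1)

# Biological design keeps the intercept; the adjustment design drops it.
bio_design <- model.matrix(~ sample_type, data = toy_meta)
adj_design <- model.matrix(~ center, data = toy_meta)[, -1, drop = FALSE]

# Joint least-squares fit, then subtract only the fitted batch contribution.
fit <- lm.fit(cbind(bio_design, adj_design), y)
adj_coef <- fit$coefficients[colnames(adj_design)]
y_adjusted <- y - drop(adj_design %*% adj_coef)

# The center difference is removed; the tumor/normal difference remains.
tapply(y_adjusted, toy_meta$center, mean)
tapply(y_adjusted, toy_meta$sample_type, mean)
```

Fitting both sets of covariates together, rather than regressing out the batch variable alone, is what keeps the biological signal from being absorbed into the batch estimate when the two are correlated.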

Per your second question: Voom is a separate algorithm from SNM. Voom was published in Genome Biology (link here) and has been cited >2000 times for transforming discrete count data into microarray-like data, such that tools originally developed for microarrays (e.g. limma or snm) can be used on the transformed data. SNM is an approach that was originally developed for microarrays (paper here) to measure and remove batch effects in a supervised manner. We thus used Voom as a tool to transform the discrete count data into microarray-like data, followed by SNM to remove the batch effects (thus labeled as Voom-SNM).

I hope that helps! Let me know if you have further questions.

leiwaaping commented 4 years ago

Dear Greg,

Thanks for getting back to me. I really appreciate your detailed message; it does help a lot.

Regards, Becca

On May 2, 2020, at 6:41 AM, Greg Poore <notifications@github.com> wrote:
