Proper normalization for Proteomics and RNA-seq data

MonicaSteffi commented 1 year ago

Hi,

I am trying to use MOFA to integrate proteomics and RNA-seq datasets. I ran the following code:

```r
library(DESeq2)
library(MOFA2)

# RNA-seq: variance-stabilising transformation of the raw counts
dd_WO_batch <- DESeqDataSetFromMatrix(countData = as.matrix(readcounts_2),
                                      colData = coldata_1,
                                      design = ~ Group)
dd_WO_batch <- estimateSizeFactors(dd_WO_batch)
vst_normalized <- varianceStabilizingTransformation(dd_WO_batch)
vst_normalized_1 <- assay(vst_normalized)

# Proteomics: VSN normalisation of the non-imputed intensities.
# normalize_vsn() is not defined in this post; assumed to come from the DEP package.
proteom <- read.table("Proteomics_without_imputation_replaced_significant.tsv",
                      sep = "\t", header = TRUE, check.names = FALSE, row.names = 1)
proteom_1 <- normalize_vsn(proteom)

# Build the MOFA object from the two views
data <- list(view_0 = vst_normalized_1,
             view_1 = proteom_1)
MOFAobject <- create_mofa(data)

# Model and training options
data_opts <- get_default_data_options(MOFAobject)
model_opts <- get_default_model_options(MOFAobject)
model_opts$num_factors <- 14
train_opts <- get_default_training_options(MOFAobject)
train_opts$convergence_mode <- "fast"
train_opts$seed <- 42

# Prepare and train the model
MOFAobject_1 <- prepare_mofa(MOFAobject,
                             data_options = data_opts,
                             model_options = model_opts,
                             training_options = train_opts)
MOFAobject <- run_mofa(MOFAobject_1, outfile = "MOFA2_trained.hdf5", use_basilisk = TRUE)
```

And I got the following warning message:

In .quality_control(object, verbose = verbose) : Factor(s) 1, 3 are strongly correlated with the total number of expressed features for at least one of your omics. Such factors appear when there are differences in the total 'levels' between your samples, *sometimes* because of poor normalisation in the preprocessing steps.

As per the recommendation, I applied a variance-stabilizing transformation to both data sets, but I am not sure why I am still getting this warning message.

Could you please suggest how to proceed?

Regards, Monica

gtca commented 1 year ago

Hey @MonicaSteffi,

It may simply be that, even after appropriate normalisation, the total number of features is the strongest signal in the data. In practice, if e.g. principal component plots for each individual modality reflect the expected data structure (cell types, etc.), it might be worth labelling these factors as technical (unless the total number of features reflects biology) and moving on to interpreting the other top factors.
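
One way to see which factors track per-sample totals is to correlate the factor values with simple per-sample summaries of each view. The following is a minimal base-R sketch on simulated data, not MOFA2's internal check: the object names (`view`, `factors`, `depth`) and the 0.8 cutoff are assumptions chosen for illustration. With a trained model, the factor matrix would instead come from something like `get_factors(MOFAobject, factors = "all")[[1]]`.

```r
# Minimal sketch with simulated data; all names and the cutoff are hypothetical.
set.seed(42)

n_samples  <- 30
n_features <- 200
samples    <- paste0("sample_", seq_len(n_samples))

# Simulated per-sample shift in overall level (e.g. a technical "depth" effect)
depth <- rnorm(n_samples)

# One simulated view (features x samples), standing in for a normalised matrix
view <- matrix(rnorm(n_features * n_samples), nrow = n_features,
               dimnames = list(paste0("feat_", seq_len(n_features)), samples))
view <- sweep(view, 2, depth, "+")  # add the per-sample shift to every feature

# Simulated factor matrix (samples x factors); Factor1 follows the shift
factors <- cbind(Factor1 = depth + rnorm(n_samples, sd = 0.1),
                 Factor2 = rnorm(n_samples))
rownames(factors) <- samples

# Per-sample summaries: total signal and number of features above zero
sample_total    <- colSums(view, na.rm = TRUE)
sample_detected <- colSums(view > 0, na.rm = TRUE)

# Correlate every factor with each summary (one value per factor)
cor_total    <- cor(factors, sample_total,    use = "pairwise.complete.obs")[, 1]
cor_detected <- cor(factors, sample_detected, use = "pairwise.complete.obs")[, 1]
print(round(rbind(total = cor_total, detected = cor_detected), 2))

# Flag factors above an illustrative cutoff (assumed value, not MOFA2's)
cutoff  <- 0.8
flagged <- names(cor_total)[abs(cor_total) > cutoff]
print(flagged)

# Per-view PCA, to compare the leading components with known sample structure
pca <- prcomp(t(view), center = TRUE, scale. = FALSE)
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
print(round(var_explained[1:2], 2))
```

Factors flagged this way are candidates for the "technical" label, while the remaining factors can be interpreted as usual. Whether a flagged factor is technical or biological still has to be judged against the sample metadata.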

yaolin101 commented 1 year ago

Hi gtca,

Quite an excellent tool! I gave MOFA2 the same number of features for integrating transcriptomics, proteomics, and lipidomics, but it still gives me this warning. Any suggestions on this?

Kind regards, Yao

bioFAM / MOFA2

Proper normalization for Proteomics and RNA-seq data #101