WalhoutLab / MERGE

A new computational pipeline that can be used to convert expression (RNA-seq) data into predicted flux potentials that indicate metabolic function
MIT License

Integrating count data #4

Closed kamoors closed 2 years ago

kamoors commented 2 years ago

Hi, again!

I have come back to your workflow, but have a question about integrating count data.

I want to compare multiple CS models using FVA later on (different conditions from the RNA-seq experiment), so my question is: do I need to normalize the count data before applying iMAT++ to it? In your paper you mention that the conditions are being compared, so I assume that if I don't normalize beforehand, this comparison will be problematic?

FYI, I would normalize using DESeq2's built-in normalization, as this is the same thing they used for the DE analysis.

Thanks again!

kamoors

XuhangLi commented 2 years ago

Hi there!

This is a very good question! I am glad you asked. We routinely use iMAT++ on bulk RNA-seq data with different conditions. The best practice in my opinion is to use batch-effect-removed, depth-normalized counts as input.

So, you can use batch-effect-removed TPM (e.g. use limma to remove the batch effect on logTPM, and then exponentiate to get back to TPM) or normalized counts from DESeq2 (if you don't need to remove a batch effect). DESeq2 normalization has an advantage over TPM if a lot of genes are differentially expressed between your conditions.
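To make the depth-normalization step more concrete, here is a minimal pure-Python sketch of the median-of-ratios idea that DESeq2's size factors are based on. It is an illustration only, not DESeq2's implementation; in practice you would take the normalized counts directly from DESeq2. The function names `size_factors` and `normalize` and the toy count matrix are made up for this example.

```python
import math
from statistics import median


def size_factors(counts):
    """Median-of-ratios size factors.

    counts: list of genes, each a list of raw counts (one per sample).
    """
    n_samples = len(counts[0])
    # Keep only genes with nonzero counts in every sample so the log is defined.
    usable = [row for row in counts if all(c > 0 for c in row)]
    if not usable:
        raise ValueError("no gene has nonzero counts in every sample")
    # Log of the per-gene geometric mean across samples (the pseudo-reference).
    log_ref = [sum(math.log(c) for c in row) / n_samples for row in usable]
    factors = []
    for j in range(n_samples):
        log_ratios = [math.log(row[j]) - ref for row, ref in zip(usable, log_ref)]
        factors.append(math.exp(median(log_ratios)))
    return factors


def normalize(counts):
    """Divide each sample's counts by that sample's size factor."""
    sf = size_factors(counts)
    return [[c / s for c, s in zip(row, sf)] for row in counts]


# Toy data (hypothetical): sample 2 was sequenced exactly twice as deep as sample 1.
counts = [
    [10, 20],
    [100, 200],
    [50, 100],
    [0, 4],  # contains a zero, so it is skipped when estimating size factors
]
sf = size_factors(counts)
norm = normalize(counts)
```

In this toy case the two samples differ only in depth, so after normalization the genes used for estimation have equal values in both samples. Real data with many differentially expressed genes is where a median-based factor is expected to be more robust than a total-count scaling such as TPM.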

In our paper, we simply used TPM (because it is in fact single-cell data). But yes, normalized counts from DESeq2 are a good option.

Please let me know if you have any questions!

Hang

kamoors commented 2 years ago

Thank you for your super quick reply :)

I did the normalization with DESeq2, and now I also generated the categories using the CatExp.py script. In your experience, is it ok to use this when comparing different conditions (instead of tissues)? Does it not remove some differences that we would actually be looking for?

XuhangLi commented 2 years ago

Hi,

CatExp works well with RNA-seq data for conditions, although the categories should be treated with caution. The issue is not conditions vs. tissues; it is bulk RNA-seq vs. single-cell sequencing. Bulk data may not give you a good fit with a bimodal Gaussian. In addition, it would be wise to carefully evaluate the thresholds and make sure they are biologically meaningful, e.g., a rare-low threshold should be a sufficiently small number, say a DESeq2-normalized count of 5 (or, in terms of TPM, say TPM < 0.5).

So, if the fitting quality is good and your thresholds are reasonable, I think it will be a good start.
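As a rough way to check both points (fit quality and threshold plausibility) outside of CatExp.py, the sketch below fits a two-component Gaussian mixture to log2 expression values with a plain EM loop and then converts a candidate cutoff back to linear units. It does not reproduce what CatExp.py does internally. The synthetic data, the function names, and the choice of the low component's mean as the candidate rare-low cutoff are all assumptions made for illustration.

```python
import math
import random


def normal_pdf(x, mu, sigma):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))


def fit_two_gaussians(values, n_iter=200):
    """EM for a 1D two-component Gaussian mixture.

    Returns (weights, means, sigmas); component 0 starts as the lower one.
    """
    xs = sorted(values)
    n = len(xs)
    half = n // 2
    # Initialise the means from the lower and upper halves of the sorted data.
    means = [sum(xs[:half]) / half, sum(xs[half:]) / (n - half)]
    overall = sum(xs) / n
    sd = math.sqrt(sum((x - overall) ** 2 for x in xs) / n)
    sigmas = [sd, sd]
    weights = [0.5, 0.5]
    for _ in range(n_iter):
        # E-step: responsibility of each component for each point.
        resp = []
        for x in xs:
            p = [weights[k] * normal_pdf(x, means[k], sigmas[k]) for k in range(2)]
            total = p[0] + p[1]
            resp.append((0.5, 0.5) if total == 0 else (p[0] / total, p[1] / total))
        # M-step: update weights, means and standard deviations.
        for k in range(2):
            nk = max(sum(r[k] for r in resp), 1e-12)
            means[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, xs)) / nk
            sigmas[k] = max(math.sqrt(var), 1e-3)
            weights[k] = nk / n
    return weights, means, sigmas


# Synthetic log2 expression values (NOT real data): a low and a high population.
random.seed(0)
log2_expr = [random.gauss(-1.0, 1.0) for _ in range(300)]
log2_expr += [random.gauss(5.0, 1.5) for _ in range(700)]

weights, means, sigmas = fit_two_gaussians(log2_expr)

# Hypothetical sanity check: treat the low component's mean as a candidate
# rare-low cutoff, convert it back to linear units, and compare it with a
# ceiling you consider biologically meaningful (5 is the normalized-count
# figure mentioned above; substitute your own).
candidate_cutoff = 2 ** means[0]
ceiling = 5.0
plausible = candidate_cutoff <= ceiling
```

If the fitted means sit close together or one component's weight collapses toward zero, that is a hint the data are not clearly bimodal, which is the situation described above for bulk data. Plotting the fitted curve over a histogram of the log values is still the most direct check.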

For your second question, I am not sure what you mean by 'removing the difference that we actually look for'. But one point is that iMAT++ is indeed a coarse-grained analysis for bulk data, as many of the quantitative changes will be ignored (e.g., a gene changing from 100 TPM to 1000 TPM will not change its category; both conditions will fall in the highly expressed category). So we would recommend checking out FPA in the paper if you are interested.
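A small sketch of why the categorization is coarse-grained: once expression values are binned, a tenfold change inside one bin is invisible. The category labels and the threshold values below are hypothetical placeholders, not the ones CatExp.py would produce for your data.

```python
import math


def categorize(value, rare_max, low_max, high_min):
    """Bin one expression value using three (hypothetical) thresholds."""
    if value < rare_max:
        return "rare"
    if value < low_max:
        return "low"
    if value < high_min:
        return "moderate"
    return "high"


# Hypothetical TPM thresholds, for illustration only.
thresholds = dict(rare_max=0.5, low_max=5.0, high_min=50.0)

tpm_condition_a = 100.0
tpm_condition_b = 1000.0

cat_a = categorize(tpm_condition_a, **thresholds)
cat_b = categorize(tpm_condition_b, **thresholds)

# The quantitative difference that the categories do not capture.
log2_fold_change = math.log2(tpm_condition_b / tpm_condition_a)
```

Both conditions end up in the same category here even though the gene changes tenfold, which is the kind of difference a quantitative method such as FPA is meant to use.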

Hang