Most ideal normalization space for expression data?

rgalvin516 commented 3 years ago

Hello,

Firstly, thank you for compiling this resource for the public. The whole work is an incredible inspiration to me and others. I have been using gene expression data and exploring these immune subtypes. I noticed that there is a significant difference in best immune subtype calls depending on whether I run RSEM expected counts vs TPM gene expression data. The TPM data results in a much higher proportion of C4 in my data. I was wondering if in theory, one or the other would be more "valid"?

Thanks for your time

Gibbsdavidl commented 3 years ago

Hi, thanks for writing. Glad to hear it's hopefully useful.

The way I built the classifier, the method of gene quantification should not matter. TPM, RPKM, etc. should all give the same subtype calls.

Now, I am going to guess here. While the classifier requires no normalization, it is sensitive to, and harmed by, gene-wise normalization. What I mean is: if you take a gene expression matrix with genes in columns and samples in rows, and normalize each gene across samples (median scaling, for example), the classification is no longer valid. In my experience, it then often produces C4 subtype calls.

This is because the classification is now based on comparing pairs of genes and producing binary features. Within a sample, if gene_x > gene_y then feature=1, else feature=0. If you normalize genes across samples, the values are modified, so the outcome of the gene-gene comparisons can change.
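
To make the pairwise-comparison point concrete, here is a minimal sketch in plain Python. It is not the package's code: the expression values are made up, and the helper names (`pair_feature`, `scale_within_sample`, `median_scale_genes`) are hypothetical. It assumes that a TPM-like rescaling multiplies every gene in a sample by the same constant, and it uses median scaling as the example of gene-wise normalization.

```python
from statistics import median

# Toy expression matrix (made-up numbers): samples in rows, genes in columns.
# Column 0 is gene_x, column 1 is gene_y.
expr = [
    [10.0, 200.0],  # sample 1
    [12.0, 220.0],  # sample 2
    [30.0, 210.0],  # sample 3
]


def pair_feature(sample, i=0, j=1):
    """Binary feature: 1 if gene i > gene j within one sample, else 0."""
    return 1 if sample[i] > sample[j] else 0


def scale_within_sample(matrix, total=1e6):
    """Per-sample scaling: every gene in a row is multiplied by one constant."""
    return [[v / sum(row) * total for v in row] for row in matrix]


def median_scale_genes(matrix):
    """Gene-wise normalization: divide each column by its median across samples."""
    n_genes = len(matrix[0])
    meds = [median(row[g] for row in matrix) for g in range(n_genes)]
    return [[row[g] / meds[g] for g in range(n_genes)] for row in matrix]


raw_features = [pair_feature(s) for s in expr]
scaled_features = [pair_feature(s) for s in scale_within_sample(expr)]
genewise_features = [pair_feature(s) for s in median_scale_genes(expr)]

print(raw_features)       # features from the raw values
print(scaled_features)    # features after per-sample scaling
print(genewise_features)  # features after gene-wise median scaling
```

In this toy case the per-sample scaling leaves every feature unchanged, because the ordering of genes inside a sample is preserved. The gene-wise median scaling flips the feature for sample 3, since each gene is divided by a different constant.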

Does this help at all? I've been working (not much recently) on a paper that explains the classifier. It's here:
https://www.biorxiv.org/content/10.1101/2020.01.17.910950v1

rgalvin516 commented 3 years ago

Thanks a lot for your response. I have sent a message to the team at Xena to ask whether this matrix involves normalization of gene expression across samples. If I hear back, I will be sure to let you know the answer!

Gibbsdavidl commented 3 years ago

Thanks, please do! If we have a puzzling set of results, I'd like to get it figured out. ;-)

CRI-iAtlas / ImmuneSubtypeClassifier

Most ideal normalization space for expression data? #9