ctlab / LinSeed

Linseed: LINear Subspace identification for gene Expression Deconvolution
MIT License

Weird cell proportions #8

Open pushtiks opened 4 years ago

pushtiks commented 4 years ago

Hi! I'm testing LinSeed on an RNA-seq dataset, and I ran into an issue where the predicted proportions in the summary come out greater than 1, like:

                  sample1       sample2      sample3
    Cell type 1   0.6073787     0.5740409    1.027164e+00
    Cell type 2   0.5784655     0.6328196    7.704742e-06

Do you have any idea what could have gone wrong?

I'm using R 3.6.0 and an edgeR CPM matrix, and there are only 2 cell types per sample. I'm planning to try TPMs instead and will update here whether or not anything changes.

keremw commented 4 years ago

Hi, I ran into the same thing. The proportions don't add up to one, and I used TPM. Any idea why that would happen? Should I just normalize each sample so that it sums to one? Thank you for a great tool, Kerem

pushtiks commented 4 years ago

@keremw, I got an answer from authors to my question via e-mail:

"this is not uncommon situation for predictors to give values that don't sum to one - this constraint is not imposed in exact way in the DSA and other similar algorithms.

I advise you to dig into the math of a process (both LinSeed or DSA etc) a little deeper. "

I tried it with TPMs and got the same result. Also, I didn't have time to dig deeper into LinSeed/DSA math and just switched to another tool.

konsolerr commented 4 years ago

@pushtiks @keremw

sorry for the slow replies on my side

The weird proportions come from the fact that in both LinSeed and DSA we try to satisfy the sum-to-one constraint as well as we can, but we do not force the proportions to sum to exactly one.

@keremw, if your analysis really requires sum-to-one, you can simply force it on the columns of the proportions matrix.

@pushtiks "Also, I didn't have time to dig deeper into LinSeed/DSA math and just switched to another tool." Sorry to hear that; I will try to maintain the package better from now on.
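For anyone who wants to try this, forcing sum-to-one on the columns could look roughly like the sketch below. It is written in plain Python rather than R just to keep it self-contained, it uses the numbers from the first post, and the function name `force_sum_to_one` is made up for illustration (it is not part of LinSeed). Clipping negative values to zero before rescaling is an extra assumption, not something the package does for you.

```python
# Proportions from the first post: rows are cell types, columns are samples.
proportions = [
    [0.6073787, 0.5740409, 1.027164e+00],  # Cell type 1
    [0.5784655, 0.6328196, 7.704742e-06],  # Cell type 2
]


def force_sum_to_one(matrix):
    """Rescale every column (sample) so that its entries sum to one.

    Hypothetical helper, not a LinSeed function. Negative entries are
    clipped to zero first (an assumption made for this sketch).
    """
    clipped = [[max(v, 0.0) for v in row] for row in matrix]
    col_sums = [sum(col) for col in zip(*clipped)]
    if any(s == 0 for s in col_sums):
        raise ValueError("a sample has no positive proportions to rescale")
    return [[v / s for v, s in zip(row, col_sums)] for row in clipped]


normalized = force_sum_to_one(proportions)
for name, row in zip(["Cell type 1", "Cell type 2"], normalized):
    print(name, [round(v, 4) for v in row])
```

Keep in mind that this only rescales the output; it does not change how the proportions were estimated, so a column that was far from one before rescaling may still deserve a closer look.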

Cheers, Konstantin

pushtiks commented 4 years ago

@konsolerr Konstantin, thank you for your reply! I'll also try to force sum-to-one of the proportion matrix then.

Could the issue be due to a "spill-over" effect, where one cell type raises the proportions of another? Also, suppose we have cell types A, B, and C, and C has subtypes C1 and C2, where C2 has expression similar to type B. Is the "proportion" of C2 collapsed into both the C and B cell-type proportions, or is it considered ambiguous and thus removed from the analysis? If C2 is collapsed into two cell types, that could also make the sum of proportions >1. These are just guesses at possible causes, as I didn't dig into the algorithms.

Cheers, Vickie

konsolerr commented 4 years ago

@pushtiks

Vickie, the inability to fit the sum-to-one constraint is rather technical. Before we calculate the proportions, we first find "pseudo-proportions": vectors whose changes correspond to changes in the proportions, but which are found in a different space, so pseudo-proportions won't satisfy the sum-to-one constraint. After that, we try to find coefficients that make our "pseudo-proportions" look more like actual proportions by solving linear equations to fit the sum-to-one constraint. But in most cases these equations cannot be solved exactly, so we "approximate" the sum-to-one constraint but never force it.
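To make the "approximate but never force" part concrete, here is a toy sketch in plain Python. It is not LinSeed's or DSA's actual code: the pseudo-proportion values are made up, and an ordinary least-squares fit stands in for "solving linear equations". With 2 cell types and 4 samples there are 4 equations (one per sample) but only 2 unknown coefficients, so in general the equations have no exact solution.

```python
# Made-up pseudo-proportions: one row per cell type, one column per sample.
# Each row tracks the true proportions up to an unknown scale, plus noise.
pseudo = [
    [2.1, 3.9, 6.2, 7.9],  # cell type 1
    [4.1, 2.9, 2.1, 0.9],  # cell type 2
]


def fit_scaling(p1, p2):
    """Least-squares coefficients c1, c2 with c1*p1[j] + c2*p2[j] ~ 1 for all j.

    Solves the 2x2 normal equations by Cramer's rule (assumption: exactly
    two cell types, which keeps the sketch free of external libraries).
    """
    a11 = sum(x * x for x in p1)
    a12 = sum(x * y for x, y in zip(p1, p2))
    a22 = sum(y * y for y in p2)
    b1, b2 = sum(p1), sum(p2)
    det = a11 * a22 - a12 * a12
    if det == 0:
        raise ValueError("pseudo-proportions are collinear")
    return (b1 * a22 - a12 * b2) / det, (a11 * b2 - a12 * b1) / det


c1, c2 = fit_scaling(*pseudo)
proportions = [[c1 * v for v in pseudo[0]], [c2 * v for v in pseudo[1]]]
column_sums = [sum(col) for col in zip(*proportions)]
print("coefficients:", round(c1, 4), round(c2, 4))
print("column sums:", [round(s, 4) for s in column_sums])
```

In this toy example the column sums end up close to one but not exactly one, which is the same kind of behaviour reported at the top of this issue.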

I like your questions about cell types and subtypes!

Before answering your question, I want to shift your perspective a bit on cell types and their signatures. When you have many cell types, it is easier to think in terms of genes: signature genes and shared genes. Signature genes are defined by higher expression in one cell type, while shared genes may have similar expression levels in several cell types.

Assume you have A, B, C1, and C2 (I assume that C2 is somewhat close to C1 but also has some expression signatures from B, if I read your example correctly). What LinSeed does is try to identify a linear subspace in which to place all the genes according to their expression in the pure cell types. Imagine a tetrahedron A-B-C1-C2 where we put all the genes with higher expression in A closer to vertex A (and the same for B, C1, and C2); all the housekeeping genes will be in the middle of this tetrahedron. Now comes the interesting case: we know that C1 and C2 are transcriptionally similar, which means that genes that are more highly expressed in C will be somewhere in the middle of the C1-C2 edge of this tetrahedron, and if C2 has expression similar to B, we will have some genes on the C2-B edge.

Now back to your question: if you have signature genes for C2 and C1, you will have marker genes in the corners of the tetrahedron, which means that you can just run the analysis with 4 cell types and you don't have to worry about "collapsing" C2 into C1 and B. However, if you run the analysis for 3 cell types, it's hard to guess what the triangle projection will be. If C1 and C2 are very similar and C2 has just "some genes" shared with B, then you will find proportions of A, B, and C1+C2. Otherwise it's hard to tell.
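A tiny illustration of this tetrahedron picture, in plain Python: all gene names and expression values below are invented, and dividing each gene's expression by its total across the pure cell types is just one simple way to get coordinates inside the tetrahedron. It is meant to show the geometry only; it is not how LinSeed works internally, since LinSeed starts from mixed samples and does not know the pure profiles.

```python
CELL_TYPES = ["A", "B", "C1", "C2"]

# Hypothetical expression of four genes in the pure cell types (made-up values).
genes = {
    "A_marker":     [100.0,  2.0,  3.0,  1.0],  # high only in A
    "housekeeping": [ 50.0, 50.0, 50.0, 50.0],  # the same everywhere
    "C_shared":     [  2.0,  3.0, 80.0, 80.0],  # shared by C1 and C2
    "C2_B_shared":  [  1.0, 60.0,  2.0, 60.0],  # shared by C2 and B
}


def simplex_coords(expr):
    """Scale a gene's expression to sum to one: a point in the tetrahedron."""
    total = sum(expr)
    return [v / total for v in expr]


def nearby_vertices(coords, threshold=0.2):
    """Cell types (vertices) carrying at least `threshold` of the gene's weight."""
    return [ct for ct, w in zip(CELL_TYPES, coords) if w >= threshold]


for name, expr in genes.items():
    coords = simplex_coords(expr)
    print(name, [round(w, 2) for w in coords], nearby_vertices(coords))
```

The marker gene lands next to a single vertex, the housekeeping gene sits in the centre with equal weight on all four vertices, and the two shared genes fall near the middle of the C1-C2 and C2-B edges.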

Cheers, Konstantin

pushtiks commented 4 years ago

@konsolerr Konstantin, thank you very much for the answers! The picture is a lot clearer now; the "pseudo-proportions" part and the importance of choosing the correct number of cell types are the pieces I missed. I think I have a plan for how to proceed with my data and LinSeed. Thank you and have a nice summer! Vickie