Questions about CSEA - Githubissues

Hi,

I have managed to run the CSEA method, but I have several questions. I hope you have time to answer.

1. Can the "latent" data matrix in VISION be Principal Component Analysis (PCA) embeddings? If I have multiple replicates, is data integration recommended to remove batch effects, and should the PCA embeddings be batch-corrected?
2. In the "Datasets.ipynb" notebook, are the count matrices provided as input raw counts, i.e. unnormalized?
3. What do these signatures mean: "ID_EXP_DOWN" and "Patho-manual_DOWN"? By the way, I find it quite strange that the signature names change during the analysis. At some point "DOWN" is shortened to "DN", even though I am using the allsigs.txt signature file and not changing the names myself.
4. In the "CSEA_All.ipynb" notebook, I manage to obtain significantly enriched signatures, such as:

| Signature | Pvalue | corrected Pvalue | Pvalue control | Pvalue control corrected | ES_max | leading edge |
| --- | --- | --- | --- | --- | --- | --- |
| IL23+TGFB+IL6 vs TGFB+IL6 (Th17) | 0 | 0 | 0 | 0 | 0.012492751 | 799 |

and when I visualize the signature using the "CSEA_TFH.ipynb" notebook, it looks like this:

However, when I calculate the p-value in the same notebook using the command np.mean(np.asarray([np.max(x) for x in control2[1:]]) > np.max(ES)), for some reason it is 1.

When I visualize the results on the latent_u embeddings (UMAP, batch-corrected), I don't see any difference between my two groups.

For comparison, this is the enrichment plot for "Th2", which is not supposed to be enriched based on the results. It looks very different compared to the first example.

1. Please refer to the VISION documentation for detailed instructions, but yes, a PCA embedding is what VISION uses by default. PCA does not perform batch correction, though, and batch correction is recommended. https://yoseflab.github.io/VISION/articles/VISION-vignette.html
2. Yes, they are raw counts.
3. A lot of these genesets were manually generated, and the names might have changed for ease of reading at some point.
4. The difference between "CSEA_All.ipynb" and "CSEA_TFH.ipynb" is that they use different control sets. CSEA_All.ipynb uses random genesets as the control to save computational time, while CSEA_TFH.ipynb uses genesets that are matched in mean expression to the TFH geneset. I believe that if you'd like to test a new signature with more confidence, you would need to generate your own matched random set. Also, since CSEA is a directional test, I would take the significant results from both the CSEA_X.ipynb and CSEA_X-Inverse.ipynb notebooks.
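To illustrate the point about control sets, here is a minimal sketch of how one might draw random genesets matched in mean expression to a signature, and how the empirical p-value quoted in the question behaves. This is not the code from the notebooks: `matched_control_sets`, `empirical_pvalue`, the quantile-binning strategy, and all of the toy data are assumptions made up for this example. The p-value function mirrors the expression from the question, minus the `control2[1:]` slicing.

```python
import numpy as np

def matched_control_sets(gene_means, signature_idx, n_sets=100, n_bins=10, seed=0):
    """Hypothetical helper: draw random genesets whose mean-expression
    bins match those of the signature (quantile binning is an assumption)."""
    rng = np.random.default_rng(seed)
    # Assign every gene to a quantile bin of mean expression.
    edges = np.quantile(gene_means, np.linspace(0, 1, n_bins + 1))
    bins = np.clip(np.searchsorted(edges, gene_means, side="right") - 1, 0, n_bins - 1)
    sig_bins = bins[signature_idx]
    controls = []
    for _ in range(n_sets):
        picked = []
        for b in np.unique(sig_bins):
            k = int(np.sum(sig_bins == b))
            # Candidate genes: same bin, not part of the signature itself.
            pool = np.setdiff1d(np.flatnonzero(bins == b), signature_idx)
            # Sample with replacement only if the pool is too small.
            picked.append(rng.choice(pool, size=k, replace=len(pool) < k))
        controls.append(np.concatenate(picked))
    return controls

def empirical_pvalue(observed_es, control_es):
    """Fraction of control sets whose maximum ES exceeds the observed maximum ES."""
    control_max = np.asarray([np.max(x) for x in control_es])
    return float(np.mean(control_max > np.max(observed_es)))

# Toy data: 2000 genes, signature made of the 50 most highly expressed genes.
rng = np.random.default_rng(1)
gene_means = rng.gamma(2.0, 1.0, size=2000)
signature_idx = np.argsort(gene_means)[-50:]
controls = matched_control_sets(gene_means, signature_idx, n_sets=20)

# Toy enrichment-score curves: two of the four controls peak above the observed 0.3.
p_mixed = empirical_pvalue([0.0, 0.3, 0.1], [[0.1, 0.2], [0.5, 0.1], [0.4], [0.05]])
# Every control peaks above the observed maximum, so the p-value is 1.
p_one = empirical_pvalue([0.01], [[0.2], [0.3]])
```

Under these assumptions, a p-value of 1 simply means that every control set reached a higher maximum ES than the signature. That can happen with mean-matched controls even when unmatched random controls give a p-value of 0.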

chenlingantelope / MSscRNAseq2019

Questions about CSEA #6