chenlingantelope / MSscRNAseq2019

Analysis for 2019 submission "Integrated single cell analysis of blood and cerebrospinal fluid leukocytes in multiple sclerosis", Schafflick, Xu, Hartlehnert et al.
MIT License

Questions about CSEA #6

Open SirKuikka opened 1 year ago

SirKuikka commented 1 year ago

Hi,

I have managed to run the CSEA method, but I have several questions. I hope you have time to answer.

  1. Can the "latent" data matrix in VISION be Principal Component Analysis (PCA) embeddings? If I have multiple replicates, is data integration recommended to remove batch effects, and should the PCA embeddings be batch-corrected?
  2. In the "Datasets.ipynb" notebook, are the count matrices provided as input raw counts, i.e. unnormalized?
  3. What do these signatures mean: "ID_EXP_DOWN", "Patho-manual_DOWN"? By the way, I find it quite strange that the signature names change during the analysis. At some point "DOWN" is shortened to "DN", even though I am using the allsigs.txt signature file and not changing the names myself.
  4. In the "CSEA_All.ipynb" notebook, I manage to obtain significantly enriched signatures, such as

| Signature | Pvalue | corrected Pvalue | Pvalue control | Pvalue control corrected | ES_max | leading edge |
|---|---|---|---|---|---|---|
| IL23+TGFB+IL6 vs TGFB+IL6 (Th17) | 0 | 0 | 0 | 0 | 0.012492751 | 799 |

and when I visualize the signature using the "CSEA_TFH.ipynb" notebook, it looks like this:

image

However, when I calculate the p-value in the same notebook using the command np.mean(np.asarray([np.max(x) for x in control2[1:]]) > np.max(ES)), for some reason it is 1.
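
For reference, the expression above reads as an empirical p-value: the fraction of control gene sets whose maximum enrichment score (ES) exceeds the maximum ES of the tested signature. A value of 1 therefore means that every control set peaked higher than the signature did. The sketch below restates that logic in plain Python on made-up toy curves; the function name `empirical_pvalue` and all numbers are illustrative assumptions, not taken from the notebooks.

```python
def empirical_pvalue(observed_es, control_es):
    """Fraction of control curves whose peak ES exceeds the observed peak ES."""
    observed_max = max(observed_es)
    control_max = [max(curve) for curve in control_es]
    exceed = sum(1 for m in control_max if m > observed_max)
    return exceed / len(control_max)


# Toy running-ES curves (invented values, for illustration only).
ES = [0.0, 0.004, 0.012, 0.007]
controls = [
    [0.0, 0.020, 0.015],
    [0.0, 0.030, 0.010],
    [0.0, 0.018, 0.002],
]

# Every control peak is above 0.012, so the p-value is 1.
p_high = empirical_pvalue(ES, controls)

# No control peak is above 0.05, so the p-value is 0.
p_low = empirical_pvalue([0.0, 0.05], controls)

print(p_high, p_low)
```

Read this way, the p-value depends entirely on which control sets are used, so the same signature can look significant against one control set and not against another.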

When I visualize the results on the latent_u embeddings (UMAP, batch-corrected), I don't see any difference between my two groups.

image

For comparison, this is the enrichment plot for "Th2", which is not supposed to be enriched based on the results. It looks very different compared to the first example.

image

chenlingantelope commented 1 year ago
  1. Please refer to the VISION documentation for detailed instructions, but yes, PCA embeddings are what VISION uses by default. PCA does not perform batch correction, though, and batch correction is recommended. https://yoseflab.github.io/VISION/articles/VISION-vignette.html

  2. Yes, they are raw counts.

  3. A lot of these gene sets were generated manually, and the names might have been changed at some point for ease of reading.

  4. The difference between "CSEA_All.ipynb" and "CSEA_TFH.ipynb" is that they use different control sets. CSEA_All uses random gene sets as controls to save computational time, while CSEA_TFH uses gene sets that are matched in mean to the TFH gene set. I believe that if you'd like to test a new signature with more confidence, you would need to generate your own matched random set. Also, since CSEA is a directional test, I would take the significant results from both the CSEA_X.ipynb and CSEA_X-Inverse.ipynb notebooks.
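
One possible way to build such a matched random set is sketched below. It is an assumption about the general technique (binning genes by mean expression and drawing replacements from the same bin), not necessarily what the CSEA_TFH notebook does. The function name `matched_control_sets`, its parameters, and the toy gene names are all hypothetical.

```python
import random


def matched_control_sets(gene_means, signature, n_sets=100, n_bins=10, seed=0):
    """Draw control gene sets whose genes fall in the same mean-expression
    bins as the signature genes. Signature genes are never drawn.

    Raises ValueError or KeyError if a bin holds too few non-signature genes.
    """
    rng = random.Random(seed)
    sig = set(signature)

    # Rank genes by mean expression and cut the ranking into equal-size bins.
    ranked = sorted(gene_means, key=gene_means.get)
    bin_size = max(1, len(ranked) // n_bins)
    bin_of = {g: min(i // bin_size, n_bins - 1) for i, g in enumerate(ranked)}

    # Candidate pool per bin, excluding the signature itself.
    pools = {}
    for g in ranked:
        if g not in sig:
            pools.setdefault(bin_of[g], []).append(g)

    # Number of signature genes that fall in each bin.
    needed = {}
    for g in signature:
        needed[bin_of[g]] = needed.get(bin_of[g], 0) + 1

    controls = []
    for _ in range(n_sets):
        control = []
        for b, k in needed.items():
            control.extend(rng.sample(pools[b], k))
        controls.append(control)
    return controls


# Toy data: 100 genes with increasing mean expression (invented values).
gene_means = {f"g{i}": float(i) for i in range(100)}
signature = ["g5", "g6", "g50", "g95"]
controls = matched_control_sets(gene_means, signature, n_sets=5)
print(controls[0])
```

Each control set can then be scored in the same way as the real signature, and the resulting peak scores used as the null distribution for the p-value.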