Estimating Usage of GEP in another data set?

DiegoSafian commented 6 months ago

Hi,

I wonder if there is an appropriate way to estimate the usage of GEPs in another dataset, so that one can compare changes in usage across conditions. For example, I estimate GEP usage per cell class in dataset A and want to know the usage of these same GEPs in dataset B.

My best, Diego

dylkot commented 6 months ago

Hi, this is actually the topic of our recent preprint https://t.co/OexYxSnc3D

The code we use for doing this is here:

https://github.com/immunogenomics/starCAT

The step to package the output of cNMF for starCAT is still a bit of a work in progress, but it is the build_reference() function on the development branch, which you can optionally enable automatically in the consensus step with build_ref=True.

Let me know if this makes sense or if you have questions!
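
For readers who want the general idea behind estimating usage of already-learned GEPs in a new dataset, here is a minimal sketch. It is not the starCAT implementation. It assumes the GEP spectra from dataset A are held fixed and that only non-negative usages are fit for each cell of dataset B, here with simple multiplicative updates. The function names and toy numbers are hypothetical.

```python
def refit_usage(x, spectra, n_iter=500, eps=1e-12):
    """Fit non-negative usages u so that x is approximated by u @ spectra.

    x       : expression of one cell (list of length n_genes)
    spectra : fixed GEP spectra learned elsewhere (k lists of length n_genes)
    """
    k = len(spectra)
    n_genes = len(x)
    # b = spectra @ x and a = spectra @ spectra.T stay constant across iterations
    b = [sum(spectra[i][g] * x[g] for g in range(n_genes)) for i in range(k)]
    a = [[sum(spectra[i][g] * spectra[j][g] for g in range(n_genes))
          for j in range(k)] for i in range(k)]
    u = [1.0] * k
    for _ in range(n_iter):
        au = [sum(a[i][j] * u[j] for j in range(k)) for i in range(k)]
        # Multiplicative update keeps every usage non-negative
        u = [u[i] * b[i] / (au[i] + eps) for i in range(k)]
    return u


def normalize(u):
    """Rescale usages to sum to 1 so they read as fractions per cell."""
    total = sum(u)
    return [v / total for v in u] if total > 0 else list(u)


# Toy example (hypothetical numbers): 2 GEPs over 3 genes, one cell from dataset B
spectra_from_a = [[1.0, 0.0, 1.0],
                  [0.0, 1.0, 1.0]]
cell_from_b = [2.0, 3.0, 5.0]

usage = refit_usage(cell_from_b, spectra_from_a)
usage_fraction = normalize(usage)
```

Because the spectra never change, usages fit this way in dataset B refer to the same programs as in dataset A, which is what makes them comparable across datasets.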

DiegoSafian commented 6 months ago

Hi, thanks for your response. I am actually running cNMF from the command line (installed with pip install cnmf), but I cannot find a way to enable build_ref=True. Do I need to work in a Python environment to do it?

This is how I run it:

conda activate cnmf

echo "### Step 1: prepare" 
cnmf prepare --output-dir ./data --name 15_26_cNMF_5000 -c data_matrix.txt -k 15 16 17 18 19 20 22 24 26 --n-iter 250 --seed 14 --numgenes 5000 --total-workers 10

echo "### Step 2: factorize" 
cnmf factorize --output-dir ./data --name 15_26_cNMF_5000 --worker-index 0

echo "### Step 3: combine"
cnmf combine --output-dir ./data --name 15_26_cNMF_5000

echo "### Step 4: plot"
cnmf k_selection_plot --output-dir ./data --name 15_26_cNMF_5000

echo "### Step 5: consensus"
cnmf consensus --output-dir ./data --name 15_26_cNMF_5000 --components 17 --local-density-threshold 0.025 --show-clustering
cnmf consensus --output-dir ./data --name 15_26_cNMF_5000 --components 18 --local-density-threshold 0.025 --show-clustering
cnmf consensus --output-dir ./data --name 15_26_cNMF_5000 --components 19 --local-density-threshold 0.025 --show-clustering
cnmf consensus --output-dir ./data --name 15_26_cNMF_5000 --components 20 --local-density-threshold 0.025 --show-clustering
cnmf consensus --output-dir ./data --name 15_26_cNMF_5000 --components 22 --local-density-threshold 0.025 --show-clustering
cnmf consensus --output-dir ./data --name 15_26_cNMF_5000 --components 24 --local-density-threshold 0.025 --show-clustering

dylkot commented 6 months ago

Currently it is only on the development branch of the GitHub repository (hopefully it will be moved to the main branch in the next few weeks). You can install it with pip like so:

pip install git+https://github.com/dylkot/cNMF.git@development

If you don't mind, let me know how it goes since this is something we are actively working on supporting.
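
For anyone who prefers to run the consensus step from Python after installing the development branch, a hedged sketch follows. It assumes the cNMF class and its consensus method accept the arguments shown, with build_ref being the option named earlier in this thread. The wrapper function is hypothetical, and it checks the installed version before passing build_ref rather than assuming the option is there.

```python
import inspect


def consensus_with_reference(output_dir, name, k, density_threshold):
    """Run the cNMF consensus step and request a starCAT reference (sketch)."""
    # Imported lazily so this function can be defined without cNMF installed
    from cnmf import cNMF

    cnmf_obj = cNMF(output_dir=output_dir, name=name)

    # Assumption: build_ref is only available on versions that include the
    # development-branch changes, so verify before using it
    params = inspect.signature(cnmf_obj.consensus).parameters
    if "build_ref" not in params:
        raise RuntimeError(
            "Installed cNMF has no build_ref option; "
            "install the development branch first."
        )

    cnmf_obj.consensus(
        k=k,
        density_threshold=density_threshold,
        show_clustering=True,
        build_ref=True,
    )


# Example call, reusing the run from this thread (needs cNMF and its output files):
# consensus_with_reference("./data", "15_26_cNMF_5000", k=20, density_threshold=0.025)
```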

DiegoSafian commented 5 months ago

Hi again, I tried it and it works perfectly fine and is extremely fast! The results are good; however, the usage % in dataset B decreased quite a bit. On the other hand, I am probably asking too much, because I am comparing single-nucleus data from two different species, which is more challenging due to differences in cell composition and gene expression capture. Still, it produces very coherent results, and I will definitely keep using it. I am attaching a figure so you can get an idea of the results: example.pdf

Many thanks!
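
As an illustration of the comparison described in this thread (GEP usage per cell class in dataset A versus dataset B), here is a small sketch. It assumes per-cell usages have already been normalized to fractions and that cell-class labels are available for both datasets. All names and numbers are hypothetical and are not taken from the attached figure.

```python
from collections import defaultdict


def mean_usage_by_class(usages, classes):
    """Average each GEP's usage over the cells belonging to each class."""
    sums = {}
    counts = defaultdict(int)
    for row, cls in zip(usages, classes):
        if cls not in sums:
            sums[cls] = [0.0] * len(row)
        sums[cls] = [s + v for s, v in zip(sums[cls], row)]
        counts[cls] += 1
    return {cls: [s / counts[cls] for s in total] for cls, total in sums.items()}


# Toy usage matrices (rows = cells, columns = GEPs), hypothetical values
usage_a = [[0.8, 0.2], [0.6, 0.4], [0.1, 0.9]]
class_a = ["neuron", "neuron", "glia"]
usage_b = [[0.5, 0.5], [0.3, 0.7], [0.2, 0.8]]
class_b = ["neuron", "neuron", "glia"]

mean_a = mean_usage_by_class(usage_a, class_a)
mean_b = mean_usage_by_class(usage_b, class_b)

# Per-class change in mean usage (B minus A) for classes present in both datasets
shift = {
    cls: [b - a for a, b in zip(mean_a[cls], mean_b[cls])]
    for cls in mean_a
    if cls in mean_b
}
```

A shift like this mixes biological and technical differences, so a drop in usage across species or capture methods is not by itself evidence that a program is less active.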

dylkot / cNMF

Estimating Usage of GEP in another data set? #84