almaan / stereoscope

Spatial mapping of cell types by integration of transcriptomics data
MIT License

Time and usage for Visium data #1

Closed by giovp 4 years ago

giovp commented 4 years ago

Hi,

really nice project! I would like to try it out on Visium data. I have a mouse kidney dataset of this size:

[2020-01-17 17:24:27,162 - stsc - INFO ] >> SC data GENES : 2399  SC data CELLS : 19101  SC data TYPES : 27

and I ran it with standard parameters:

stereoscope run \
        --sc_cnt ./../res/mouse_kidney/sc_counts.tsv \
        --sc_labels ./../res/mouse_kidney/sc_clusters.tsv \
        --st_cnt ./../res/mouse_kidney/st_counts.tsv \
        -o ./../res/stereoscope_mouse_kidney \
        -sce 75000 \
        -n 5000 \
        -ste 75000 \
        -stb 100 \
        --gpu \
        -scb 100

On GPU, fitting takes about a second per epoch (~20 hrs in total to fit the scRNAseq dataset), making it quite slow for the analysis. I was wondering whether you have already tried it on Visium data and therefore have a better hyperparameter set that could speed up the analysis.

Thanks in advance! Giovanni

almaan commented 4 years ago

Hi Giovanni, glad to hear that you found the method interesting.

From what I understand of your setup, it is not the usage of Visium data per se that makes the process slow, but rather fitting the single cell data. As of now I would not deem 20 hours all too bad for the single cell data, but there are a few things you could do to cut down the time required for the analysis to complete. These are:

  1. Increase your batch size. Depending on the memory available on your GPU, you could certainly use a batch size of something like 2048 or 4096, which should reduce some of the overhead of moving the data between the CPU and the GPU.

  2. Reduce the number of epochs. While 75k epochs were used in some examples, this is usually far more than required. You usually see extremely fast convergence for both sets of data, and it's not rare that 25k epochs are sufficient for good results. What you could do is run the analysis with a higher number of epochs specified, inspect how the loss changes over the epochs (using the progress module for example, see the README for more info), and simply cancel the analysis (using Ctrl+C or the equivalent) when you deem the system to have reached convergence.

  3. If you are reluctant to follow the suggestions above, or simply want to decrease the time even more, one option is to subsample your data in a similar fashion to what is presented for the mouse data. Having applied the method to multiple different kinds of tissue and single cell data sets, it has become evident that as few as ~25 cells are usually enough to estimate good parameters for a cell type. Hence you could reduce the number of single cells in your data set and thus speed up the analysis to some extent.
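
To make the subsampling idea in point 3 concrete, here is a minimal sketch in plain Python. It is not the script used for the mouse data: the function name `subsample_cells`, the upper bound of 250 cells per type, and the choice to drop types with fewer than 25 cells are all assumptions made for illustration. It takes one cell type label per cell (in the same order as the rows of the count matrix) and returns the row indices to keep.

```python
import random
from collections import defaultdict


def subsample_cells(labels, max_per_type=250, min_per_type=25, seed=0):
    """Return sorted row indices, keeping at most `max_per_type` cells per type.

    Hypothetical helper: cell types with fewer than `min_per_type` cells
    are dropped entirely, larger ones are randomly down-sampled.
    """
    rng = random.Random(seed)  # fixed seed so the selection is reproducible

    # Group the row indices by cell type label
    by_type = defaultdict(list)
    for idx, label in enumerate(labels):
        by_type[label].append(idx)

    keep = []
    for idxs in by_type.values():
        if len(idxs) < min_per_type:
            continue  # too few cells to estimate parameters for this type
        if len(idxs) > max_per_type:
            idxs = rng.sample(idxs, max_per_type)  # sample without replacement
        keep.extend(idxs)
    return sorted(keep)


# Toy example with made-up labels: 1000 cells of type A, 100 of B, 10 of C
labels = ["A"] * 1000 + ["B"] * 100 + ["C"] * 10
keep = subsample_cells(labels)
print(len(keep))  # 350: 250 of A, all 100 of B, type C dropped
```

The returned indices would then be used to select the same rows from both the count file and the label file before they are passed to `--sc_cnt` and `--sc_labels`, so that the two stay aligned.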

Let me know if you have any further questions!