Unusual running output relevant to the feature number in anndata and the running speed

LiuCanidk commented 1 week ago

@Ruoqiao2020 Hi, thanks for developing this tool! When I run SPIDER on my own Seurat object, the output looks strange: only 1000 features of my Seurat object were retained. These are clearly not all of the features, and no variable features were present either. I am sure that I have done all QC and basic preprocessing steps, including filtering, normalization, scaling, finding HVGs, and clustering. I wonder what is wrong.

Also, SPIDER seems slow when it needs to process 74253 cells (~60 hours; the runtime seems to scale linearly from the 1200 cells of the example data). Is there any way to accelerate the computation? I did not find any parallel options in the function SPIDER_predict. However, I noticed the GPU and TPU options in the output. How can I use the GPU?

Any advice or discussion would be greatly appreciated!

LiuCanidk commented 1 week ago

Also, when I run SPIDER with the example dataset "RNA", it also retains only 1000 features. Is that because only the 1000 features in the pre-trained model are used?

Ruoqiao2020 commented 1 week ago

About the 1000 features: Your guess is right. SPIDER uses only 1000 RNA features for prediction, and these 1000 RNA features are selected based on the pre-trained data, not the query data. The output message you see is completely normal.
About your model runtime: Your runtime is indeed abnormal, and it is probably an issue related to your server. In your screenshots, the output messages show the runtime for the scArches-SCANVI embedding step. Only after this embedding step has finished will SPIDER continue to the protein prediction steps. I created pseudo RNA data of ~75000 cells, which is around the same size as your data, and tested it on a typical Mac laptop: the scArches-SCANVI embedding step takes only 1 hour and 20 minutes, and the total runtime for SPIDER is around 30 hours. Therefore, if you use a server, which is usually more powerful than a laptop, you should see a SPIDER runtime of < 30 hours. However, in your screenshots, the scArches-SCANVI embedding step alone already takes 70 hours, which is much slower than on a typical laptop.

Below is the normal runtime of the scArches-SCANVI embedding step on a typical laptop, just around 1 hour and 20 minutes:

[Screenshot 2024-09-13 at 4 04 32 PM]

The same applies to your second screenshot, from running the example dataset, which contains 1239 cells. On a typical laptop, the scArches-SCANVI embedding step should take only 1-2 minutes (and the total SPIDER runtime should be around 1 hour). However, in your second screenshot, the scArches-SCANVI embedding step alone already takes 1 hour on your server, which is abnormal and much slower than on a typical laptop.

Are there other tasks running on your server that take up your CPUs? This is a possible reason why your runtime is abnormal.
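To put the gap in numbers, here is a small arithmetic sketch in Python that uses only the embedding-step runtimes quoted in this thread. The variable and function names are made up for illustration, and the figures are rough readings from the screenshots, so treat the ratios as order-of-magnitude estimates.

```python
# Embedding-step (scArches-SCANVI) runtimes quoted in this thread, in hours.
LAPTOP_LARGE_H = 1 + 20 / 60          # ~75000 pseudo cells on a typical laptop
SERVER_LARGE_H = 70.0                 # 74253 cells on the reporter's server
LAPTOP_SMALL_H = (1 / 60, 2 / 60)     # 1239-cell example: 1 to 2 minutes
SERVER_SMALL_H = 1.0                  # same example on the reporter's server


def slowdown(server_hours, laptop_hours):
    """How many times slower the server run is than the laptop run."""
    return server_hours / laptop_hours


large_factor = slowdown(SERVER_LARGE_H, LAPTOP_LARGE_H)
# The small example gives a range because the laptop time is "1-2 minutes".
small_factor_low = slowdown(SERVER_SMALL_H, LAPTOP_SMALL_H[1])
small_factor_high = slowdown(SERVER_SMALL_H, LAPTOP_SMALL_H[0])

print(f"large dataset: server is about {large_factor:.1f}x slower")
print(f"example dataset: server is about {small_factor_low:.0f}x to "
      f"{small_factor_high:.0f}x slower")
```

Both datasets show a slowdown of the same order of magnitude, which is consistent with a problem on the server side, not with the size of the data.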

LiuCanidk commented 1 week ago

Hi @Ruoqiao2020, thanks for your detailed reply. I am not sure about the server's CPU efficiency because it is a distributed server shared among groups of users. I suspect the cause may be some basic configuration issue related to multi-node or multi-user use. Anyway, it doesn't matter that it took 70 hours; I can accept that.

About the speed, can I use a GPU to accelerate the run? In fact, I did not see any parameters for running the process multi-threaded or multi-core, so I guess the server just offers more memory but won't be faster than running the task on a laptop?

Ruoqiao2020 commented 1 week ago

We plan to add GPU options to SPIDER in the future. Currently SPIDER uses only CPUs by default. Still, at the current stage, if you want to reduce the runtime, one way is to divide your RNA data into smaller subsets, run one SPIDER task per RNA subset, and launch the tasks together on your server (I am not sure how much this will help on your server, though, since its current speed is abnormally slow). I would also suggest that you check whether other tasks running on your server are taking up your CPUs by using the htop command in your terminal after you have connected to your server (because your current runtime is indeed abnormal).
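If htop is not available, a rough scripted check of the same thing might look like the sketch below. It uses only the Python standard library (`os.cpu_count` and `os.getloadavg`, the latter Unix-only) and is not part of SPIDER. The function name is hypothetical, and on a shared cluster the core count reported here is for the whole node, which may be more than the cores actually allocated to your job.

```python
import os


def cpu_load_summary():
    """Return (logical cores, 1-minute load average, load per core)."""
    cores = os.cpu_count() or 1          # cpu_count() can return None
    load_1min, _, _ = os.getloadavg()    # Unix only; raises OSError if unavailable
    return cores, load_1min, load_1min / cores


cores, load_1min, load_per_core = cpu_load_summary()
print(f"logical cores: {cores}")
print(f"1-minute load average: {load_1min:.2f}")
print(f"load per core: {load_per_core:.2f}")
```

As a rule of thumb (an assumption, not a SPIDER requirement), a load per core that stays near or above 1 before you launch anything suggests other jobs are already competing for the CPUs.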

LiuCanidk commented 1 week ago

Thanks for your reply. So SPIDER does not depend on cell-cell community (e.g., a kNN graph) to infer cell surface proteins? Can I split my Seurat object into smaller individual datasets, e.g., 1000 cells per subset?

Ruoqiao2020 commented 1 week ago

SPIDER does not depend on a kNN graph to infer cell surface proteins. Yes, you can try 1,000 to 10,000 cells per subset.
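Because the prediction does not rely on a kNN graph, the cells can be partitioned purely by position. A minimal sketch of computing the subset boundaries, written in Python for illustration: the function name `split_into_subsets` is hypothetical and not part of SPIDER, and running SPIDER on each subset is deliberately left out. The ranges are 0-based and half-open, so they would need shifting to 1-based indices before subsetting a Seurat object in R.

```python
def split_into_subsets(n_cells, subset_size):
    """Return half-open (start, stop) index ranges that together cover n_cells."""
    if n_cells < 0:
        raise ValueError("n_cells must be non-negative")
    if subset_size <= 0:
        raise ValueError("subset_size must be positive")
    return [
        (start, min(start + subset_size, n_cells))
        for start in range(0, n_cells, subset_size)
    ]


# 74253 cells (the dataset size from this thread), 10,000 cells per subset.
ranges = split_into_subsets(74253, 10000)
for start, stop in ranges:
    print(f"cells [{start}, {stop}): {stop - start} cells")
```

Each range can then be handed to its own SPIDER task, and the per-subset predictions concatenated in the same order afterwards.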

LiuCanidk commented 1 week ago

Got it, thank you for the patient reply.

Bin-Chen-Lab / spider, issue #3