Error: cannot find "study" in the seurat object

LiuCanidk commented 1 month ago

Hi, thanks for developing this nice tool

I encountered an error when I tried to run SPIDER on my own Seurat data. This is my code:

SPIDER_predict(
    seurat_data = sce,
    tissue = 'hela',
    disease = 'cell_line',
    SPIDER_model_file_path = paste0(prefix, '/SPIDER_python/SPIDER_weight/'),
    use_cell_type = 'SingleR',
    protein = 'All',
    use_pretrain = 'T',  # Using pretrained SPIDER
    save_path = paste0(prefix, '/SPIDER_results/'),
    use_python_path = '/work/home/liucan666/software/mamba/mambaforge/envs/SPIDER/bin/python3',
    scarches_path = paste0(prefix, '/scarches-0.4.0/')
)

It reported an error: Cannot find 'study' in this Seurat object

I then checked the example data "RNA" and found a "study" column in its Seurat metadata, with the value "healthy pancreas".

My Seurat object indeed has no "study" metadata, but I did not find any relevant instructions in the tutorial. Could you please tell me what the "study" column is designed for and how I should set it for my own Seurat object? Or can I just set any study name I like?

Thanks in advance

Ruoqiao2020 commented 1 month ago

Thank you for using SPIDER!

On our GitHub page, we wrote about adding the "study" column in our Step 3:

" For other commonly used parameters here:

"seurat_data": The Seurat object of your transcriptomic data after preprocessing. The Seurat object should include Seurat log normalization, clustering and UMAP reductions. Your meta.data should include a column named "study" which specifies the batch IDs for all cells (this column will be passed to scArches-SCANVI's condition_key parameter). You can look at the example transcriptomic data we provided (using data("sample_query") in R). "

So the "study" column contains the batch IDs for all cells. These batch IDs distinguish cells from different batches (i.e., cells from different experiments or patients), and you can use any name you like for each batch. For example, if your cells were collected in three separate experiments, you have three batches, and you can label them "batch_1", "batch_2", "batch_3" (or anything else, as long as each name represents exactly one batch). In your meta.data, the "study" column simply records which batch each cell belongs to.

Our demo data has only one batch, i.e., all cells in the demo data come from a single experiment. We use the name "healthy pancreas" for that batch (you can of course use any other name).

Based on the code you provided, you can add the "study" column to your RNA data with:

sce[["study"]] = ...

Here "..." stands for the batch names, one for each cell, saying which batch that cell belongs to.
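
As a minimal sketch of what the right-hand side could look like, the example below fills a "study" column on a plain data frame that stands in for sce@meta.data, so it runs without Seurat installed. The `sample_id` column, the cell names, and the batch names are hypothetical; the commented lines at the end show the assumed equivalent assignment on the Seurat object itself.

```r
# Stand-in for sce@meta.data; 'sample_id' is a hypothetical column that
# records which experiment each cell came from.
meta <- data.frame(
  sample_id = c("exp1", "exp1", "exp2", "exp3"),
  row.names = c("cell_1", "cell_2", "cell_3", "cell_4")
)

# Case 1: all cells come from one experiment, so every cell gets one label.
study_single <- rep("batch_1", nrow(meta))

# Case 2: several experiments, so each sample maps to its own batch name.
batch_names <- c(exp1 = "batch_1", exp2 = "batch_2", exp3 = "batch_3")
meta$study <- unname(batch_names[as.character(meta$sample_id)])

# Sanity check before calling SPIDER_predict: the column exists and
# no cell is left without a batch label.
stopifnot("study" %in% colnames(meta), !anyNA(meta$study))

# On the Seurat object (assuming 'sce' has a 'sample_id' metadata column):
# sce$study <- unname(batch_names[as.character(sce$sample_id)])
# or, for a single batch:
# sce$study <- "batch_1"
```

Mapping from an existing metadata column keeps one name per batch, which is the only requirement stated above; the names themselves are arbitrary.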

LiuCanidk commented 1 month ago

@Ruoqiao2020 Hi, thanks for the quick reply! Sorry for overlooking the instruction in the tutorial. I now understand the meaning of the "study" column, but is this batch label passed to scArches to remove the batch effect? I have already done batch effect removal in my Seurat object, so I would rather keep that correction as it is.

By the way, when I ran SPIDER with my own Seurat object, the output looked strange: only 1000 features of my Seurat object were retained, which is clearly not all of the features, and no variable features were present. I am sure that I completed all QC and basic preprocessing steps, including filtering, normalization, scaling, finding HVGs, and clustering. I wonder what went wrong.

Also, SPIDER seems slow for my 74253 cells (~60 hours; the runtime appears to scale linearly from the 1200 cells of the example data). Is there any way to speed up the computation? I did not find any parallel options in the function SPIDER_predict. However, I noticed GPU and TPU options in the output. How can I use the GPU?

Any advice or discussion would be greatly appreciated!

Ruoqiao2020 commented 1 month ago

About the "study" column: You should still specify the batch IDs in the "study" column. This doesn't affect the original batch effect removal slot in your Seurat data. The final prediction results generated by SPIDER will be saved as separate files, not changing your original RNA data file. If you want to use your original batch effect removal slot for data visualization in the future, you can always go back to your original RNA data file.

Your other questions are answered here: https://github.com/Bin-Chen-Lab/spider/issues/3

LiuCanidk commented 1 month ago

@Ruoqiao2020 Thanks for the reply

I understand that the batch effect removal slot will not be changed and that visualization is fine. What I am worried about is whether SPIDER's prediction process uses a batch-corrected embedding, e.g., the Harmony embedding. If so, I think it might be better to use the same batch-corrected embedding as in the Seurat object, rather than a separate one computed by scArches. If not, for example if it only uses the PCA embedding without batch effect removal (in my experience, many tutorials for scRNA-seq analysis tools ignore this and use the PCA embedding directly), then the prediction results would not be convincing, especially as integration of scRNA-seq data is becoming more and more common.

So, I would like to ask whether SPIDER's prediction uses the batch-corrected embedding already stored in my Seurat object.

Thanks in advance

Ruoqiao2020 commented 1 month ago

For its prediction, SPIDER automatically uses the batch effect removal in scArches-SCANVI, whether or not you have already done batch effect removal yourself. That is, SPIDER does not use the Harmony embedding you computed; it uses the scArches-SCANVI embedding, which is itself batch-corrected.

More details are described in our paper.

LiuCanidk commented 1 month ago

Got it. Thanks for your patient reply

Bin-Chen-Lab / spider

Error: cannot find "study" in the seurat object #2