abelson-lab / scATOMIC

Pan-Cancer Single Cell Classifier
MIT License

how to accelerate the annotation of a large number (more than 1 million) of cells? #21

Closed xflicsu closed 4 months ago

xflicsu commented 9 months ago

Dear scATOMIC developers, scATOMIC provides us a powerful solution for cell annotation. I wonder how to accelerate the process when we have a large number of cells?

inofechm commented 9 months ago

Hi Xianfeng, thank you for the interest in our work! scATOMIC is meant to be run on each sample separately, and I assume you don't have single samples that contain one million cells. As such, the general approach for very large multi-sample datasets is to first split the data into one count matrix per sample, then run the scATOMIC pipeline on each sample, and finally harmonize the results into one table. For this I will assume you are using a high-performance compute cluster. When I want to annotate a multi-million-cell dataset with scATOMIC, the pipeline I use is as follows:

I use a bash script (launch_scATOMIC.sh) to launch one scATOMIC job per sample (we use Altair Grid Engine as our job scheduler, but this may differ for you):

#!/bin/bash -l
#$ -S /bin/bash
#$ -cwd
module load rstats/4.2.2
Rscript --no-save run_scATOMIC_per_sample.R $1 

Here, run_scATOMIC_per_sample.R is the following R script, and $1 is the sample ID.

#pass argument from bash launch script to R script
args <- commandArgs(trailingOnly = TRUE)
sample_use <- args[1]
#load libraries
library(scATOMIC)
library(plyr)
library(dplyr)
library(data.table)
library(randomForest)
library(caret)
library(parallel)
library(reticulate)
library(Rmagic)
library(Matrix)
library(Seurat)
library(agrmt)
library(cutoff.scATOMIC)
library(copykat)
#load entire count matrix, this can also be a seurat obj, or if you already have count matrices per sample you can skip this.
master_count_matrix = readRDS('PATH_TO_WHOLE_MATRIX')
#load cell-level metadata: a file you make where each cell name in master_count_matrix has a corresponding sampleID
master_metadata = readRDS('PATH_TO_METADATA')
cells_keep = master_metadata[which(master_metadata$sampleID == sample_use), 'cell_name']
sample_count_mat = master_count_matrix[,cells_keep]
#basic QC: keep cells with <25% mitochondrial reads and >500 detected genes
pct_mt <- colSums(sample_count_mat[grep("^MT-", row.names(sample_count_mat)),])/colSums(sample_count_mat) * 100
nFeatureRNA <- colSums(sample_count_mat > 0)
sample_count_mat <- sample_count_mat[, names(which(pct_mt < 25))]
sample_count_mat <- sample_count_mat[, intersect(names(which(nFeatureRNA > 500)), colnames(sample_count_mat))]
#run scATOMIC step 1
cell_predictions <- run_scATOMIC(sample_count_mat)
#run scATOMIC step 2
results_scATOMIC <- create_summary_matrix(prediction_list = cell_predictions, raw_counts = sample_count_mat)
results_scATOMIC$sample <- sample_use
#saving results in a directory that you have previously created to store them
fwrite(results_scATOMIC, paste0("scATOMIC_annotations_per_sample/scATOMIC_annotations_sample_used_",sample_use, "___.csv.gz" ))
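If you don't have access to a job scheduler, a rough alternative is to parallelise across samples within a single R session. This is only a sketch under assumptions not in the thread above: it assumes you have already split the master matrix into a named list of per-sample count matrices (sample_matrices is a placeholder name), and that you have enough memory to hold several samples at once, since memory use scales with mc.cores.

```r
library(parallel)
library(data.table)

# Placeholder: a named list of per-sample count matrices, e.g. built by
# subsetting master_count_matrix by sampleID as in the script above.
# sample_matrices <- ...

annotate_one <- function(sample_use, sample_count_mat) {
  #scATOMIC step 1 and step 2, exactly as in the per-sample script
  cell_predictions <- run_scATOMIC(sample_count_mat)
  results <- create_summary_matrix(prediction_list = cell_predictions,
                                   raw_counts = sample_count_mat)
  results$sample <- sample_use
  results
}

#run a handful of samples in parallel; keep mc.cores modest to limit memory
results_list <- mclapply(names(sample_matrices), function(s) {
  annotate_one(s, sample_matrices[[s]])
}, mc.cores = 4)

master_scATOMIC_list <- rbindlist(results_list)
```

This trades cluster throughput for convenience; on a single machine the per-sample jobs run a few at a time instead of all at once.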

To launch all the jobs, I have a file called samples.txt containing all the unique sample IDs, one per line, and I run the following on the HPC command line:

for i in $(cat samples.txt); do qsub -l h_vmem=50G launch_scATOMIC.sh $i ; done
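If you don't already have samples.txt, it can be generated from any tabular dump of your cell-level metadata. A minimal, hypothetical sketch, assuming a CSV with a header row and the sample ID in the second column (metadata.csv and its layout are assumptions; adapt the column index to your file):

```shell
# Demo metadata: cell_name,sampleID pairs (stand-in for your real export)
printf 'cell_name,sampleID\ncell_1,S1\ncell_2,S2\ncell_3,S1\n' > metadata.csv
# Take the sampleID column, drop the header, keep unique IDs one per line
cut -d',' -f2 metadata.csv | tail -n +2 | sort -u > samples.txt
cat samples.txt
```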

Finally, after all samples are run (this shouldn't take more than an hour, unless you have a long queue to wait for your jobs to run), I run a short script to concatenate the results:

library(data.table)
#full.names = TRUE so fread gets paths relative to the working directory
scATOMIC_files <- list.files("scATOMIC_annotations_per_sample", full.names = TRUE)
scATOMIC_res_list <- list()
for (i in seq_along(scATOMIC_files)) {
  scATOMIC_res_list[[i]] <- fread(scATOMIC_files[i])
}
master_scATOMIC_list <- rbindlist(scATOMIC_res_list)
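To keep the harmonized table for downstream use you can write it back out; the filename here is just an example, and the prediction column is called scATOMIC_pred in recent scATOMIC versions (check the column names of your output if this differs):

```r
#save the combined annotations (fwrite compresses based on the .gz extension)
fwrite(master_scATOMIC_list, "scATOMIC_annotations_all_samples.csv.gz")
#quick sanity check: cell counts per predicted cell type across all samples
table(master_scATOMIC_list$scATOMIC_pred)
```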

I hope this helps. Ido