abelson-lab / scATOMIC

Pan-Cancer Single Cell Classifier
MIT License

Running scATOMIC on known cancer type samples #7

Closed u2sj closed 1 year ago

u2sj commented 1 year ago

When I run the code as follows:

results <- create_summary_matrix(prediction_list = cell_predictions, use_CNVs = F, modify_results = T, mc.cores = 4, raw_counts = lung_cancer_demo_data, min_prop = 0.5)

Reviewing the results for my soft tissue tumor datasets (particularly osteosarcoma), I noticed that cells were labeled as brain cancer cells and normal cells, but none were identified as osteosarcoma cells. I am therefore considering running scATOMIC with a known cancer type. However, I am unsure how to specify the known_cancer_type parameter.

Thanks!

u2sj commented 1 year ago

Additionally, will running create_summary_matrix() with use_CNVs = TRUE increase the accuracy of tumor cell identification?

inofechm commented 1 year ago

Unfortunately, osteosarcoma is one of the cancer types that did not have many samples to train with, so the model does not perform well on it. We note this in the tutorial (where we refer to it as bone cancer). I think it is one of the only cancer types where we get this poor performance. In this case, because you know it is osteosarcoma, you can run the following:

results <- create_summary_matrix(prediction_list = cell_predictions, use_CNVs = F, modify_results = T, mc.cores = 4, raw_counts = lung_cancer_demo_data, min_prop = 0.5, known_cancer_type = "Osteosarcoma cell")

This should convert all brain cancer labels to osteosarcoma.

The use_CNVs parameter simply adds a CNV inference column to the results, which you can decide whether to use; it essentially tries to validate the cancer annotation. In my experience it isn't usually necessary and is more of a validation step.

u2sj commented 1 year ago

cell_predictions <- run_scATOMIC(sparse_matrix, confidence_cutoff = T, mc.cores = (parallel::detectCores() - 8))
results_lung <- create_summary_matrix(prediction_list = cell_predictions, use_CNVs = T, modify_results = T, mc.cores = 8, known_cancer_type = "Osteosarcoma cell", raw_counts = sparse_matrix, min_prop = 0.5)

Calculated graph and diffusion operator in 13.44 seconds.
/home/ubuntu/anaconda3/envs/r-reticulate/lib/python3.9/site-packages/magic/magic.py:455: UserWarning: Returning imputed values for all genes on a (5602 x 34272) matrix will require approximately 1.43GB of memory. Suppress this warning with genes='all_genes'
warnings.warn(
Running MAGIC with solver='exact' on 34272-dimensional data may take a long time. Consider denoising specific genes with genes=<list-like> or using solver='approximate'.
Calculating imputation...
Calculated imputation in 17.00 seconds.
Calculated MAGIC in 30.74 seconds.
Warning: Keys should be one or more alphanumeric characters followed by an underscore, setting key from magicrna to magicrna_
[1] "Added MAGIC output to MAGIC_RNA. To use it, pass assay='MAGIC_RNA' to downstream methods or set seurat_object@active.assay <- 'MAGIC_RNA'."
[1] "Sample classification confidence = 1.00"
projectdone[1] "Starting Layer 1"
/home/ubuntu/anaconda3/envs/r-reticulate/lib/python3.9/site-packages/magic/magic.py:425: UserWarning: Input matrix contains unexpressed genes. Please remove them prior to running MAGIC.
warnings.warn(
[1] "Done Layer 1"
[1] "Starting Layer 2 Blood"
/home/ubuntu/anaconda3/envs/r-reticulate/lib/python3.9/site-packages/magic/magic.py:425: UserWarning: Input matrix contains unexpressed genes. Please remove them prior to running MAGIC.
warnings.warn(
[1] "Done Layer 2 Blood"
[1] "Starting Layer 3 TNK"
[1] "Done Layer 3 TNK"
[1] "Starting Layer 4 CD4 CD8"
[1] "nothing to score in this layer"
[1] "Done Layer 4 CD4 CD8"
[1] "Starting Layer 4 CD8 NK"
[1] "nothing to score in this layer"

I found that sometimes the code stops running here. What could be the reason? Thanks!

inofechm commented 1 year ago

This can sometimes happen when there is very low variance between scores in a classification node. If it happens, either re-run, or avoid it altogether by setting confidence_cutoff = F in both run_scATOMIC and create_summary_matrix. In that case, however, every cell will be classified to the terminal level. This only impacts roughly 1-5% of cells, so if that helps your workflow, I would try that.

u2sj commented 1 year ago

cell_predictions <- run_scATOMIC(sparse_matrix, confidence_cutoff = T, mc.cores = (parallel::detectCores() - 8))
results_lung <- create_summary_matrix(prediction_list = cell_predictions, use_CNVs = T, modify_results = T, mc.cores = 8, known_cancer_type = "Osteosarcoma cell", raw_counts = sparse_matrix, min_prop = 0.5)

Running Louvain algorithm...
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**|
Maximum modularity in 10 random starts: 0.9081
Number of communities: 14
Elapsed time: 1 seconds
[1] "step1: read and filter data ..."
[1] "34272 genes, 8265 cells in raw data"
[1] "6773 genes past LOW.DR filtering"
[1] "WARNING: low data quality; assigned LOW.DR to UP.DR..."
[1] "step 2: annotations gene coordinates ..."
[1] "start annotation ..."
[1] "step 3: smoothing data with dlm ..."
[1] "step 4: measuring baselines ..."
[1] "6643 known normal cells found in dataset"
[1] "run with known normal..."
[1] "baseline is from known input"
[1] "step 5: segmentation..."
[1] "step 6: convert to genomic bins..."
[1] "step 7: adjust baseline ..."
[1] "step 8: final prediction ..."
Error in hclust(parallelDist::parDist(t(mat.adj), threads = n.cores, method = distance), : NA/NaN/Inf in foreign function call (arg 10)
In addition: Warning messages:
1: In asMethod(object) : sparse->dense coercion: allocating vector of size 2.1 GiB
2: In asMethod(object) : sparse->dense coercion: allocating vector of size 2.1 GiB
3: In asMethod(object) : sparse->dense coercion: allocating vector of size 2.1 GiB
4: In hc.umap == which(cl.ID == max(cl.ID)) : longer object length is not a multiple of shorter object length
5: In hc.umap == which(cl.ID == min(cl.ID)) : longer object length is not a multiple of shorter object length

Unfortunately, a new error has occurred.

inofechm commented 1 year ago

Hello again. This is an error in CopyKAT's code, and I can't resolve it. I believe some cells in your data have zero variance between them, which is leading to the error. Have you filtered out low-quality cells?
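One quick way to test the zero-variance idea is to look for cells (columns) in the raw count matrix that are entirely zero or constant across genes. The sketch below is only a first diagnostic: `find_flat_cells` is a hypothetical helper, not part of scATOMIC or CopyKAT, and the toy matrix stands in for a real genes-by-cells count matrix. CopyKAT filters and smooths the data internally, so a clean result here does not rule out the error.

```r
library(Matrix)

# Hypothetical helper (not a scATOMIC or CopyKAT function): return the
# indices of cells (columns) whose counts are all zero or constant.
find_flat_cells <- function(counts) {
  n_genes <- nrow(counts)
  col_sum <- Matrix::colSums(counts)
  col_sum_sq <- Matrix::colSums(counts^2)
  # Per-cell variance as E[x^2] - E[x]^2, computed without densifying
  col_var <- col_sum_sq / n_genes - (col_sum / n_genes)^2
  which(col_sum == 0 | col_var < 1e-12)
}

# Toy genes x cells matrix standing in for a real count matrix
toy <- Matrix::Matrix(
  matrix(c(3, 0, 1,    # cell_a: variable counts
           0, 0, 0,    # cell_b: all zero
           2, 2, 2),   # cell_c: constant counts
         nrow = 3,
         dimnames = list(paste0("gene", 1:3),
                         c("cell_a", "cell_b", "cell_c"))),
  sparse = TRUE
)

flat <- find_flat_cells(toy)

# Keep only the cells that were not flagged; setdiff() avoids the pitfall
# of negative indexing with an empty vector, which would drop every cell.
filtered <- toy[, setdiff(seq_len(ncol(toy)), flat), drop = FALSE]
```

If any cells are flagged in a failing sample, removing them before re-running would be a reasonable next thing to try.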

u2sj commented 1 year ago

Thank you for your response. I have already applied strict quality control using DoubletFinder and thresholds such as nfeature > 500 and mt < 15%. However, the issue still occurs in some samples, although most samples run smoothly. We hope to run it on all samples. Are there any other quality control requirements?

inofechm commented 1 year ago

I'm not sure. It seems to be an issue with all-zero columns in the matrix going into CopyKAT (see https://support.bioconductor.org/p/45944/). I would recommend running everything with use_CNVs = F and then either running CopyKAT separately on each sample or using a different CNV inference tool. I find that the CopyKAT GitHub repository is not responsive or actively maintained, so there is not much I can do to address this bug.
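When processing many samples, it may help to isolate each sample so that one failure does not stop the rest. The sketch below shows one way to do that with `tryCatch`. `run_per_sample` and `toy_annotate` are hypothetical names used for illustration only; in a real workflow the annotation function would wrap the `run_scATOMIC` and `create_summary_matrix` calls shown earlier in this thread, with use_CNVs = F.

```r
# Hypothetical helper: apply `annotate_fun` to every sample and record
# errors instead of aborting the whole loop.
run_per_sample <- function(samples, annotate_fun) {
  lapply(samples, function(sample_counts) {
    tryCatch(
      list(ok = TRUE, result = annotate_fun(sample_counts), error = NULL),
      error = function(e) {
        list(ok = FALSE, result = NULL, error = conditionMessage(e))
      }
    )
  })
}

# Toy stand-ins: the second "sample" triggers an error, loosely mimicking
# a sample whose CNV inference step fails.
toy_samples <- list(s1 = c(1, 2, 3), s2 = c(1, NA, 3))
toy_annotate <- function(x) {
  if (anyNA(x)) stop("NA/NaN/Inf in input")
  sum(x)
}

outcomes <- run_per_sample(toy_samples, toy_annotate)

# Names of the samples that failed, for separate follow-up
failed <- names(outcomes)[!vapply(outcomes, function(o) o$ok, logical(1))]
```

The failed samples can then be handled separately, for example with a different CNV inference tool. Note that `tryCatch` only captures R errors; it would not help if a run hangs without raising one.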