abelson-lab / scATOMIC

Pan-Cancer Single Cell Classifier
MIT License
61 stars 4 forks source link

Where are the individual prediction scores? #25

Closed que-ro closed 9 months ago

que-ro commented 9 months ago

Hi, first of all, thank you for this amazing software!

I am using this software to analyze different liver cancer HCC datasets, and I use the following parameters as they agree better with the annotation of the source paper.

run_scATOMIC(sparse_matrix, mc.cores = 1)
create_summary_matrix(prediction_list = cell_predictions, use_CNVs = F, modify_results = F, mc.cores = 1, raw_counts = sparse_matrix )

Basically, I just launch the script without the cancer cell trimming step of searching for healthy cells in cancer cell clusters.

The problem: I wanted each cell to have the individual prediction score calculated by the last model for the specified cell type label. By changing a piece of your script, I managed to see all the individual prediction scores corresponding to the pred score of the identified cell type. For the first layer, each cell has an individual prediction score, but then, going into further layers, the majority of them lose their individual prediction score.

From what I understood, you first blast all your models to give a first prediction of what could be the final cell type for each cell. Then you go to the next layer of models and you map the previous cell type label with the respective model, and if there is no corresponding model to this cell type annotation, you take the previous prediction score and keep the same label.

Then, what do those NaN values mean? Does it mean that there are some criteria at each layer that says if a cell is going to be processed or not further, like for layer_2_non_blood where some predicted liver cancer cells in layer 1 are bypassed so there are no predictions and like no final prediction for the final "Liver Cancer Cell" classifier layer_5_biliary? Is there an option to force the prediction?

More in detail:

I modified the create_summary_summary function and just removed the median calculations, example for the first layer:

summary_master <-  cbind(summary_master, layer_1)
layer_2 <- c()
score_class_layer_1 <- c()
for (i in 1:nrow(summary_master)){
  if(summary_master[i, "layer_1"] %in% c("HSPC","B cell","CD4+ T cell", "CD8+ T cell", "macrophage_or_dendritic_cell","Mast cell",
                                         "Natural killer cell","Blood_Cell")){

    score_class_layer_1[i] <- as.numeric(prediction_list[["layer_1"]][i,"blood_score"])
    layer_2[i] <- as.character(prediction_list[["layer_2_blood"]][summary_master$cell_names[i],"predicted_tissue_with_cutoff"])

  } else if(summary_master[i, "layer_1"] %in%
            c( "Bile Duct Cancer","Bladder Cancer",  "Bone Cancer",     "Brain Cancer",    "Breast Cancer",
               "Colon/Colorectal Cancer","Endometrial/Uterine Cancer","Esophageal Cancer",
               "Gallbladder Cancer", "Gastric Cancer", "Glial Cells",  "Kidney Cancer", "Liver Cancer", "Lung Cancer",
               "Neuroblastoma",  "Oligodendrocytes","Ovarian Cancer",
               "Pancreatic Cancer",
               "Prostate Cancer", "Skin Cancer",  "Endothelial Cells", "Fibroblasts", "Myofibroblasts",  "Smooth Muscle Cells",
               "Sarcoma", "Tissue_Cell_Normal_or_Cancer")){
    layer_2[i] <- as.character(prediction_list[["layer_2_non_blood"]][summary_master$cell_names[i],"predicted_tissue_with_cutoff"])
    score_class_layer_1[i] <- as.numeric(prediction_list[["layer_1"]][i,"cancer_normal_stromal_score"])

  } else {
    layer_2[i] <- as.character(layer_1[i])
    score_class_layer_1[i] <- NA
  }

I have done that for each layer and it works fine. Then if I look the individual prediction score that are NaN:

idx_liver_cancer <- which(summary_master$layer_6 == "Liver Cancer Cell")
sum(is.na(summary_master$score_class_layer_1[idx_liver_cancer]))
sum(is.na(summary_master$score_class_layer_2[idx_liver_cancer]))
sum(is.na(summary_master$score_class_layer_3[idx_liver_cancer]))
sum(is.na(summary_master$score_class_layer_4[idx_liver_cancer]))
sum(is.na(summary_master$score_class_layer_5[idx_liver_cancer]))
sum(is.na(summary_master$score_class_layer_6[idx_liver_cancer]))

[1] 0
[1] 625
[1] 681
[1] 681
[1] 681
[1] 681

With a total of 981 of cells that are labelled as "Liver Cancer Cell", which is almost in agreement with what the authors of the datasets found out.

inofechm commented 9 months ago

Hi, Thanks for you interest in scATOMIC.

Does it mean that there are some criteria at each layer that says if a cell is going to be processed or not further, like for layer_2_non_blood where some predicted liver cancer cells in layer 1 are bypassed so there are no predictions and like no final prediction for the final "Liver Cancer Cell" classifier layer_5_biliary?

Each cell receives a score at each layer and for each class in the respective layer there is a calculated threshold for the cell to continue to the next model (you can see this in our supplementary note from the publication). You can turn off this threshold and force the terminal predictions by setting the parameter confidence_cutoff = F in run_scATOMIC and create_summary_matrix

if you are looking for the cell type prediction scores you can find these in the list generated by run_scATOMIC, where each layer will have the scores for each class plus the sum of score for the next group in the hierarchy (ie blood score and non blood score in layer 1).

I have done that for each layer and it works fine. Then if I look the individual prediction score that are NaN.

to be honest I am a bit confused about what you did with regards to modifying the code, perhaps the clarifications above will make you not have to do this. For me to be able to understand what these NaNs I would need to take a look at your data.

inofechm commented 9 months ago

if you use the tutorial data in cell 'AAACCTGTCGAATGCT-1' which has the following classifications: results_lung['AAACCTGTCGAATGCT-1',] layer_1 layer_2 layer_3 layer_4 layer_5 layer_6 Tissue_Cell_Normal_or_Cancer Non Stromal Cell Non GI Epithelial Cell Breast/Lung/Prostate Cell Lung Cancer Cell Lung Cancer Cell

you can retrieve the scores for this cell at each layer with the following code:

max(cell_predictions[["layer_1"]]['AAACCTGTCGAATGCT-1',][sapply(cell_predictions[["layer_1"]],is.numeric)]) max(cell_predictions[["layer_2_non_blood"]]['AAACCTGTCGAATGCT-1',][sapply(cell_predictions[["layer_2_non_blood"]],is.numeric)]) max(cell_predictions[["layer_3_non_stromal"]]['AAACCTGTCGAATGCT-1',][sapply(cell_predictions[["layer_3_non_stromal"]],is.numeric)]) max(cell_predictions[["layer_4_non_GI"]]['AAACCTGTCGAATGCT-1',][sapply(cell_predictions[["layer_4_non_GI"]],is.numeric)]) max(cell_predictions[["layer_5_breast_lung_prostate"]]['AAACCTGTCGAATGCT-1',][sapply(cell_predictions[["layer_5_breast_lung_prostate"]],is.numeric)])

que-ro commented 9 months ago

Hi, thank you for the quick answer!

I tried without the cutoff confidence, it identified more liver cancer cells, but I still have the NaN pred score for the terminal classification. This is the command I use to run scATOMIC:

cell_predictions <- run_scATOMIC(sparse_matrix, mc.cores = 1, confidence_cutoff = F)
results <- create_summary_matrix(prediction_list = cell_predictions, use_CNVs = F, modify_results = F, mc.cores = 1, raw_counts = sparse_matrix, confidence_cutoff=F)

I put some examples to show you the detail of what I obtain for two liver cancer cells, one that gets a terminal prediction and one where somehow it still classify it as liver cancer cell, but without specifying the terminal prediction score:

idx_liver_cancer <- which(results$layer_6 == "Liver Cancer Cell")
idx_liver_cancer_with_na_in_score_class_layer_6 <- which(is.na(results$score_class_layer_6) & results$layer_6 == "Liver Cancer Cell")

liver_ccell_with_final_pred <- results$cell_names[idx_liver_cancer[1]]
results[liver_ccell_with_final_pred,]

                           cell_names                      layer_1
AAACCTGTCTCTAAGG-1 AAACCTGTCTCTAAGG-1 Tissue_Cell_Normal_or_Cancer
                            layer_2            layer_3       layer_4
AAACCTGTCTCTAAGG-1 Non Stromal Cell GI Epithelial Cell Billiary Cell
                             layer_5           layer_6 score_class_layer_1
AAACCTGTCTCTAAGG-1 Liver Cancer Cell Liver Cancer Cell               0.784
                   score_class_layer_2 score_class_layer_3 score_class_layer_4
AAACCTGTCTCTAAGG-1               0.608               0.756               0.896
                   score_class_layer_5 score_class_layer_6
AAACCTGTCTCTAAGG-1               0.698               0.698

There, everything looks nice, each layer gives me the cell type classification evolution and the prediction score are there. Now for the majority of liver cancer cells detected, I have this situation:

liver_ccell_wo_final_pred <- results$cell_names[idx_liver_cancer_with_na_in_score_class_layer_6[2]]
results[liver_ccell_wo_final_pred,]

                           cell_names                      layer_1
CCACGGATCATCGGAT-1 CCACGGATCATCGGAT-1 Tissue_Cell_Normal_or_Cancer
                            layer_2            layer_3       layer_4
CCACGGATCATCGGAT-1 Non Stromal Cell GI Epithelial Cell Billiary Cell
                             layer_5           layer_6 score_class_layer_1
CCACGGATCATCGGAT-1 Liver Cancer Cell Liver Cancer Cell               0.924
                   score_class_layer_2 score_class_layer_3 score_class_layer_4
CCACGGATCATCGGAT-1               0.688                  NA                  NA
                   score_class_layer_5 score_class_layer_6
CCACGGATCATCGGAT-1                  NA                  NA

There it also gives me the evolution of cell type annotation but it stops giving me the prediction score after the layer2. That's where I still don't know where in the script it actually blocks or fails to retrieve the predictions from cell_predictions to results through the create_summary_text function.

Looking at cell_predictions, it contains for all liver cancer cells, all the predictions scores: Last and before last layer for cell with final prediction score:

print(cell_predictions[["layer_4_GI"]][liver_ccell_with_final_pred, c("billiary_score", "predicted_tissue_with_cutoff")])
print(cell_predictions[["layer_5_biliary"]][liver_ccell_with_final_pred, c("Liver Cancer", "predicted_tissue_with_cutoff")])
                   billiary_score predicted_tissue_with_cutoff
AAACCTGTCTCTAAGG-1          0.884                Billiary Cell
                   Liver Cancer predicted_tissue_with_cutoff
AAACCTGTCTCTAAGG-1        0.712                 Liver Cancer

Last and before last layer for cell without final prediction score:

print(cell_predictions[["layer_4_GI"]][liver_ccell_wo_final_pred, c("billiary_score", "predicted_tissue_with_cutoff")])
print(cell_predictions[["layer_5_biliary"]][liver_ccell_wo_final_pred, c("Liver Cancer", "predicted_tissue_with_cutoff")])
                   billiary_score predicted_tissue_with_cutoff
TGAGCATAGAGTGAGA-1           0.88                Billiary Cell
                   Liver Cancer predicted_tissue_with_cutoff
TGAGCATAGAGTGAGA-1         0.71                 Liver Cancer

I am not sure of what is happening there, if it is designed this way or if it might be a bug?

I added a zip file containing the patient matrix I am using to launch this script, and the results and cell_predictions in .rds format, if needed. (Note: results will not contain the median values, but the individual scores)

H08_scATOMIC.zip

inofechm commented 9 months ago

Hello I think I found the issue and the solution for the bug that you have in your modified code. The code that you modified is incorrectly matching the cells to their scores: for example in

cell_preds[["layer_2_non_blood"]]['CCACGGACAGGATCGA-1',]
                   Bile Duct Cancer Bladder Cancer Bone Cancer Brain Cancer Breast Cancer Cancer Associated Fibroblasts Cancer Associated Myofibroblasts
CCACGGACAGGATCGA-1             0.01           0.01           0        0.012          0.02                         0.094                            0.054
                   Colon/Colorectal Cancer Endometrial/Uterine Cancer Endothelial Cells Esophageal Cancer Fibroblasts Gallbladder Cancer Gastric Cancer Glial Cells
CCACGGACAGGATCGA-1                   0.064                      0.018             0.036             0.018       0.042                  0          0.032       0.014
                   Kidney Cancer Liver Cancer Lung Cancer Myofibroblasts Neuroblastoma Oligodendrocytes Ovarian Cancer Pancreatic Cancer Prostate Cancer Sarcoma
CCACGGACAGGATCGA-1         0.032        0.306       0.064          0.042         0.002                0          0.034             0.038           0.004   0.004
                   Skin Cancer Smooth Muscle Cells stromal_score non_stromal_score  predicted_class predicted_tissue_with_cutoff
CCACGGACAGGATCGA-1       0.022               0.028         0.296              0.69 Non Stromal Cell             Non Stromal Cell

the score is 0.69 but your modified code produced 0.136

 results['CCACGGACAGGATCGA-1',]
                           cell_names                      layer_1          layer_2            layer_3       layer_4           layer_5           layer_6
CCACGGACAGGATCGA-1 CCACGGACAGGATCGA-1 Tissue_Cell_Normal_or_Cancer Non Stromal Cell GI Epithelial Cell Billiary Cell Liver Cancer Cell Liver Cancer Cell
                   score_class_layer_1 score_class_layer_2 score_class_layer_3 score_class_layer_4 score_class_layer_5 score_class_layer_6
CCACGGACAGGATCGA-1               0.878               0.136                  NA                  NA                  NA                  NA

because you are using i instead of summary_master$cell_names[i] its using the row number (derived from row index in summary matrix which is different than the rows in the prediction matrices) instead of the cell name to index. I believe if you change your code to replace the i with summary_master$cell_names[i] it should work for exmaple in the first instance:

score_class_layer_1[i] <- as.numeric(prediction_list[["layer_1"]][summary_master$cell_names[i],"cancer_normal_stromal_score"])

please let me know if this works and i will close the issue.

que-ro commented 9 months ago

Hi, this was exactly the problem. Thank you very much and sorry that I bothered you with my own mistake.

I will close this issue with the plot I am trying to automatize for each datasets:

H70_agreement_plot_paper_scatomic