abelson-lab / scATOMIC

Pan-Cancer Single Cell Classifier
MIT License

overestimation of cancer cells? #22

Closed seb951 closed 9 months ago

seb951 commented 1 year ago

Hi,

Thanks for making scATOMIC, it looks promising!

I have tested the package on several normal lung scRNA/snRNA samples and have a question: I always get a relatively large fraction (>10%) of cancer cells, even though my samples are from normal, non-cancerous tissue. I understand that you can run create_summary_matrix() with the argument normal_tissue = T to avoid cancer cell annotations in the final data.frame. However, all this seems to do is relabel the potential cancer cells as Normal Tissue Cell (https://github.com/abelson-lab/scATOMIC/blob/bc7adc8486d70b980af149283e85c84e5f7c9abd/R/create_summary_matrix.R#L1896C9-L1896C88); it does not change the prediction itself. As such, I would conclude that the algorithm overestimates the number of cancer cells not only in normal tissue but also in cancer tissue (where normal_tissue = F). Along those lines, I sometimes see close to 90% cancer cells in several tumor samples, which I find hard to believe.

Any suggestions on how to check or correct for the possibility that the algorithm is overestimating the fraction of cancer cells?

best,

sebastien

inofechm commented 1 year ago

Hi Sebastien, thank you for your interest in our work. You are correct about what the normal tissue parameter does. In normal samples, most cells that are not immune, endothelial, or fibroblast cells will initially be annotated as cancer cells. This is because scATOMIC was trained and designed specifically for cancer samples, and there are no normal-tissue-specific cell types in the first part of the classification model; it was, and still is, infeasible to include every human cell type. In cancer samples, after the first classification step, scATOMIC uses a cancer-cell-specific signature scoring approach to identify potential normal tissue cells that were initially annotated as cancer (see Fig. 1f of the paper). This approach assumes that there is at least one cancer population, then checks whether there is a second population among the cells annotated as cancer; if that population downregulates cancer-associated genes, it is labelled as normal. In Figure 3 of our paper we show that this is relatively consistent with inferred copy number variation.

We added the additional normal tissue functionality because people in the community wanted to use scATOMIC on normal samples as well, and we needed a way to account for this. Since the only tissue-specific cell types in the model (the pan-cancer TME hierarchy, Fig. 1a) are cancer cells, the model will likely predict everything that is not immune or stromal as cancer cells. The signature scoring method assumes that one population is always cancer, so it will annotate many cells as cancer.

Regarding ways to check for overestimation, I recommend two approaches. For both of them, as for scATOMIC itself, it is critical that you run each sample separately:

  1. Run a CNV inference method such as copykat or infercnv and check whether the cells predicted as cancer are called aneuploid and the normal tissue cells diploid. In our hands we see ~85% agreement.
  2. Cluster the cells and visualize them on a UMAP. If the cancer cells tend to make up one population, then the prediction is likely correct; however, if the cancer cells are spread across multiple populations and those populations are not all predicted aneuploid, then the cancer fraction is likely overestimated.
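
As a rough illustration of the two checks above, here is a minimal Python sketch that works on per-cell labels exported from one sample. Everything in it is an assumption made for illustration: the function names are hypothetical, the label strings ("cancer", "normal", "aneuploid", "diploid") are simplified placeholders rather than the exact output of scATOMIC or any CNV tool, and the toy data is invented. It is not part of scATOMIC.

```python
from collections import Counter

def cnv_agreement(labels, ploidy):
    """Fraction of cancer/normal cells whose label matches the CNV call.

    A cell agrees when it is cancer+aneuploid or normal+diploid.
    Cells with any other label (e.g. immune) are ignored.
    """
    pairs = [
        (lab, plo)
        for lab, plo in zip(labels, ploidy)
        if lab in ("cancer", "normal") and plo in ("aneuploid", "diploid")
    ]
    if not pairs:
        return float("nan")
    agree = sum((lab == "cancer") == (plo == "aneuploid") for lab, plo in pairs)
    return agree / len(pairs)

def cancer_fraction_per_cluster(labels, clusters):
    """Share of cells labelled cancer within each cluster."""
    totals = Counter(clusters)
    cancer = Counter(c for lab, c in zip(labels, clusters) if lab == "cancer")
    return {c: cancer[c] / totals[c] for c in totals}

# Toy data for ONE sample (hypothetical values, one entry per cell).
labels = ["cancer", "cancer", "cancer", "normal", "normal", "immune"]
ploidy = ["aneuploid", "aneuploid", "diploid", "diploid", "diploid", "diploid"]
clusters = [0, 0, 1, 1, 1, 2]

agreement = cnv_agreement(labels, ploidy)
fractions = cancer_fraction_per_cluster(labels, clusters)
print(f"CNV agreement: {agreement:.2f}")
print("Cancer fraction per cluster:", fractions)
```

A low agreement value, or cancer-labelled cells scattered at intermediate fractions across many clusters, would point towards overestimation; cancer cells concentrated in one mostly aneuploid cluster would support the prediction.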

In most cancer scRNA-seq publications I've seen, a fraction of about 90% cancer cells among the cancer and normal tissue cells is relatively common, so I wouldn't find that too concerning.