AlexsLemonade / OpenScPCA-analysis

An open, collaborative project to analyze data from the Single-cell Pediatric Cancer Atlas (ScPCA) Portal

Validate tumor cell annotations for SCPCS000490 #500

Closed · allyhawkins closed this 3 months ago

allyhawkins commented 3 months ago

Purpose/implementation Section

Please link to the GitHub issue that this pull request addresses.

Closes #479

What is the goal of this pull request?

Before finalizing our annotations of tumor cells, we want to explore the results from marker genes, InferCNV, and CopyKAT and validate these findings. Here I'm adding a notebook that is specific to SCPCS000490 to identify and validate the tumor cell annotations.

I'm also adding a TSV file with the annotations for this sample. Part of this involves creating a folder to store these cell type annotations and the notebooks used to validate them. I chose to put these in the same cell_type_annotations folder rather than with the exploratory notebooks, since this notebook is specific to validating tumor cell annotations. I envision doing the same thing for a few more samples, so I will make a new folder for each sample.

Briefly describe the general approach you took to achieve this goal.

Although this is the easier of the samples, I did create some new plots that I think will be helpful for validating tumor cells in other samples. I might have gone overboard with the heatmaps, but I actually think it's helpful to see which cells express each gene/gene set/CNV. This should help us identify any cells that we may be categorizing incorrectly in future samples.

Another note: I expect these notebooks to be tailored very specifically to each sample. This one is mostly validation rather than an exploration of different classification cutoffs, since the cutoff was pretty obvious to me, but I envision future notebooks will explore different cutoffs in more depth.

If known, do you anticipate filing additional pull requests to complete this analysis module?

Yes

Results

What types of results does your code produce (e.g., table, figure)?

There is a TSV file with the final annotations that right now only includes tumor_cell_classification. My plan is to build on this as we annotate normal cells and tumor cell states.
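As a rough sketch of what that file's structure could look like, here is a minimal example using Python's standard csv module. Only the tumor_cell_classification column name comes from this PR; the barcode column name, the example barcodes, and the output filename are all assumptions for illustration.

```python
import csv

# Hypothetical example rows; only the tumor_cell_classification column
# name is stated in the PR description. Barcodes and the "Ambiguous"
# label are illustrative.
rows = [
    {"barcode": "AAACCTGAGAAACCAT", "tumor_cell_classification": "Tumor"},
    {"barcode": "AAACCTGAGAAACCGC", "tumor_cell_classification": "Normal"},
    {"barcode": "AAACCTGAGAAACGAG", "tumor_cell_classification": "Ambiguous"},
]

# Write a tab-separated file with a header row.
with open("tumor-cell-classifications.tsv", "w", newline="") as handle:
    writer = csv.DictWriter(
        handle,
        fieldnames=["barcode", "tumor_cell_classification"],
        delimiter="\t",
    )
    writer.writeheader()
    writer.writerows(rows)
```

A structure like this makes it straightforward to add columns later (e.g., normal cell types or tumor cell states) without changing existing consumers of the file.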

Provide directions for reviewers

Here's a zip of the rendered notebook for easy review: SCPCL000822_tumor-cell-validation.html.zip

What are the software and computational requirements needed to be able to run the code in this PR?

Nothing special, since this just renders a notebook using previously produced results.

Author checklists

Analysis module and review

Reproducibility checklist

allyhawkins commented 3 months ago

> Rather than trying to get every cell as tumor or normal, why not allow for "unclassified" cells? It would be nice to see where those fall out, and for downstream users to be able to make their own decisions about what to do with cells that are called differently by different algorithms.

What would you call unclassified here? Are you saying that anything called a tumor cell by one method but not the others should fall into this category? I'm fine with having an unclassified group of cells, but at this point I don't feel that any cells in this sample belong in it. I think there's a pretty clear distinction between tumor and non-tumor cells when looking at all the data together.

In terms of people making their own decisions, I'm assuming you're talking about contributors who may use this data as input to downstream analyses. Ultimately, when we port this over to a module in OpenScPCA-nf, I think we may want to report the calls from CopyKAT and InferCNV in addition to our final, thoroughly validated calls. I agree that if we have cells we feel should be "unclassified", we should state that rather than labeling them all "normal", but I think that will be more relevant for the other, more difficult samples.

jashapiro commented 3 months ago

> What would you call unclassified here? Are you saying anything that's called as a tumor cell with one method but not the other methods should fall into this category? I'm fine with having an unclassified group of cells, but I don't actually feel like any of these in this sample should be at this point. I think there's a pretty clear distinction between tumor and non-tumor cells when looking at all the data together.

"Unclassified" may be the wrong word, but yes, the cells where there is disagreement among methods. Part of what I want to see is where these cells are in the various distributions. Are they in the middle?

Looking at the data as I see it, it seems like there could well be some missed tumor cells in the "Normal" group. The number would be small, but I am maybe not as confident as you are!

allyhawkins commented 3 months ago

> > What would you call unclassified here? Are you saying anything that's called as a tumor cell with one method but not the other methods should fall into this category? I'm fine with having an unclassified group of cells, but I don't actually feel like any of these in this sample should be at this point. I think there's a pretty clear distinction between tumor and non-tumor cells when looking at all the data together.
>
> "Unclassified" may be the wrong word, but yes, the cells where there is disagreement among methods. Part of what I want to see is where these cells are in the various distributions. Are they in the middle?
>
> Looking at the data as I see it, it seems like there could well be some missed tumor cells in the "Normal" group. The number would be small, but I am maybe not as confident as you are!

I see what you're saying. I can update to use an unknown class for cells identified as tumor by only one of the methods, add that into the plots, and then based on the results we can decide whether those cells stay unknown or fit better with the tumor or normal cells.
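The consensus rule being discussed can be sketched as follows. This is purely illustrative Python, not the notebook's actual code; the call labels (including "Ambiguous", the label the thread eventually settles on) are taken from this discussion.

```python
def classify_cell(copykat_call: str, infercnv_call: str) -> str:
    """Consensus rule sketched in this thread: agreement between the
    two CNV methods wins; disagreement is flagged for review rather
    than forced into the tumor or normal class."""
    if copykat_call == infercnv_call == "Tumor":
        return "Tumor"
    if copykat_call == infercnv_call == "Normal":
        return "Normal"
    # The methods disagree: flag the cell instead of guessing.
    return "Ambiguous"
```

For example, `classify_cell("Tumor", "Normal")` returns `"Ambiguous"`, which can then be plotted alongside the confident calls to see where the disagreeing cells sit in the score distributions.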

allyhawkins commented 3 months ago

@jashapiro I've made the following changes, which I believe should address your comments:

This should be ready for another look! SCPCL000822_tumor-cell-validation.html.zip

allyhawkins commented 3 months ago

> I think I would not add the extra reannotation here, at least not at this stage. There are some subtle bumps in the "Normal" class with higher chr8 scores, so I would not want to use that as a cutoff on its own. I could see adding a column with just the chr8 scores, but I would want to see this across a few samples before deciding that, as we might expect to find different CNVs in different samples.

Just a general comment: I don't expect alterations in Chr8 in every sample. Genome studies of Ewing sarcoma patients have shown that Chr8, Chr12, Chr16, and Chr1 alterations are the most prevalent, but none is present in every single patient. I used Chr8 here because it is the most striking of those options; without knowing the patient's actual mutation status at the DNA level, and using only the CopyKAT and InferCNV output, I assume that most tumor cells in this sample will carry the Chr8 alteration. So classifying cells based on Chr8 may be something that only applies to this sample.

That being said, I agree that we don't need to reannotate here, and I think it's okay to have some unknown/ambiguous cells. So I kept the plot looking at Chr8 but removed the reannotation.

> I think the real question is: what is the goal of the "final" annotations at this point? Is it better to leave more cells as ambiguous or undefined when we have disagreements, or to try to get as many cells as possible classified, even if we are wrong? My bias tends toward the former, but this case, where we have relatively little disagreement between methods, may not be the best test case. If only calling cells where the methods agree resulted in obviously bad results, or very few net calls, I could reconsider!

The goal here is to annotate as many cells as we can confidently call tumor or normal. I think it's okay to leave some cells unannotated when we don't feel confident in those annotations. Here it's easy because most cells agree, and we have a small population of cells that don't. In this case, I'm fine with leaving those as "Ambiguous" (I updated the label), since we don't have a consensus.

However, in looking at other samples this week, this is not going to be typical. In the other cases we don't have much agreement between methods, so I think we need to investigate which results we trust more and label cells accordingly. In those scenarios, I do think we'll need to refine the classification beyond simply taking the consensus among methods.
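One possible shape for that refinement, sketched here purely for illustration (the trusted-method choice per sample is hypothetical and would come from the kind of per-sample investigation described above):

```python
def classify_with_preference(
    copykat_call: str, infercnv_call: str, trusted: str = "infercnv"
) -> str:
    """Refined rule beyond plain consensus: when the two methods agree,
    keep the agreed call; when they disagree, defer to whichever method
    we decided to trust for this sample instead of labeling the cell
    "Ambiguous". The default of trusting InferCNV is an arbitrary
    placeholder, not a recommendation from this thread."""
    if copykat_call == infercnv_call:
        return copykat_call
    return infercnv_call if trusted == "infercnv" else copykat_call
```

This keeps the consensus behavior for agreeing cells while making the disagreement-resolution policy an explicit, per-sample parameter rather than an implicit choice buried in the notebook.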