Closed allyhawkins closed 3 months ago
Rather than trying to get every cell as tumor or normal, why not allow for "unclassified" cells? It would be nice to see where those fall out, and for downstream users to be able to make their own decisions about what to do with cells that are called differently by different algorithms.
What would you call unclassified here? Are you saying anything that's called as a tumor cell with one method but not the other methods should fall into this category? I'm fine with having an unclassified group of cells, but I don't actually feel like any of these in this sample should be at this point. I think there's a pretty clear distinction between tumor and non-tumor cells when looking at all the data together.
In terms of people making their own decisions, I'm assuming you are talking about contributors that may use this data as input to downstream analysis. Ultimately when we port this over to a module in OpenScPCA-nf, I think we may want to report the calls from CopyKAT and InferCNV in addition to our final calls which have been thoroughly validated. I agree that if we have cells we feel should be "unclassified" then we should state that rather than having them all be "normal", but I think that will be more relevant in the other more difficult samples.
"Unclassified" may be the wrong word, but yes, the cells where there is disagreement among methods. Part of what I want to see is where these cells are in the various distributions. Are they in the middle?
Looking at the data as I see it, it seems like there could well be some missed tumor cells in the "Normal" group. The number would be small, but I am maybe not as confident as you are!
I see what you're saying. I can update to add an unknown class for cells where the methods disagree on the tumor call, add that to the plots, and then based on the results we can decide whether they stay unknown or fit better with the tumor or normal cells.
@jashapiro I've made the following changes which I believe should address your comments:
This should be ready for another look! SCPCL000822_tumor-cell-validation.html.zip
I think I would not add the extra reannotation here, at least not at this stage. There are some subtle bumps in the "Normal" class with higher chr8 scores, so I would not want to use those scores as a cutoff on their own. I could see adding a column with just the chr8 scores, but I would want to see this across a few samples before deciding that, as we might expect to find different CNVs in different samples.
Just a general comment that I don't expect alterations in Chr8 in every sample. Genome studies of Ewing sarcoma patients have shown that alterations in Chr8, Chr12, Chr16, and Chr1 are the most prevalent but are not present in every single patient. I was using Chr8 here because it is the most striking of those options. Without knowing the actual mutation status of the patient at the DNA level, and based only on the CopyKAT and InferCNV output, I assume that most tumor cells in this sample will have the Chr8 alteration. So classifying cells by Chr8 may be something that only happens with this sample.
That being said, I agree that we don't need to reannotate here, and I think it's okay to have some unknown/ambiguous cells. So I kept the plot looking at Chr8 but removed the reannotation.
I think the real question is what is the goal of the "final" annotations at this point? Is it better to leave more cells as ambiguous or undefined when we have disagreements, or to try to get as many cells as possible classified, even if we are wrong? My bias tends to be toward the former, but this case, where we have relatively little disagreement between methods, may not be the best test case. If only calling where the methods agree resulted in obviously bad results, or very few net calls, I could reconsider!
The goal here is to annotate as many cells as tumor or normal as we can confidently annotate. I think it's okay to leave cells unannotated when we don't feel confident in those annotations. Here it's easy because the methods agree for most cells and only a small population of cells has conflicting calls. In this case, I am fine with leaving those as "Ambiguous" (I updated the label) since we don't have a consensus.
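To make the consensus rule concrete, here is a minimal sketch in Python rather than the notebook's own code. The barcodes, the per-method dictionaries, and the `consensus_call` helper are all hypothetical; only the "Ambiguous" label and the `tumor_cell_classification` name come from this PR.

```python
from collections import Counter


def consensus_call(calls):
    """Return the shared label when every method agrees, else 'Ambiguous'."""
    if not calls:
        raise ValueError("at least one call is required")
    return calls[0] if all(call == calls[0] for call in calls) else "Ambiguous"


# Hypothetical per-cell calls keyed by cell barcode (not real data).
copykat_calls = {"cell_1": "Tumor", "cell_2": "Normal", "cell_3": "Tumor", "cell_4": "Normal"}
infercnv_calls = {"cell_1": "Tumor", "cell_2": "Normal", "cell_3": "Normal", "cell_4": "Normal"}

# One final label per cell: the agreed call, or "Ambiguous" on disagreement.
tumor_cell_classification = {
    barcode: consensus_call([copykat_calls[barcode], infercnv_calls[barcode]])
    for barcode in copykat_calls
}

# Tally how many cells fall into each final class.
class_counts = Counter(tumor_cell_classification.values())
print(class_counts)
```

Because the helper takes a list, a third source of calls such as marker genes could be added without changing the rule.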
However, in looking at other samples this week, this is not going to be typical. In the other cases we don't have a lot of agreement between methods, so I think we need to investigate which results we trust more and label cells accordingly. In those scenarios, I do think we are going to need to refine the classification beyond just taking the consensus among methods.
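One rough way to flag samples where a plain consensus is unlikely to be enough is to measure how often two methods agree. This is only an illustrative Python sketch: the `agreement_fraction` helper and the example calls are hypothetical, and the cutoff for "too little agreement" would still need to be chosen by looking at real samples.

```python
def agreement_fraction(calls_a, calls_b):
    """Fraction of shared cells that received the same label from both methods."""
    shared = calls_a.keys() & calls_b.keys()
    if not shared:
        raise ValueError("the two methods share no cells")
    n_agree = sum(calls_a[barcode] == calls_b[barcode] for barcode in shared)
    return n_agree / len(shared)


# Hypothetical calls for one sample (not real data).
copykat_calls = {"cell_1": "Tumor", "cell_2": "Normal", "cell_3": "Tumor", "cell_4": "Normal"}
infercnv_calls = {"cell_1": "Tumor", "cell_2": "Normal", "cell_3": "Normal", "cell_4": "Normal"}

fraction = agreement_fraction(copykat_calls, infercnv_calls)
print(f"Methods agree on {fraction:.0%} of cells")
```

A low fraction would suggest that the sample needs the closer per-method investigation described above before any final labels are assigned.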
Purpose/implementation Section
Please link to the GitHub issue that this pull request addresses.
Closes #479
What is the goal of this pull request?
Before finalizing our annotations of tumor cells, we want to explore the results from marker genes, InferCNV, and CopyKAT and validate these findings. Here I'm adding a notebook that is specific to SCPCS000490 to identify and validate the tumor cell annotations.
I'm also adding a TSV file with the annotations for this sample. Part of this is making a folder to store these cell types and the notebooks used to validate the annotations. I chose to put these in the same cell_type_annotations folder rather than the exploratory notebooks since this notebook is specific to validating tumor cell annotations. I envision doing the same thing for a few more samples so will make a new folder for each sample.
Briefly describe the general approach you took to achieve this goal.
Although this is the easier of the samples, I did create some new plots that I think will be helpful in validating tumor cells in other samples. I might have gone overboard with the heatmaps, but I actually think it's helpful to see which cells have expression of each gene/gene set/CNV. This should help us identify any cells that we may be categorizing incorrectly in future samples.
Another note is that I expect these notebooks to be tailored very specifically to each sample. This one is mostly validation rather than looking at different classification cutoffs since it was pretty obvious to me, but I envision future notebooks will have some more exploration of different cutoffs.
If known, do you anticipate filing additional pull requests to complete this analysis module?
Yes
Results
What types of results does your code produce (e.g., table, figure)?
There is a TSV file with the final annotations that right now only includes tumor_cell_classification. My plan is to build on this as we annotate normal cells and tumor cell states.
Provide directions for reviewers
Here's a zip of the rendered notebook for easy review: SCPCL000822_tumor-cell-validation.html.zip
What are the software and computational requirements needed to be able to run the code in this PR?
Nothing special since it's just rendering a notebook using previously produced results.
Author checklists
Analysis module and review
README.md has been updated to reflect code changes in this pull request.
Reproducibility checklist
Dockerfile
environment.yml file
renv.lock file