AlexsLemonade / OpenScPCA-analysis

An open, collaborative project to analyze data from the Single-cell Pediatric Cancer Atlas (ScPCA) Portal

Validate tumor cell annotations for SCPCS000490 #500

Closed · allyhawkins closed this 3 months ago

allyhawkins commented 3 months ago

Purpose/implementation Section

Please link to the GitHub issue that this pull request addresses.

Closes #479

What is the goal of this pull request?

Before finalizing our annotations of tumor cells, we want to explore the results from marker genes, InferCNV, and CopyKAT and validate these findings. Here I'm adding a notebook that is specific to SCPCS000490 to identify and validate the tumor cell annotations.

I'm also adding a TSV file with the annotations for this sample. Part of this involves creating a folder to store these cell type annotations and the notebooks used to validate them. I chose to put these in the same cell_type_annotations folder rather than with the exploratory notebooks, since this notebook is specific to validating tumor cell annotations. I envision doing the same thing for a few more samples, so I will make a new folder for each sample.

Briefly describe the general approach you took to achieve this goal.

Although this is the easier of the samples, I did create some new plots that I think will be helpful for validating tumor cells in other samples. I might have gone overboard with the heatmaps, but I actually think it's helpful to see which cells express each gene/gene set/CNV. This should help us identify any cells that we may be categorizing incorrectly in future samples.

Another note: I expect these notebooks to be tailored very specifically to each sample. This one is mostly validation rather than an exploration of different classification cutoffs, since the cutoff was pretty obvious to me, but I envision future notebooks will explore different cutoffs in more depth.

If known, do you anticipate filing additional pull requests to complete this analysis module?

Yes

Results

What types of results does your code produce (e.g., table, figure)?

There is a TSV file with the final annotations that right now only includes tumor_cell_classification. My plan is to build on this as we annotate normal cells and tumor cell states.
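As a rough sketch of what that file's structure could look like, here is a minimal example using Python's standard csv module. Only the tumor_cell_classification column name comes from this PR; the barcode column name, the example barcodes, and the output filename are all assumptions for illustration.

```python
import csv

# Hypothetical example rows; only the tumor_cell_classification column
# name is stated in the PR description. Barcodes and the "Ambiguous"
# label are illustrative.
rows = [
    {"barcode": "AAACCTGAGAAACCAT", "tumor_cell_classification": "Tumor"},
    {"barcode": "AAACCTGAGAAACCGC", "tumor_cell_classification": "Normal"},
    {"barcode": "AAACCTGAGAAACGAG", "tumor_cell_classification": "Ambiguous"},
]

# Write a tab-separated file with a header row.
with open("tumor-cell-classifications.tsv", "w", newline="") as handle:
    writer = csv.DictWriter(
        handle,
        fieldnames=["barcode", "tumor_cell_classification"],
        delimiter="\t",
    )
    writer.writeheader()
    writer.writerows(rows)
```

A structure like this makes it straightforward to add columns later (e.g., normal cell types or tumor cell states) without changing existing consumers of the file.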

Provide directions for reviewers

Here's a zip of the rendered notebook for easy review: SCPCL000822_tumor-cell-validation.html.zip

What are the software and computational requirements needed to be able to run the code in this PR?

Nothing special, since this just renders a notebook using previously produced results.

Author checklists

Analysis module and review

Reproducibility checklist

allyhawkins commented 3 months ago

> Rather than trying to get every cell as tumor or normal, why not allow for "unclassified" cells? It would be nice to see where those fall out, and for downstream users to be able to make their own decisions about what to do with cells that are called differently by different algorithms.

What would you call unclassified here? Are you saying that anything called a tumor cell by one method but not the others should fall into this category? I'm fine with having an unclassified group of cells, but at this point I don't feel that any cells in this sample belong in it. I think there's a pretty clear distinction between tumor and non-tumor cells when looking at all the data together.

In terms of people making their own decisions, I'm assuming you're talking about contributors who may use this data as input to downstream analyses. Ultimately, when we port this over to a module in OpenScPCA-nf, I think we may want to report the calls from CopyKAT and InferCNV in addition to our final, thoroughly validated calls. I agree that if we have cells we feel should be "unclassified", we should state that rather than labeling them all "normal", but I think that will be more relevant for the other, more difficult samples.

jashapiro commented 3 months ago

> What would you call unclassified here? Are you saying anything that's called as a tumor cell with one method but not the other methods should fall into this category? I'm fine with having an unclassified group of cells, but I don't actually feel like any of these in this sample should be at this point. I think there's a pretty clear distinction between tumor and non-tumor cells when looking at all the data together.

"Unclassified" may be the wrong word, but yes, the cells where there is disagreement among methods. Part of what I want to see is where these cells are in the various distributions. Are they in the middle?

Looking at the data as I see it, it seems like there could well be some missed tumor cells in the "Normal" group. The number would be small, but I am maybe not as confident as you are!

allyhawkins commented 3 months ago

> > What would you call unclassified here? Are you saying anything that's called as a tumor cell with one method but not the other methods should fall into this category? I'm fine with having an unclassified group of cells, but I don't actually feel like any of these in this sample should be at this point. I think there's a pretty clear distinction between tumor and non-tumor cells when looking at all the data together.
>
> "Unclassified" may be the wrong word, but yes, the cells where there is disagreement among methods. Part of what I want to see is where these cells are in the various distributions. Are they in the middle?
>
> Looking at the data as I see it, it seems like there could well be some missed tumor cells in the "Normal" group. The number would be small, but I am maybe not as confident as you are!

I see what you're saying. I can update to use an unknown class for cells identified as tumor by only one of the methods, add that into the plots, and then based on the results we can decide whether those cells stay unknown or fit better with the tumor or normal cells.
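The consensus rule being discussed can be sketched as follows. This is purely illustrative Python, not the notebook's actual code; the call labels (including "Ambiguous", the label the thread eventually settles on) are taken from this discussion.

```python
def classify_cell(copykat_call: str, infercnv_call: str) -> str:
    """Consensus rule sketched in this thread: agreement between the
    two CNV methods wins; disagreement is flagged for review rather
    than forced into the tumor or normal class."""
    if copykat_call == infercnv_call == "Tumor":
        return "Tumor"
    if copykat_call == infercnv_call == "Normal":
        return "Normal"
    # The methods disagree: flag the cell instead of guessing.
    return "Ambiguous"
```

For example, `classify_cell("Tumor", "Normal")` returns `"Ambiguous"`, which can then be plotted alongside the confident calls to see where the disagreeing cells sit in the score distributions.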

allyhawkins commented 3 months ago

@jashapiro I've made the following changes, which I believe should address your comments:

This should be ready for another look! SCPCL000822_tumor-cell-validation.html.zip

allyhawkins commented 3 months ago

> I think I would not add the extra reannotation here, at least not at this stage. There are some subtle bumps in the "Normal" class with higher chr8 scores, so I would not want to use that as a cutoff on its own. I could see adding a column with just the chr8 scores, but I would want to see this across a few samples before deciding that, as we might expect to find different CNVs in different samples.

Just a general comment: I don't expect alterations in Chr8 in every sample. Genome studies of Ewing sarcoma patients have shown that Chr8, Chr12, Chr16, and Chr1 alterations are the most prevalent, but none is present in every single patient. I used Chr8 here because it is the most striking of those options; without knowing the patient's actual mutation status at the DNA level, and using only the CopyKAT and InferCNV output, I assume that most tumor cells in this sample will carry the Chr8 alteration. So classifying cells based on Chr8 may be something that only applies to this sample.

That being said, I agree that we don't need to reannotate here, and I think it's okay to have some unknown/ambiguous cells. So I kept the plot looking at Chr8 but removed the reannotation.

> I think the real question is: what is the goal of the "final" annotations at this point? Is it better to leave more cells as ambiguous or undefined when we have disagreements, or to try to get as many cells as possible classified, even if we are wrong? My bias tends toward the former, but this case, where we have relatively little disagreement between methods, may not be the best test case. If only calling cells where the methods agree resulted in obviously bad results, or very few net calls, I could reconsider!

The goal here is to annotate as many cells as we can confidently call tumor or normal. I think it's okay to leave some cells unannotated when we don't feel confident in those annotations. Here it's easy because most cells agree, and we have a small population of cells that don't. In this case, I'm fine with leaving those as "Ambiguous" (I updated the label), since we don't have a consensus.

However, in looking at other samples this week, this is not going to be typical. In the other cases we don't have much agreement between methods, so I think we need to investigate which results we trust more and label cells accordingly. In those scenarios, I do think we'll need to refine the classification beyond simply taking the consensus among methods.
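One possible shape for that refinement, sketched here purely for illustration (the trusted-method choice per sample is hypothetical and would come from the kind of per-sample investigation described above):

```python
def classify_with_preference(
    copykat_call: str, infercnv_call: str, trusted: str = "infercnv"
) -> str:
    """Refined rule beyond plain consensus: when the two methods agree,
    keep the agreed call; when they disagree, defer to whichever method
    we decided to trust for this sample instead of labeling the cell
    "Ambiguous". The default of trusting InferCNV is an arbitrary
    placeholder, not a recommendation from this thread."""
    if copykat_call == infercnv_call:
        return copykat_call
    return infercnv_call if trusted == "infercnv" else copykat_call
```

This keeps the consensus behavior for agreeing cells while making the disagreement-resolution policy an explicit, per-sample parameter rather than an implicit choice buried in the notebook.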