AlexsLemonade / OpenScPCA-analysis

An open, collaborative project to analyze data from the Single-cell Pediatric Cancer Atlas (ScPCA) Portal

Annotate tumor cells in SCPCS000492 #532

Closed allyhawkins closed 2 months ago

allyhawkins commented 3 months ago

Purpose/implementation Section

Please link to the GitHub issue that this pull request addresses.

Closes #480

What is the goal of this pull request?

Here I am adding an exploratory notebook that looks at the results from multiple tumor cell annotation methods (output from annotate-tumor-cells.sh) and aims to annotate the tumor cells in SCPCS000492. The output is a TSV file with annotations as either Tumor, Normal, or Ambiguous.

Briefly describe the general approach you took to achieve this goal.

This was not as straightforward as SCPCS000490, where we took the consensus between CNV methods and annotated those cells as tumor. Looking at the initial results from CopyKAT, InferCNV, and marker gene-based classification, we don't see much agreement in which cells are annotated as tumor. In particular, cells with high marker gene expression are not necessarily the ones with higher CNV proportions. Additionally, we don't see many cells with the CNVs expected in Ewing samples, and the distinction between what would be considered normal and tumor cells is not as clear.

Because of this I ended up looking at the gene set scores from the EWS-FLI1 gene sets and using those to classify cells as tumor and normal cells. I was more wary of the CNV results. The gene set scores appeared to mostly line up with the marker gene expression, so ultimately I classified cells as tumor if they had above the mean gene set score for all three gene sets and were considered tumor cells by marker gene analysis. This is probably on the stricter side.
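A minimal sketch of that classification rule (hypothetical data frame and column names; `geneset_1`–`geneset_3` stand in for the per-cell scores from the three EWS-FLI1 gene sets, and `marker_class` for the marker gene-based call):

```r
library(dplyr)

# celltype_df: one row per cell, with three gene set score columns and a
# marker-gene-based call of "Tumor" or "Normal"
classified_df <- celltype_df |>
  mutate(
    # cell must be above the mean score for all three gene sets
    above_mean = geneset_1 > mean(geneset_1) &
      geneset_2 > mean(geneset_2) &
      geneset_3 > mean(geneset_3),
    tumor_cell_classification = case_when(
      above_mean & marker_class == "Tumor" ~ "Tumor",
      !above_mean & marker_class == "Normal" ~ "Normal",
      TRUE ~ "Ambiguous"
    )
  )
```

Requiring agreement between both signals is what makes this a stricter call; cells where the two disagree fall into the `Ambiguous` bucket.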

I also tried using SingleR to classify these cells, with the annotations from SCPCS000490 as the reference. I then compared the SingleR annotations to simply setting cutoffs on gene set scores and marker genes. The SingleR annotations actually capture cells that are considered tumor by both gene expression and CNV methods. I tend to trust those annotations more, since the cutoffs are less arbitrary and they draw on information from a sample we feel much more confident in.

If known, do you anticipate filing additional pull requests to complete this analysis module?

Yes

Results

What is the name of your results bucket on S3?

Just noting that there are no results currently on S3, but the annotations file is ~250 KB, so it doesn't pass pre-commit. Maybe we should be saving those to S3 rather than trying to save them to the repo?

What types of results does your code produce (e.g., table, figure)?

The output is a TSV file with two classification columns - one obtained from SingleR and one from using marker gene and gene set classification. Both columns label cells as either tumor, normal, or ambiguous.

What is your summary of the results?

Generally, I think SingleR is the more reliable form of annotation here, but it is not without bias and relies on having a reference we feel really good about. Originally, I wanted to annotate multiple samples to use as a SingleR reference for the rest, but I now wonder whether, if the remaining samples don't show a clear consensus between CNV and marker gene methods, we should just use SingleR with the "good" sample.

Also, I tend to believe these samples will contain a high proportion of tumor cells and a low proportion of normal cells, so I expect the majority to be tumor cells, which is what we are seeing with SingleR.

Here's a full copy of the rendered notebook that includes commentary on the results:

SCPCL000824_tumor-cell-validation.html.zip

Provide directions for reviewers

What are the software and computational requirements needed to be able to run the code in this PR?

Should be able to render the notebook locally in R without any special requirements.

Are there particular areas you'd like reviewers to look at closely?

Generally, I struggled a lot with this sample since there was no clear consensus between methods. I do tend to trust the gene expression-based classification over the CNV methods. If there are other plots that would be helpful to show, please let me know.

Is there anything that you want to discuss further?

Just another note that I can't commit the output TSV file at this point. I could either add the TSV files to the results directory instead and sync to S3, or make an exception for TSV files in this directory in the pre-commit config file.

Author checklists

Analysis module and review

Reproducibility checklist

jashapiro commented 3 months ago

> What is the name of your results bucket on S3?
>
> Just noting that there are no results currently on S3, but the annotations file is ~250 KB so it doesn't pass pre-commit. So maybe we should be saving those in S3 rather than trying to save them to the repo?

I did have a couple of thoughts about file size issues too, before I go. One is that you could just change this to output `.tsv.gz`, and things would shrink a lot. Alternatively, you could write out files in some other format, like parquet or arrow. The disadvantage of all of these is that you lose the ability to easily diff the files, but for intermediate files that might be okay, and having them in the repo for easier review/inspection by outside users may be worth the tradeoff.
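For what it's worth, a sketch of both options (hypothetical data frame and file names; readr's `write_tsv()` compresses automatically when the output path ends in `.gz`):

```r
library(readr)

# Write directly to a gzipped TSV; readr infers the compression
# from the file extension
write_tsv(annotations_df, "SCPCL000824_annotations.tsv.gz")

# Alternative: a columnar format via the arrow package
# (smaller and faster to read, but not easily diff-able in git)
# arrow::write_parquet(annotations_df, "SCPCL000824_annotations.parquet")
```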

I think this is probably a larger issue that we want to have a more general discussion about: when do we want to keep outputs in the repo vs. on S3 & how much is it worth to use text files vs. other formats that may be much more efficient, while maintaining ease of use.

allyhawkins commented 3 months ago

> A quick alternative is to look more at the raw (well, normalized) marker gene scores or gene set scores: what were the cutoff values from the first dataset, and what happens if you apply that as the cutoff here? Do you get a more expected separation?
>
> I guess what I would do first is plot the marker gene and gene set expression scores from SCPCL000822 and SCPCL000824 together, to see where the SCPCL000824 scores fall relative to the clear bimodal scores from SCPCL000822. Once we have that, we may have a better sense of where to go next. Hopefully we won't have to go full mixture model, but we might...

I just want to note that for classifying cells using marker gene expression we are:

This same approach is being done for both methods and they both show bimodal distributions of marker genes.

SCPCL000822

[Screenshot 2024-06-18 at 4:48:43 PM]

SCPCL000824

[Screenshot 2024-06-18 at 4:49:25 PM]

As for the gene set scores, we never actually used them to classify cells in SCPCL000822; we only used them to validate that the gene set scores were indeed higher in tumor cells than in normal cells. You are correct that here the gene set scores do not show the bimodal distribution they did in SCPCL000822.

[image attachment]

So we didn't actually have any cutoffs where we could use the same value in both samples, but I did test taking the mean of each of the SCPCL000822 distributions and using those as the cutoffs (I also estimated sensible cutoffs from the plot, which are roughly equivalent to the means). When I do that, the results don't really change in terms of which cells get called as tumor cells.

This is an example plot from classifying using one of the gene sets:

[image attachment]

I also want to note that I did try using AUCell to classify these cells with each of the gene sets and because of the lack of bimodal distribution, that didn't work very well. Only a few cells were classified as tumor and there was no clear cutoff.

Generally, samples look more like this one where most cells are tumor cells so there probably won't be a clear bimodal distribution.

> On the other hand, I am not sure I am comfortable with using SingleR with only 2 classes: it seems likely that it will overfit those two groups. Not sure the immediate solution here, though we could perhaps see what happens with multiple references: one with just tumor-normal, one with more cell types from normal tissue?

I wonder if we could combine a celldex reference with the tumor cells from SCPCL000822 to create a custom reference? This way we would have more than the two classes. This might avoid the overfitting problem and allow for SingleR to hopefully better distinguish between potential normal cell types and then tumor cells.
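A rough sketch of what that combined reference could look like (hypothetical object names; SingleR accepts a list of references with a matching list of label vectors, and celldex provides ready-made normal tissue references):

```r
library(SingleR)
library(celldex)

# Hypothetical objects:
#   sce_824 - the query SingleCellExperiment for SCPCL000824 (log-normalized)
#   ref_822 - the annotated SCPCL000822 object, with a "tumor_normal"
#             column in its colData holding the validated calls
hpca <- celldex::HumanPrimaryCellAtlasData()

# Combine the celldex normal-tissue reference with the tumor/normal
# reference from SCPCL000822
pred <- SingleR(
  test = sce_824,
  ref = list(hpca, ref_822),
  labels = list(hpca$label.main, ref_822$tumor_normal)
)

table(pred$labels)
```

With more than two classes in play, a cell has to look more like "tumor" than any of the specific normal cell types to get the tumor label, which is what should help with the overfitting concern.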

jashapiro commented 3 months ago

> This same approach is being done for both methods and they both show bimodal distributions of marker genes.

We expect individual genes to have bimodal distributions (zero counts), but in the first example the distribution of the sums was also bimodal:

[Screenshot 2024-06-18 at 6:58:30 PM]

I might want to look at the distribution of the sums without the transformations, though: that was what I meant by comparing between the samples to find a cutoff. Zero was nice when it worked, but with a different balance of tumor/normal cells it is probably showing its limits, and we will have to go back to something more "raw."

> I also want to note that I did try using AUCell to classify these cells with each of the gene sets and because of the lack of bimodal distribution, that didn't work very well. Only a few cells were classified as tumor and there was no clear cutoff.

I'd definitely be interested to see some of your results with AUCell as well. (Note that your marker gene list is also a potential gene set to look at.) Again, I'm most curious whether you might be able to set AUC cutoffs from 822 that could be applied to other samples, without having to find a bimodal distribution of AUC in every dataset. I'm not sure how consistent to expect the cutoffs to be, but it would be worth a look.

> I wonder if we could combine a celldex reference with the tumor cells from SCPCL000822 to create a custom reference? This way we would have more than the two classes. This might avoid the overfitting problem and allow for SingleR to hopefully better distinguish between potential normal cell types and then tumor cells.

This is kind of what I was thinking, using the multiple reference functionality of SingleR. I kind of expect that you could keep the "Normal" label in the tumor/normal reference from 822, as it should not fit better than any of the more specific references, and could serve as a bit of a negative control. Or maybe you could use the SingleR calls for the 822 "normal" cells, but replace all of the tumor calls with the "tumor" label. Maybe try both.

allyhawkins commented 3 months ago

@jashapiro I did quite a bit of reorganization and made a bunch of changes based both on feedback here and in #537. Rather than seeking to create final annotations here, I am exploring different methods for classifying these cells as tumor cells that rely solely on gene expression.

SCPCL000824_tumor-gene-expression-exploration.html.zip

> I might want to look at the distribution of the sums without the transformations though: that was what I meant by comparing between the samples to find a cutoff. Zero was nice when it worked, but with a different balance of tumor/normal it is probably showing its limits, and we will have to go back to something more "raw."

Part of the notebook looks at the sum of the normalized gene expression counts for each marker gene and compares the distributions across both samples. I also used the information from 822 to define a cutoff to use for 824 since the distributions cover the same range of values but do not show the same bimodal distribution.

> I'd definitely be interested to see some of your results with AUCell as well. (Note that your marker gene list is also a potential gene set to look at.) Again, I'm most curious whether you might be able to set AUC cutoffs from 822 that could be applied to other samples, without having to find a bimodal distribution of AUC in every dataset. I'm not sure how consistent to expect the cutoffs to be, but it would be worth a look.

I added a section where I looked at AUCell with both gene sets and marker genes. I think it does a nice job in 822 (I confirmed by comparing the annotations to our validated annotations). Because of the lack of bimodal distribution in 824 it doesn't do a very good job, but I used the AUC cutoff from 822 to classify 824 and it looked much more like what I would expect given individual marker gene expression.
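The transfer of the 822 cutoff to 824 can be sketched like this (hypothetical object names; AUCell's documented workflow is to build per-cell rankings and then score gene sets against them):

```r
library(AUCell)

# Hypothetical objects: counts matrices for each library, and a named list
# of gene sets (the marker gene list can be included as one of the sets)
rankings_822 <- AUCell_buildRankings(counts_822)
auc_822 <- AUCell_calcAUC(gene_sets, rankings_822)

# Pick a cutoff from the bimodal 822 distribution, e.g. with
# AUCell_exploreThresholds() or by inspecting the histogram
auc_cutoff <- 0.2  # placeholder value, not from the actual analysis

# Apply the same cutoff to 824, where no bimodal distribution exists
rankings_824 <- AUCell_buildRankings(counts_824)
auc_824 <- AUCell_calcAUC(gene_sets, rankings_824)
tumor_824 <- getAUC(auc_824)["marker_genes", ] > auc_cutoff
```

The AUC is rank-based, so a cutoff learned on one library is at least plausibly portable to another, which is what makes this worth checking across samples.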

I also explored using UCell as recommended in https://github.com/AlexsLemonade/OpenScPCA-analysis/discussions/537#discussioncomment-9840032. Here we get a gene set signature score, similar to the score we are calculating on our own. With UCell you don't see quite the same clear bimodal distribution in the marker genes, but you do in 2 of the gene sets.

> This is kind of what I was thinking, using the multiple reference functionality of SingleR. I kind of expect that you could keep the "Normal" label in the tumor/normal reference from 822, as it should not fit better than any of the more specific references, and could serve as a bit of a negative control. Or maybe you could use the SingleR calls for the 822 "normal" cells, but replace all of the tumor calls with the "tumor" label. Maybe try both.

This notebook was getting quite long, so I'm going to file an issue to create a separate notebook that looks at using SingleR to classify these cells as a follow up to this notebook.

> I did have a couple of thoughts about file size issues too, before I go. One is that you could just change this to output `.tsv.gz` and things would shrink a lot. Alternatively you could write out files in some other format, like parquet or arrow. The disadvantage of all of these is that you lose the ability to easily diff the files, but for intermediate files that might be okay, and having them in the repo for easier review/inspection by outside users may be worth the tradeoff.

For right now, I am just saving this as a .gz file. I think you are right about having a larger discussion re: files that should be in the repo vs. not in the repo.

allyhawkins commented 3 months ago

I made a number of small adjustments to address your comments @jashapiro:

Another thing I did here was move some files based on our earlier conversation about not committing tables. I moved all the notebooks for validating tumor cells to exploratory_analysis/annotation_notebooks, and all TSV files are now in results/annotation_tables.

Updated notebook for review: SCPCL000824_tumor-gene-expression-exploration.html.zip

allyhawkins commented 2 months ago

@jashapiro I incorporated your text suggestions and made some minor additional edits based on your comments:

Updated notebook: SCPCL000824_tumor-gene-expression-exploration.html.zip