AlexsLemonade / OpenScPCA-analysis

An open, collaborative project to analyze data from the Single-cell Pediatric Cancer Atlas (ScPCA) Portal

Annotate tumor cells in SCPCS000492 #532

Closed allyhawkins closed 2 months ago

allyhawkins commented 3 months ago

Purpose/implementation Section

Please link to the GitHub issue that this pull request addresses.

Closes #480

What is the goal of this pull request?

Here I am adding an exploratory notebook that looks at the results from multiple tumor cell annotation methods (output from annotate-tumor-cells.sh) and aims to annotate the tumor cells in SCPCS000492. The output is a TSV file with annotations as either Tumor, Normal, or Ambiguous.

Briefly describe the general approach you took to achieve this goal.

This was not as straightforward as SCPCS000490, where we took the consensus between CNV methods and annotated those cells as tumor. Looking at the initial results from CopyKAT, InferCNV, and marker gene-based classification, we don't see much agreement in which cells are annotated as tumor. In particular, cells with high marker gene expression are not necessarily the ones with higher CNV proportions. Additionally, we don't see many cells with the CNVs expected in Ewing samples, and the distinction between what would be considered normal and tumor cells is not as clear.

Because of this I ended up looking at the gene set scores from the EWS-FLI1 gene sets and using those to classify cells as tumor and normal cells. I was more wary of the CNV results. The gene set scores appeared to mostly line up with the marker gene expression, so ultimately I classified cells as tumor if they had above the mean gene set score for all three gene sets and were considered tumor cells by marker gene analysis. This is probably on the stricter side.
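A minimal sketch of that classification rule (hypothetical data frame and column names; `geneset_1`–`geneset_3` stand in for the per-cell scores from the three EWS-FLI1 gene sets, and `marker_class` for the marker gene-based call):

```r
library(dplyr)

# celltype_df: one row per cell, with three gene set score columns and a
# marker-gene-based call of "Tumor" or "Normal"
classified_df <- celltype_df |>
  mutate(
    # cell must be above the mean score for all three gene sets
    above_mean = geneset_1 > mean(geneset_1) &
      geneset_2 > mean(geneset_2) &
      geneset_3 > mean(geneset_3),
    tumor_cell_classification = case_when(
      above_mean & marker_class == "Tumor" ~ "Tumor",
      !above_mean & marker_class == "Normal" ~ "Normal",
      TRUE ~ "Ambiguous"
    )
  )
```

Requiring agreement between both signals is what makes this a stricter call; cells where the two disagree fall into the `Ambiguous` bucket.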

I also tried using SingleR to classify these cells, with the annotations from SCPCS000490 as the reference. I then compared the SingleR annotations to simply setting cutoffs on gene set scores and marker genes. The SingleR annotations actually capture cells that are considered tumor by both gene expression and CNV methods. I tend to trust those annotations more, since the cutoffs are less arbitrary and they draw on information from a sample we feel much more confident in.

If known, do you anticipate filing additional pull requests to complete this analysis module?

Yes

Results

What is the name of your results bucket on S3?

Just noting that there are no results currently on S3, but the annotations file is ~250 KB, so it doesn't pass pre-commit. Maybe we should be saving those to S3 rather than trying to save them to the repo?

What types of results does your code produce (e.g., table, figure)?

The output is a TSV file with two classification columns - one obtained from SingleR and one from using marker gene and gene set classification. Both columns label cells as either tumor, normal, or ambiguous.

What is your summary of the results?

Generally, I think SingleR is the more reliable form of annotation here, but it is not without bias and relies on having a reference we feel really good about. Originally, I wanted to annotate multiple samples to use as a SingleR reference for the rest, but I now wonder whether, if the remaining samples don't show a clear consensus between CNV and marker gene methods, we should just use SingleR with the "good" sample.

Also, I tend to believe these samples will contain a high proportion of tumor cells and a low proportion of normal cells, so I expect the majority to be tumor cells, which is what we are seeing with SingleR.

Here's a full copy of the rendered notebook that includes commentary on the results:

SCPCL000824_tumor-cell-validation.html.zip

Provide directions for reviewers

What are the software and computational requirements needed to be able to run the code in this PR?

Should be able to render the notebook locally in R without any special requirements.

Are there particular areas you'd like reviewers to look at closely?

Generally, I struggled a lot with this sample since there was no clear consensus between methods. I do tend to trust the gene expression-based classification over the CNV methods. If there are other plots that would be helpful to show, please let me know.

Is there anything that you want to discuss further?

Just another note that I can't commit the output TSV file at this point. I could either add the TSV files to the results directory instead and sync to S3, or make an exception for TSV files in this directory in the pre-commit config file.

Author checklists

Analysis module and review

Reproducibility checklist

jashapiro commented 3 months ago

> What is the name of your results bucket on S3?
>
> Just noting that there are no results currently on S3, but the annotations file is ~250 KB so it doesn't pass pre-commit. So maybe we should be saving those in S3 rather than trying to save them to the repo?

I did have a couple of thoughts about file size issues too, before I go. One is that you could just change this to output `.tsv.gz`, and things would shrink a lot. Alternatively, you could write out files in some other format, like parquet or arrow. The disadvantage of all of these is that you lose the ability to easily diff the files, but for intermediate files that might be okay, and having them in the repo for easier review/inspection by outside users may be worth the tradeoff.
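For what it's worth, a sketch of both options (hypothetical data frame and file names; readr's `write_tsv()` compresses automatically when the output path ends in `.gz`):

```r
library(readr)

# Write directly to a gzipped TSV; readr infers the compression
# from the file extension
write_tsv(annotations_df, "SCPCL000824_annotations.tsv.gz")

# Alternative: a columnar format via the arrow package
# (smaller and faster to read, but not easily diff-able in git)
# arrow::write_parquet(annotations_df, "SCPCL000824_annotations.parquet")
```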

I think this is probably a larger issue that we want to have a more general discussion about: when do we want to keep outputs in the repo vs. on S3 & how much is it worth to use text files vs. other formats that may be much more efficient, while maintaining ease of use.

allyhawkins commented 3 months ago

> A quick alternative is to look more at the raw (well, normalized) marker gene scores or gene set scores: what were the cutoff values from the first dataset, and what happens if you apply that as the cutoff here? Do you get a more expected separation?
>
> I guess what I would do first is plot the marker gene and gene set expression scores from SCPCL000822 and SCPCL000824 together, to see where the SCPCL000824 scores fall relative to the clear bimodal scores from SCPCL000822. Once we have that, we may have a better sense of where to go next. Hopefully we won't have to go full mixture model, but we might...

I just want to note that for classifying cells using marker gene expression we are:

This same approach is being done for both methods and they both show bimodal distributions of marker genes.

SCPCL000822

[Screenshot 2024-06-18 at 4:48:43 PM]

SCPCL000824

[Screenshot 2024-06-18 at 4:49:25 PM]

As for the gene set scores, we never actually used them to classify cells in SCPCL000822; we only used them to validate that the gene set scores were indeed higher in tumor cells than in normal cells. You are correct that here the gene set scores do not show the bimodal distribution they did in SCPCL000822.

[image attachment]

So we didn't actually have any cutoffs where we could use the same value in both samples, but I did test taking the mean of each of the SCPCL000822 distributions and using those as the cutoffs (I also estimated sensible cutoffs from the plot, which are roughly equivalent to the means). When I do that, the results don't really change in terms of which cells get called as tumor cells.

This is an example plot from classifying using one of the gene sets:

[image attachment]

I also want to note that I did try using AUCell to classify these cells with each of the gene sets and because of the lack of bimodal distribution, that didn't work very well. Only a few cells were classified as tumor and there was no clear cutoff.

Generally, samples look more like this one where most cells are tumor cells so there probably won't be a clear bimodal distribution.

> On the other hand, I am not sure I am comfortable with using SingleR with only 2 classes: it seems likely that it will overfit those two groups. Not sure the immediate solution here, though we could perhaps see what happens with multiple references: one with just tumor-normal, one with more cell types from normal tissue?

I wonder if we could combine a celldex reference with the tumor cells from SCPCL000822 to create a custom reference? This way we would have more than the two classes. This might avoid the overfitting problem and allow for SingleR to hopefully better distinguish between potential normal cell types and then tumor cells.
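A rough sketch of what that combined reference could look like (hypothetical object names; SingleR accepts a list of references with a matching list of label vectors, and celldex provides ready-made normal tissue references):

```r
library(SingleR)
library(celldex)

# Hypothetical objects:
#   sce_824 - the query SingleCellExperiment for SCPCL000824 (log-normalized)
#   ref_822 - the annotated SCPCL000822 object, with a "tumor_normal"
#             column in its colData holding the validated calls
hpca <- celldex::HumanPrimaryCellAtlasData()

# Combine the celldex normal-tissue reference with the tumor/normal
# reference from SCPCL000822
pred <- SingleR(
  test = sce_824,
  ref = list(hpca, ref_822),
  labels = list(hpca$label.main, ref_822$tumor_normal)
)

table(pred$labels)
```

With more than two classes in play, a cell has to look more like "tumor" than any of the specific normal cell types to get the tumor label, which is what should help with the overfitting concern.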

jashapiro commented 3 months ago

> This same approach is being done for both methods and they both show bimodal distributions of marker genes.

We expect individual genes to have bimodal distributions (zero counts), but in the first example the distribution of the sums was also bimodal:

[Screenshot 2024-06-18 at 6:58:30 PM]

I might want to look at the distribution of the sums without the transformations, though: that was what I meant by comparing between the samples to find a cutoff. Zero was nice when it worked, but with a different balance of tumor/normal cells it is probably showing its limits, and we will have to go back to something more "raw."

> I also want to note that I did try using AUCell to classify these cells with each of the gene sets and because of the lack of bimodal distribution, that didn't work very well. Only a few cells were classified as tumor and there was no clear cutoff.

I'd definitely be interested to see some of your results with AUCell as well. (Note that your marker gene list is also a potential gene set to look at.) Again, I'm most curious whether you might be able to set AUC cutoffs from 822 that could be applied to other samples, without having to find a bimodal distribution of AUC in every dataset. I'm not sure how consistent to expect the cutoffs to be, but it would be worth a look.

> I wonder if we could combine a celldex reference with the tumor cells from SCPCL000822 to create a custom reference? This way we would have more than the two classes. This might avoid the overfitting problem and allow for SingleR to hopefully better distinguish between potential normal cell types and then tumor cells.

This is kind of what I was thinking, using the multiple reference functionality of SingleR. I kind of expect that you could keep the "Normal" label in the tumor/normal reference from 822, as it should not fit better than any of the more specific references, and could serve as a bit of a negative control. Or maybe you could use the SingleR calls for the 822 "normal" cells, but replace all of the tumor calls with the "tumor" label. Maybe try both.

allyhawkins commented 3 months ago

@jashapiro I did quite a bit of reorganization and made a bunch of changes based both on feedback here and in #537. Rather than seeking to create final annotations here, I am exploring different methods for classifying these cells as tumor cells that rely solely on gene expression.

SCPCL000824_tumor-gene-expression-exploration.html.zip

> I might want to look at the distribution of the sums without the transformations though: that was what I meant by comparing between the samples to find a cutoff. Zero was nice when it worked, but with a different balance of tumor/normal it is probably showing its limits, and we will have to go back to something more "raw."

Part of the notebook looks at the sum of the normalized gene expression counts for each marker gene and compares the distributions across both samples. I also used the information from 822 to define a cutoff to use for 824 since the distributions cover the same range of values but do not show the same bimodal distribution.

> I'd definitely be interested to see some of your results with AUCell as well. (Note that your marker gene list is also a potential gene set to look at.) Again, I'm most curious whether you might be able to set AUC cutoffs from 822 that could be applied to other samples, without having to find a bimodal distribution of AUC in every dataset. I'm not sure how consistent to expect the cutoffs to be, but it would be worth a look.

I added a section where I looked at AUCell with both gene sets and marker genes. I think it does a nice job in 822 (I confirmed by comparing the annotations to our validated annotations). Because of the lack of bimodal distribution in 824 it doesn't do a very good job, but I used the AUC cutoff from 822 to classify 824 and it looked much more like what I would expect given individual marker gene expression.
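The transfer of the 822 cutoff to 824 can be sketched like this (hypothetical object names; AUCell's documented workflow is to build per-cell rankings and then score gene sets against them):

```r
library(AUCell)

# Hypothetical objects: counts matrices for each library, and a named list
# of gene sets (the marker gene list can be included as one of the sets)
rankings_822 <- AUCell_buildRankings(counts_822)
auc_822 <- AUCell_calcAUC(gene_sets, rankings_822)

# Pick a cutoff from the bimodal 822 distribution, e.g. with
# AUCell_exploreThresholds() or by inspecting the histogram
auc_cutoff <- 0.2  # placeholder value, not from the actual analysis

# Apply the same cutoff to 824, where no bimodal distribution exists
rankings_824 <- AUCell_buildRankings(counts_824)
auc_824 <- AUCell_calcAUC(gene_sets, rankings_824)
tumor_824 <- getAUC(auc_824)["marker_genes", ] > auc_cutoff
```

The AUC is rank-based, so a cutoff learned on one library is at least plausibly portable to another, which is what makes this worth checking across samples.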

I also explored using UCell as recommended in https://github.com/AlexsLemonade/OpenScPCA-analysis/discussions/537#discussioncomment-9840032. Here we get a gene set signature score, similar to the score we are calculating on our own. With UCell you don't see quite the same clear bimodal distribution in the marker genes, but you do in 2 of the gene sets.

> This is kind of what I was thinking, using the multiple reference functionality of SingleR. I kind of expect that you could keep the "Normal" label in the tumor/normal reference from 822, as it should not fit better than any of the more specific references, and could serve as a bit of a negative control. Or maybe you could use the SingleR calls for the 822 "normal" cells, but replace all of the tumor calls with the "tumor" label. Maybe try both.

This notebook was getting quite long, so I'm going to file an issue to create a separate notebook that looks at using SingleR to classify these cells as a follow up to this notebook.

> I did have a couple of thoughts about file size issues too, before I go. One is that you could just change this to output `.tsv.gz` and things would shrink a lot. Alternatively you could write out files in some other format, like parquet or arrow. The disadvantage of all of these is that you lose the ability to easily diff the files, but for intermediate files that might be okay, and having them in the repo for easier review/inspection by outside users may be worth the tradeoff.

For right now, I am just saving this as a .gz file. I think you are right about having a larger discussion re: files that should be in the repo vs. not in the repo.

allyhawkins commented 3 months ago

I made a number of small adjustments to address your comments @jashapiro:

Another thing I did here was move some files based on our earlier conversation about not committing tables. I moved all the notebooks for validating tumor cells to exploratory_analysis/annotation_notebooks, and all TSV files are now in results/annotation_tables.

Updated notebook for review: SCPCL000824_tumor-gene-expression-exploration.html.zip

allyhawkins commented 2 months ago

@jashapiro I incorporated your text suggestions and made some minor additional edits based on your comments:

Updated notebook: SCPCL000824_tumor-gene-expression-exploration.html.zip