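If you are filing this issue based on a specific GitHub Discussion, please link to the relevant Discussion.

292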

Describe the goals of the changes to the analysis module.

Now that we have explored methods for annotating tumor cells in the Ewing's samples and extensively validated the tumor cell calls in two samples, SCPCS000490 and SCPCS000492, we would like to identify tumor cells in the remaining samples. We plan to apply what we learned from these two samples to annotate the rest.

In general we will need to complete the following steps:

Identify a method that we can apply to all samples and create a script to run it on all samples in the project. I am leaning towards using AUCell with a pre-defined threshold that we determined with SCPCL000822 (see #532).
Create a template report that will be used to evaluate these annotations. This will contain density plots showing the distributions of AUC values, marker gene expression, and gene set scores, along with heatmaps of marker gene expression and gene set scores, to confirm that tumor cells show increased gene set expression over all other cells.
Create a workflow that runs the script and generates the report for all samples in the project. This workflow should output a TSV file with the tumor/normal annotations and a report.
Run this workflow on all samples in the project and evaluate the reports. Here we should identify any samples that do not show clear separation in marker gene expression and gene set scores between tumor and normal cells, and evaluate them further.
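The thresholding and TSV output described in the steps above can be sketched as follows. This is a minimal Python illustration, not the module's actual script (AUCell itself is an R package). It assumes per-cell AUC values have already been computed, and the threshold value, barcodes, and column names are hypothetical placeholders.

```python
import csv
import io

# Hypothetical threshold; the real value would be the one determined
# from the reference library.
AUC_THRESHOLD = 0.25


def classify_cells(auc_scores, threshold=AUC_THRESHOLD):
    """Label a cell 'tumor' if its AUC is at or above the threshold, else 'normal'."""
    return {
        barcode: "tumor" if auc >= threshold else "normal"
        for barcode, auc in auc_scores.items()
    }


def write_annotations_tsv(auc_scores, labels, handle):
    """Write one row per cell: barcode, AUC value, and tumor/normal call."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["barcode", "auc", "auc_classification"])
    for barcode in sorted(auc_scores):
        writer.writerow([barcode, auc_scores[barcode], labels[barcode]])


# Made-up AUC values for three cells
example_auc = {"AAAC": 0.41, "CCGT": 0.08, "GGTA": 0.25}
example_labels = classify_cells(example_auc)

# Write to an in-memory buffer here; a workflow would write to a file instead
buffer = io.StringIO()
write_annotations_tsv(example_auc, example_labels, buffer)
print(buffer.getvalue())
```

Using one fixed threshold keeps calls comparable across samples, but it presumes AUC distributions are similar between libraries, which the reports are meant to check.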

Once this has been completed, we can move on to identifying the normal cell types present in these samples. For this I imagine running SingleR with BlueprintEncodeData as the reference on the normal cells and then checking that the expected markers for those cell types are expressed. We have not yet done any validation of specific normal cell types, which is why I think this will be a separate question we need to answer.
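One way to check that expected markers are present is to ask, for each marker set, which annotated group has the highest mean score. The sketch below is a hypothetical Python illustration of that check only. It assumes mean marker scores per annotated group have already been computed elsewhere, and the group names and values are made up.

```python
def check_marker_sets(mean_scores):
    """For each marker set, report the annotated group with the highest mean
    score and whether that group is the one the marker set is named after.

    mean_scores[group][marker_set] is the mean score of marker_set across
    the cells annotated as group.
    """
    marker_sets = sorted({m for scores in mean_scores.values() for m in scores})
    results = {}
    for marker_set in marker_sets:
        # Groups missing a score for this marker set are treated as 0.0
        top_group = max(
            mean_scores, key=lambda group: mean_scores[group].get(marker_set, 0.0)
        )
        results[marker_set] = (top_group, top_group == marker_set)
    return results


# Made-up mean scores: rows are annotated groups, columns are marker sets
example_means = {
    "endothelial": {"endothelial": 2.1, "immune": 0.2},
    "immune": {"endothelial": 0.3, "immune": 1.8},
    "tumor": {"endothelial": 0.1, "immune": 0.4},
}
marker_results = check_marker_sets(example_means)
for marker_set, (top_group, matches) in marker_results.items():
    print(marker_set, top_group, matches)
```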

What will your pull request contain?

I plan on filing an issue to address each of the steps in the above analysis proposition. Each of these issues will correspond to one PR. This issue will be closed once all of those steps have been completed.

Will you require additional software beyond what is already in the analysis module?

No, we should have everything already set up.

Will you require different computational resources beyond what the analysis module already uses?

No.

If known, when do you expect to file the pull request?

No response

allyhawkins commented 1 month ago

Based on the findings from running AUCell, we will need to take some additional steps to get final tumor annotations (see https://github.com/AlexsLemonade/OpenScPCA-analysis/issues/567#issuecomment-2225778349 for more context). I am updating this issue with the next steps that we plan to take as of now:

Refine tumor cell annotations for SCPCS000752 #606
Create a reference to use for SingleR #607
Create a script to run SingleR using that reference on all remaining samples
Generate a template report for evaluating SingleR results
Add to existing AUCell workflow or create a new workflow just for running SingleR

allyhawkins commented 1 month ago

Using the workflow added in #659, I have completed annotating all samples using SingleR. Below is a general summary of the results. I've also included zip files of all the reports that were generated:

singler_reports_1.zip singler_reports_2.zip

Non-PDX libraries (822, 824, 825, 826, and 828):

Generally, more cells are called as tumor cells than with marker-gene-based methods or AUCell alone.
We still see small groups of normal cells, with 822 and 824 having the most normal cells.
The tumor cells consistently show higher expression of tumor marker genes and gene sets over all other cells.
The normal cell markers for endothelial, immune, and mesenchymal-like are highest in the cell types that are annotated in those groups, which is promising.
Although we do see normal cell types, there are some instances where I would anticipate clusters containing a mix of normal and tumor cell types, based only on where the cells sit on the UMAP. However, that's hypothetical, and we won't know until we do some clustering.
There are occasions where the tumor cells don't have any expression of marker genes from our tumor marker gene list (seen in 824, 826, and 828). However, they do have higher gene set scores, so perhaps those cells are still tumor cells but don't express our specific markers. Alternatively, they may be incorrectly identified as tumor cells because their gene expression resembles that of other tumor cells more than that of any normal cells in the reference.
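The last observation suggests a simple filter for cells worth a second look: tumor-annotated cells with no tumor marker expression whose gene set score is still elevated. The Python sketch below is hypothetical. The field names, the choice of the median non-tumor score as the cutoff, and the example values are all assumptions made for illustration.

```python
from statistics import median


def flag_marker_negative_tumor_cells(cells):
    """Return barcodes of tumor-annotated cells that have zero summed marker
    expression but a gene set score above the median of non-tumor cells."""
    non_tumor_scores = [
        cell["gene_set_score"] for cell in cells if cell["annotation"] != "tumor"
    ]
    if not non_tumor_scores:
        # No non-tumor cells to compare against, so nothing can be flagged
        return []
    cutoff = median(non_tumor_scores)
    return [
        cell["barcode"]
        for cell in cells
        if cell["annotation"] == "tumor"
        and cell["marker_sum"] == 0
        and cell["gene_set_score"] > cutoff
    ]


# Made-up cells for illustration
example_cells = [
    {"barcode": "c1", "annotation": "tumor", "marker_sum": 5.2, "gene_set_score": 0.40},
    {"barcode": "c2", "annotation": "tumor", "marker_sum": 0.0, "gene_set_score": 0.35},
    {"barcode": "c3", "annotation": "endothelial", "marker_sum": 0.0, "gene_set_score": 0.05},
    {"barcode": "c4", "annotation": "immune", "marker_sum": 0.3, "gene_set_score": 0.10},
    {"barcode": "c5", "annotation": "tumor", "marker_sum": 0.0, "gene_set_score": 0.04},
]
flagged = flag_marker_negative_tumor_cells(example_cells)
print(flagged)
```

Cells like c5 in this toy data, with neither marker expression nor an elevated gene set score, are the ones most likely to be misassigned and would need a separate check.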

PDX libraries (823, 1112, 1113, 1114, 1115, 1116):

Pretty much everything is a tumor cell, and there's no clear separation in marker gene expression or gene set scores between tumor and normal cells.
If normal cells are annotated, it's maybe one type, but in some cases no normal cell types are found. I think this could be in part due to using a human normal reference rather than a mouse one.
I plan on re-running these samples using the mouse normal reference mentioned in #666.

827:

This was the one non-PDX library that actually looked a little more like a PDX sample, where every cell was tumor other than a handful of normal cells.
There was no clear difference in marker gene expression or gene set scores between tumor and normal, so I think this might be a case where we do in fact have all tumor cells.

1111:

This library is really poor quality and only has a few cells. Marker gene expression is also very low in these cells compared to other libraries. Most cells are identified as tumor, but generally I don't really trust anything in this sample.

Looking at these reports, I think we have a good starting point for annotations, and next steps should include refining the annotations obtained here (assuming we re-run the PDX samples). To refine these annotations, I think we want to start with clustering. We should obtain clusters we feel good about and then look at expression of the marker gene lists across those clusters. I would anticipate that tumor cells will cluster separately from normal cells and that normal cell clusters will show higher expression of the normal cell markers than tumor cell clusters, and vice versa. Additionally, we want to be able to annotate tumor cell subpopulations, which I think should be done by looking at clusters of tumor cells.
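A per-cluster summary along these lines could look like the following. This is a hypothetical Python sketch that assumes cluster assignments and per-cell marker scores already exist. The field names, cluster labels, and values are made up.

```python
from collections import defaultdict


def summarize_clusters(cells):
    """Summarize each cluster by size, fraction of tumor-annotated cells,
    and mean tumor and normal marker scores."""
    grouped = defaultdict(list)
    for cell in cells:
        grouped[cell["cluster"]].append(cell)

    summary = {}
    for cluster, members in grouped.items():
        n_cells = len(members)
        summary[cluster] = {
            "n_cells": n_cells,
            "tumor_fraction": sum(c["annotation"] == "tumor" for c in members) / n_cells,
            "mean_tumor_markers": sum(c["tumor_marker_score"] for c in members) / n_cells,
            "mean_normal_markers": sum(c["normal_marker_score"] for c in members) / n_cells,
        }
    return summary


# Made-up cells in two clusters
example_clustered = [
    {"cluster": "1", "annotation": "tumor", "tumor_marker_score": 2.0, "normal_marker_score": 0.0},
    {"cluster": "1", "annotation": "tumor", "tumor_marker_score": 4.0, "normal_marker_score": 1.0},
    {"cluster": "2", "annotation": "tumor", "tumor_marker_score": 1.0, "normal_marker_score": 2.0},
    {"cluster": "2", "annotation": "endothelial", "tumor_marker_score": 0.0, "normal_marker_score": 4.0},
]
cluster_summary = summarize_clusters(example_clustered)
for cluster, stats in sorted(cluster_summary.items()):
    print(cluster, stats)
```

A cluster like "2" in this toy data, with a mixed tumor fraction and high normal marker scores, is the kind of case where cluster-level reassignment might be considered.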

The only other thought I had was that we may be assigning more tumor cells here because the cells are more similar to other Ewing's tumor cells than to any normal cell types in the HPCA/Blueprint references. It could be helpful to compare the annotations obtained using only tumor cells from the Ewing's samples in the reference with those obtained using a reference that contains both normal and tumor cells from Ewing's samples, and see if that shifts any of the tumor cell assignments. In particular, both 822 and 824 have clear groups of normal cells, so we could build the reference from those two samples rather than from all of the tumor cells from all samples. This would mean cells could match either endothelial cells from HPCA or Blueprint or the endothelial cells in 822. I think this is worth doing, but probably after refining those two samples by clustering and assigning cell types to all cells in each cluster.
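Comparing the two sets of annotations amounts to tabulating how each cell's label changes between references. The Python sketch below is a hypothetical illustration. It assumes each annotation run is available as a barcode-to-label mapping, and the labels and barcodes are made up.

```python
from collections import Counter


def compare_annotations(first, second):
    """Count label transitions between two annotation runs and list the
    barcodes whose label changed. Only barcodes present in both are compared."""
    shared = sorted(set(first) & set(second))
    transitions = Counter((first[barcode], second[barcode]) for barcode in shared)
    changed = [barcode for barcode in shared if first[barcode] != second[barcode]]
    return transitions, changed


# Made-up labels: tumor-only Ewing's reference vs. tumor plus normal reference
tumor_only_ref = {"c1": "tumor", "c2": "tumor", "c3": "endothelial cell"}
tumor_and_normal_ref = {"c1": "tumor", "c2": "endothelial cell", "c3": "endothelial cell"}

transitions, changed = compare_annotations(tumor_only_ref, tumor_and_normal_ref)
for (before, after), count in sorted(transitions.items()):
    print(f"{before} -> {after}: {count}")
print("changed:", changed)
```

Off-diagonal transitions from tumor to a normal cell type would indicate cells whose tumor call depended on the absence of matched normal cells in the reference.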

Tagging @jashapiro in case you would like to see the current reports or have any thoughts. For now, I'm going to file these last two thoughts as issues for potential next steps.

AlexsLemonade / OpenScPCA-analysis

Annotate tumor cells in remaining samples for SCPCP000015 #563
