AlexsLemonade / OpenScPCA-analysis

An open, collaborative project to analyze data from the Single-cell Pediatric Cancer Atlas (ScPCA) Portal

Annotate tumor cells in remaining samples for SCPCP000015 #563

Open allyhawkins opened 2 months ago

allyhawkins commented 2 months ago

If you are filing this issue based on a specific GitHub Discussion, please link to the relevant Discussion.

292

Describe the goals of the changes to the analysis module.

Now that we have spent some time exploring methods for annotating tumor cells in the Ewing's samples and have done a lot of validation of tumor cells in two samples, SCPCS000490 and SCPCS000492, we would like to identify tumor cells in the remaining samples. We plan to apply what we learned from these two samples to annotate tumor cells in the rest of the samples.

In general we will need to complete the following steps:

  1. Identify a method that we can apply to all samples and create a script to run that on all samples in the project. I am leaning towards using AUCell for this with a pre-defined threshold that we determined with SCPCL000822 (see #532).
  2. Create a template report that will be used to evaluate these annotations. This will contain density plots showing the AUC values, marker gene expression, and gene set score distributions, along with heatmaps of the marker gene and gene set scores, to confirm that tumor cells show increased gene set expression relative to all other cells.
  3. Create a workflow that can be used to run both the script and generate the report for all samples in the project. This workflow should output a TSV file with the tumor/normal annotations and then a report.
  4. Run this workflow on all samples in the project and evaluate the reports. Here we should identify any samples where the marker gene and gene set scores do not show clear separation and evaluate those further.
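As a rough illustration of steps 1 and 3, here is a minimal sketch of scoring each cell against a gene set, applying a fixed threshold, and writing the tumor/normal calls to a TSV. It is a simplified, AUCell-like calculation written in plain Python, not the AUCell package itself (which is an R package and handles ranking ties and normalization differently). The function names, gene names, `max_rank`, and the threshold value are all hypothetical placeholders.

```python
import csv
import io


def auc_score(expression, gene_set, max_rank):
    """AUCell-like score for one cell: normalized area under the
    gene set recovery curve over the top `max_rank` ranked genes."""
    # Rank genes from highest to lowest expression. Ties are broken by gene
    # name for determinism (an assumption of this sketch).
    ranked = sorted(expression, key=lambda gene: (-expression[gene], gene))
    hits = 0
    area = 0
    for gene in ranked[:max_rank]:
        if gene in gene_set:
            hits += 1
        area += hits  # height of the recovery curve at this rank
    # Best possible area: every gene set gene sits at the top of the ranking.
    n_top = min(max_rank, len(ranked))
    k = min(len(gene_set), n_top)
    max_area = k * (k + 1) // 2 + k * (n_top - k)
    return area / max_area if max_area else 0.0


def annotate_cells(cells, gene_set, threshold, max_rank):
    """Return (barcode, auc, annotation) rows using a pre-defined threshold."""
    rows = []
    for barcode, expression in cells.items():
        auc = auc_score(expression, gene_set, max_rank)
        rows.append((barcode, auc, "tumor" if auc >= threshold else "normal"))
    return rows


def to_tsv(rows):
    """Serialize annotation rows as tab-separated text with a header."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(["barcode", "auc", "annotation"])
    writer.writerows(rows)
    return buffer.getvalue()


# Toy input: placeholder genes and counts, not real data.
cells = {
    "cell1": {"GENE_A": 10, "GENE_B": 5, "GENE_C": 1, "GENE_D": 0},
    "cell2": {"GENE_A": 0, "GENE_B": 0, "GENE_C": 9, "GENE_D": 8},
}
rows = annotate_cells(cells, {"GENE_A", "GENE_B"}, threshold=0.5, max_rank=2)
print(to_tsv(rows))
```

In a real workflow the threshold would be the one determined from SCPCL000822, and the same TSV would feed the template report.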

Once this has been completed, we can move on to identifying the normal cell types that are present in these samples. For this, I imagine running SingleR with BlueprintEncodeData as the reference on the normal cells and then checking that the expected markers for those cell types are present. We have not yet done any validation of specific normal cell types, which is why I think this will be a separate question we need to answer.

What will your pull request contain?

I plan on filing an issue to address each of the steps in the above analysis proposition. Each of these issues will correspond to one PR. This issue will be closed once all of those steps have been completed.

Will you require additional software beyond what is already in the analysis module?

No, we should have everything already set up.

Will you require different computational resources beyond what the analysis module already uses?

No.

If known, when do you expect to file the pull request?

No response

allyhawkins commented 1 month ago

Based on the findings from running AUCell, we are going to have to take some additional steps to get final tumor annotations (see https://github.com/AlexsLemonade/OpenScPCA-analysis/issues/567#issuecomment-2225778349 for some more context). I am updating this issue with the next steps that we plan to take as of now:

allyhawkins commented 1 month ago

Using the workflow added in #659, I have completed annotating all samples using SingleR. I'm including here a general summary of the results. I've also included zip files of all the reports that were generated:

singler_reports_1.zip singler_reports_2.zip

Non-PDX libraries (822, 824, 825, 826, and 828):

PDX libraries (823, 1112, 1113, 1114, 1115, 1116):

827:

1111:

In looking at these reports, I actually think we have a good starting point for annotations, and next steps should include refining the annotations obtained here (given we re-run the PDX samples). To refine these annotations, I think we want to start with clustering. We should obtain clusters we feel good about and then look at expression of the marker gene lists across those clusters. I would anticipate that tumor cells will cluster separately from normal cells, that normal cell clusters will show higher expression of the normal cell markers than tumor cell clusters, and vice versa. Additionally, we want to be able to annotate tumor cell subpopulations, which I think should be done by looking at clusters of tumor cells.
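A minimal sketch of the cluster-level check described above: given a cluster assignment and per-cell tumor and normal marker scores, average the scores within each cluster and see which set dominates. The function name, input layout, and toy values are hypothetical, and the direct comparison assumes the two scores are on a comparable scale (for example, both z-scored).

```python
from collections import defaultdict
from statistics import mean


def summarize_clusters(cluster_of, tumor_score, normal_score):
    """Average tumor and normal marker scores per cluster and make a call."""
    cells_by_cluster = defaultdict(list)
    for cell, cluster in cluster_of.items():
        cells_by_cluster[cluster].append(cell)

    summary = {}
    for cluster, cells in sorted(cells_by_cluster.items()):
        tumor_mean = mean(tumor_score[cell] for cell in cells)
        normal_mean = mean(normal_score[cell] for cell in cells)
        summary[cluster] = {
            "tumor_mean": tumor_mean,
            "normal_mean": normal_mean,
            # Assumption: scores are comparable, so the larger mean wins.
            "call": "tumor" if tumor_mean > normal_mean else "normal",
        }
    return summary


# Toy input: placeholder cells, clusters, and scores.
cluster_of = {"a": "1", "b": "1", "c": "2", "d": "2"}
tumor_score = {"a": 2.0, "b": 4.0, "c": 0.0, "d": 1.0}
normal_score = {"a": 0.5, "b": 0.5, "c": 3.0, "d": 5.0}

for cluster, stats in summarize_clusters(cluster_of, tumor_score, normal_score).items():
    print(cluster, stats)
```

Clusters where the two means are close would be the ones to inspect by hand before assigning a label to every cell in the cluster.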

The only other thought I had was that we may be assigning more tumor cells here because the cells are more similar to other Ewing's tumor cells than to any normal cell types in the HPCA/Blueprint references. It could be helpful to compare the annotations obtained with a reference that contains only tumor cells against those obtained with a reference that contains both normal and tumor cells from Ewing's samples, and see if that shifts any of the tumor cell assignments. In particular, both 822 and 824 have clear groups of normal cells, so we could create a reference from both of those samples rather than from all of the tumor cells from all samples. This would mean cells could match either endothelial cells from HPCA or Blueprint or the endothelial cells in 822. I think this is worth doing, but probably after doing some refinement on those two samples by clustering and assigning cell types to all cells in each cluster.
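To make the combined-reference idea concrete, here is a minimal sketch of correlation-based label assignment where a cell can match a tumor profile, a public normal profile, or an in-sample normal profile. It only mimics the core idea behind SingleR (Spearman correlation against reference profiles) and leaves out marker selection, per-label score quantiles, and fine-tuning. All names and expression values are hypothetical placeholders, and the sketch assumes no profile is constant across genes.

```python
def ranks(values):
    """Ranks starting at 1, with tied values given their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        average_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            result[order[position]] = average_rank
        start = end + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    rank_x, rank_y = ranks(x), ranks(y)
    mean_x = sum(rank_x) / len(rank_x)
    mean_y = sum(rank_y) / len(rank_y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(rank_x, rank_y))
    var_x = sum((a - mean_x) ** 2 for a in rank_x)
    var_y = sum((b - mean_y) ** 2 for b in rank_y)
    return covariance / (var_x * var_y) ** 0.5


def assign_label(cell, reference):
    """Pick the reference label whose profile correlates best with the cell."""
    return max(reference, key=lambda label: spearman(cell, reference[label]))


# Toy profiles over the same four placeholder genes.
tumor_reference = {"tumor": [9, 1, 2, 8]}
public_reference = {"endothelial_public": [1, 5, 3, 2]}
sample_reference = {"endothelial_sample": [2, 9, 4, 1]}
combined = {**tumor_reference, **public_reference, **sample_reference}

print(assign_label([8, 0, 1, 9], combined))
print(assign_label([0, 7, 4, 3], combined))
```

Running the same cells against `tumor_reference` plus `public_reference` only, and then against `combined`, would show whether adding the in-sample normal cells shifts any assignments.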

Tagging @jashapiro in case you would like to see the current reports or have any thoughts. For now, I'm going to file these two last thoughts as issues for potential next steps.