04_annotation_Across_Samples_exploration

maud-p commented 1 month ago

Purpose/implementation Section

Please link to the GitHub issue that this pull request addresses.

https://github.com/AlexsLemonade/OpenScPCA-analysis/issues/774

This PR follows the discussion in PR #750, especially: https://github.com/AlexsLemonade/OpenScPCA-analysis/pull/750#pullrequestreview-2310191830

What is the goal of this pull request?

Briefly describe the general approach you took to achieve this goal.

In this PR, I am adding one notebook in /notebook/04_annotation_Across_Samples_exploration.Rmd to explore the annotations and label transfers for all of the samples in SCPCP000006.

We integrated all the samples from SCPCP000006 to get a quick, global view of the label transfer. Please note that the integration itself is not the aim of this PR; it is just a convenient way to display genes and features across samples.

To explore the label transfer results, we look at some marker genes, and at tables and percentages of cells in each annotation group (from the label transfers).

If known, do you anticipate filing additional pull requests to complete this analysis module?

Yes, the next step would be to run copyKAT.

Results

This PR does not contain any results, only a single notebook.

What types of results does your code produce (e.g., table, figure)?

One notebook that explores the clustering and label transfer results for all samples at once.

What is your summary of the results?

85.4925176 % of the cells are labeled as kidney cells (fetal full reference, looking at fetal_full_predicted.organ). I think this is quite a nice result. From the UMAP and barplot, I think that most of the cells that are not labeled as kidney are endothelial or immune cells. (While writing this, I think it would be good to add a table of fetal_full_predicted.organ and fetal_kidney_predicted.compartment, maybe in the next round after your review!)
0.8660364 % of the cells are labeled as immune cells and 0.8226098 % of the cells are labeled as endothelial cells. As Wilms tumor is known to be a cold (immune-excluded) tumor, and COG Wilms tumor samples are mostly not pre-treated, having very few immune cells is quite expected. Whether having so few cells as a reference is a problem for running copyKAT, I honestly have no idea.
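
As a rough illustration of the table mentioned above, here is a minimal base R sketch. The two column names are the ones used in this PR; the data frame `cell_metadata` and all label values in it are invented toy data, not results from SCPCP000006.

```r
# Toy per-cell metadata: values are invented for illustration only.
# Column names follow the ones mentioned in this PR.
cell_metadata <- data.frame(
  fetal_full_predicted.organ = c(
    "Kidney", "Kidney", "Kidney", "Adrenal", "Kidney", "Muscle"
  ),
  fetal_kidney_predicted.compartment = c(
    "fetal_nephron", "fetal_nephron", "stroma", "stroma", "immune", "stroma"
  )
)

# Cross-table of cell counts: predicted organ (rows) by compartment (columns)
organ_by_compartment <- table(
  organ = cell_metadata$fetal_full_predicted.organ,
  compartment = cell_metadata$fetal_kidney_predicted.compartment
)
print(organ_by_compartment)

# Percentage of all cells assigned to each predicted organ
pct_by_organ <- 100 * prop.table(table(cell_metadata$fetal_full_predicted.organ))
print(round(pct_by_organ, 1))
```

In the real notebook, the same two calls would presumably be applied to the metadata extracted from the integrated object rather than to a hand-made data frame.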

Provide directions for reviewers

What are the software and computational requirements needed to be able to run the code in this PR?

I updated the renv.lock file; otherwise there are no specific changes since the last PR :)

Are there particular areas you'd like reviewers to have a close look at?

Is there anything that you want to discuss further?

What about the next step: how do you think we should run copyKAT/inferCNV? I think I should start with copyKAT, run it with and without a reference of normal cells, and try to evaluate the impact of the normal cell annotation on the inferred CNV.

Author checklists

Analysis module and review

[x] This analysis module uses the analysis template and has the expected directory structure.
[x] The analysis module README.md has been updated to reflect code changes in this pull request.
[x] The analytical code is documented and contains comments.
[ ] Any results and/or plots this code produces have been added to your S3 bucket for review.

Reproducibility checklist

[x] Code in this pull request has been added to the GitHub Action workflow that runs this module.
[x] The dependencies required to run the code in this pull request have been added to the analysis module Dockerfile.
[ ] If applicable, the dependencies required to run the code in this pull request have been added to the analysis module conda environment.yml file.
[x] If applicable, R package dependencies required to run the code in this pull request have been added to the analysis module renv.lock file.

maud-p commented 1 month ago

Hi @sjspielman, thank you for the rapid feedback 🏎️! I just re-uploaded the results to the S3 bucket. There might have been a bug during the transfer to S3, because my local results directory does contain the 3 rds objects per sample. I'll plot the different tables and come back to you. Thanks for the few lines of code to extract the metadata :)

maud-p commented 1 month ago

@sjspielman I modified the notebook based on your suggestion. It is quite simple now, but let me know if you can think of additional plots and checks that would be useful :)

Thank you again for your help!

maud-p commented 1 month ago

Dear @sjspielman, thank you so much for the review and the suggestions, including the code! It was really useful, as usual!

I should have made all suggested changes :D

One additional table that I would like to discuss is the summary, for each compartment, of the percentage of cells that match the kidney annotation or not.

The majority of fetal nephron cells (92%) have been predicted as kidney. However, cells in the other compartments (stroma, endothelial and immune) are mostly not predicted as kidney cells. In my opinion this is not a concern and shouldn't be interpreted as a poor label transfer or annotation! I'll try to explain why.

For the stroma compartment: it is known that Wilms tumor stroma (sometimes) shows (unexpected) differentiation into cell types such as skeletal muscle cells, fat tissue, cartilage, bone and even glial cells [1-2]. We also saw it on H&E staining of Wilms tumor biopsies. For that reason, I am not surprised that most stroma cells are not predicted as kidney cells.
For the immune compartment: it wouldn't be surprising that cancer cells, and/or treatment, modulate the immune microenvironment, via the attraction of immune cells that are not usually in the kidney and/or induction of a cancer-associated phenotype.
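
For reference, a per-compartment summary of this kind could be computed roughly as follows. This is a sketch on invented toy data: the vectors `compartment` and `organ` and all values in them are placeholders, and only the idea of row-wise percentages comes from the discussion above.

```r
# Toy data: one entry per cell, values invented for illustration only
compartment <- c(
  "fetal_nephron", "fetal_nephron", "fetal_nephron", "fetal_nephron",
  "stroma", "stroma", "stroma", "immune", "immune", "endothelium"
)
organ <- c(
  "Kidney", "Kidney", "Kidney", "Liver",
  "Muscle", "Kidney", "Muscle", "Spleen", "Spleen", "Kidney"
)

# Collapse the predicted organ into kidney vs. anything else
is_kidney <- ifelse(organ == "Kidney", "kidney", "not_kidney")

# Row-wise percentages: within each compartment, the share of cells
# predicted as kidney or not (margin = 1 normalises each row to 100)
pct_kidney_by_compartment <- 100 * prop.table(
  table(compartment = compartment, prediction = is_kidney),
  margin = 1
)
print(round(pct_kidney_by_compartment, 1))
```

Using `margin = 1` means each compartment is normalised separately, so a small compartment such as immune is not swamped by the much larger nephron compartment.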

Thank you again for your help!

sjspielman commented 1 month ago

Ah, one more thing I forgot!

We should add rendering this notebook to the 00_run_workflow.R script. This step should be after/outside the for loop (since we only knit this once, not for each sample), and only run if we are not testing (since the input files for this notebook aren't generated in testing).
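
A minimal sketch of that structure, assuming the notebook path given in the PR description. The function name `run_workflow`, the `testing` argument, and the way the flag is set are hypothetical; the real 00_run_workflow.R may detect test mode differently.

```r
# Sketch only: `run_workflow()` and its `testing` flag are hypothetical names.
run_workflow <- function(sample_ids, testing) {
  for (sample_id in sample_ids) {
    # Per-sample steps (e.g. rendering the per-sample notebooks) go here,
    # once for each sample.
    message("Processing sample ", sample_id)
  }

  # After/outside the loop: the across-samples notebook is knit only once.
  # It is skipped when testing, because its input files are not generated
  # in that case.
  if (!testing) {
    rmarkdown::render(
      input = file.path(
        "notebook", "04_annotation_Across_Samples_exploration.Rmd"
      ),
      output_format = "html_document"
    )
  }

  # Report (invisibly) whether the across-samples notebook was rendered
  invisible(!testing)
}

# Hypothetical usage, reading the flag from an environment variable:
# testing <- as.logical(Sys.getenv("TESTING", unset = "FALSE"))
# run_workflow(c("SCPCS000179", "SCPCS000184"), testing = testing)
```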

maud-p commented 1 month ago

I ended up selecting the following samples:

sample SCPCS000194 has > 85 % of cells predicted as kidney and 234 + 83 endothelium and immune cells.
sample SCPCS000179 has > 94 % of cells predicted as kidney and 25 + 111 endothelium and immune cells.
sample SCPCS000184 has > 96 % of cells predicted as kidney and 39 + 70 endothelium and immune cells.
sample SCPCS000205 has > 89 % of cells predicted as kidney and 92 + 76 endothelium and immune cells.
sample SCPCS0000208 has > 95 % of cells predicted as kidney and 18 + 35 endothelium and immune cells.

I tried to enrich for samples having >100 normal cells, without lowering the percentage of cells predicted as kidney too much. Would it be OK like this?
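
The selection criterion described above can be checked with a few lines of R. This is only a sketch: the counts are copied from the list above, the kidney percentages are the lower bounds quoted there, and two things are assumptions: that the first number of each pair is endothelium and the second immune, and that both count as "normal" cells.

```r
# Values copied from the list above; sample IDs are written as in the thread.
selected <- data.frame(
  sample_id = c(
    "SCPCS000194", "SCPCS000179", "SCPCS000184", "SCPCS000205", "SCPCS0000208"
  ),
  min_pct_kidney = c(85, 94, 96, 89, 95),  # lower bounds ("> 85 %", ...)
  n_endothelium = c(234, 25, 39, 92, 18),  # assumed: first number of each pair
  n_immune = c(83, 111, 70, 76, 35)        # assumed: second number of each pair
)

# Assumption: "normal" cells are endothelium plus immune cells
selected$n_normal <- selected$n_endothelium + selected$n_immune

# Flag samples that meet the ">100 normal cells" target
selected$over_100_normal <- selected$n_normal > 100

print(selected)
```

With these numbers, four of the five samples have more than 100 endothelium plus immune cells; the last one in the list has 53.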

Thank you again for your help!!

maud-p commented 1 month ago

Great, thank you very much!

Thank you very much for letting me know about your days off. I might then take the time to compare a few different methods (copyKAT +/- reference, inferCNV +/- reference) on these few samples!

For the workflow, did I understand correctly that it would be best to have:

R scripts for steps that are building the final object
notebooks for reports and results explorations?

Should I try in another PR to split the first 3 notebooks into scripts + notebook?

Thank you!

sjspielman commented 1 month ago

For the workflow, did I understand correctly that it would be best to have:

R scripts for steps that are building the final object

notebooks for reports and results explorations?

Yes, this is the idea. Notebooks are generally the better option for steps that are exploratory or interactive in some way - e.g. making tables and plots. Scripts are often the better option for running an analysis that you plan to explore in a notebook. For example, in the doublet-detection module I wrote, you can see that (for initial benchmarking steps) I used a script to detect doublets, and then a notebook to explore the doublet results. The parallel here would be to run copyKAT in a script which would save the copyKAT results as TSV files (this output doesn't need to be the whole Seurat object - we can save a little storage space with TSV instead!), and then explore those results in a notebook, which might also be a template notebook that looks at one sample at a time with params.

That said, this is not a strict rule - you can still use a notebook to run copyKAT if you feel more comfortable with that approach!
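
To make the script/notebook split concrete, here is a minimal base R sketch of the hand-off through a TSV file. It does not call copyKAT: the data frame, its column names (`cell_barcode`, `copykat_pred`), the file name, and the use of a temporary directory are all placeholders for illustration.

```r
## --- In the script: save only the per-cell results, not the Seurat object ---

# Placeholder results table; a real run would build this from copyKAT output
copykat_results <- data.frame(
  cell_barcode = c("AAAC", "AAAG", "AACT"),
  copykat_pred = c("aneuploid", "diploid", "aneuploid")
)

# Hypothetical output path; a temporary directory keeps the sketch runnable
sample_id <- "SCPCS000179"
tsv_file <- file.path(tempdir(), paste0(sample_id, "_copykat.tsv"))

write.table(
  copykat_results,
  file = tsv_file,
  sep = "\t",
  quote = FALSE,
  row.names = FALSE
)

## --- In the notebook: read the TSV back and explore it ---

# In a template notebook, `sample_id` could come from `params$sample_id`
results <- read.delim(tsv_file)
print(table(results$copykat_pred))
```

The TSV holds one small row per cell, which is why it takes much less storage than saving the whole object again after each step.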

sjspielman commented 1 month ago

Should I try in another PR to split the first 3 notebooks into scripts + notebook?

Don't worry about this at all! The code you have written so far is completely fine. Again, not a strict rule :)
