Wilms tumor 06- clustering exploration

maud-p commented 3 weeks ago

Purpose/implementation Section

Please link to the GitHub issue that this pull request addresses.

This is follow-up work to PR #704, taking into account the changes in PR #737.

What is the goal of this pull request?

The aim here is to explore the clustering and label transfer from the 2 fetal references for each sample.

Briefly describe the general approach you took to achieve this goal.

Here I started from the output of the notebook 02b_label-transfer_fetal_kidney_reference_Stewart.Rmd that contains:

normalisation with SCTransform,
dimensionality reduction (PCA, UMAP),
clustering, and
label transfer from the 2 fetal references

and explored the results looking at:

marker genes (from a list of known marker genes and from differential expression analysis, as a more "discovery" approach)
enriched pathways (from a list of known marker genes and from differential expression analysis, as a more "discovery" approach)

I compared the labels obtained from SingleR, CellAssign and the label transfer from the two fetal references (PR #737).

If known, do you anticipate filing additional pull requests to complete this analysis module?

Yes! More than one. I think that from this analysis, I can find a way to annotate healthy cells such as "immune" and "endothelial cells". From here, I will be able to file a new PR to add inferCNV and/or copyKAT to the template.

Results

The notebook template produces one notebook per sample in the notebook/{sample_id} folder. I have now uploaded the notebooks for the first 2 samples. Once we have discussed the analysis, I'll run it for all 40 samples and add the notebooks!

What is the name of your results bucket on S3?

What types of results does your code produce (e.g., table, figure)?

notebook

What is your summary of the results?

Provide directions for reviewers

What do you think?

What are the software and computational requirements needed to be able to run the code in this PR?

I render the notebook from the 00_run_workflow.R script. I opened a new loop on purpose so as not to run everything from PR #737 again, but I guess that in the final version all the notebooks will be run in the same loop!

Are there particular areas you'd like reviewers to have a close look at?

Is there anything that you want to discuss further?

I would like to have your opinion on the best way to select normal cells as input for inferCNV. I am quite satisfied with the labels from the fetal kidney reference, fetal_kidney_predicted.compartment, which are divided into:

endothelial
immune
fetal kidney
stroma

I think that we can safely take the immune and endothelial cells as the healthy reference and run inferCNV from here. Then, with the result of inferCNV, I hope to be able to further split the fetal kidney and stroma compartments into normal and cancer blastema, epithelial and stroma cells.

Author checklists

Check all those that apply. Note that you may find it easier to check off these items after the pull request is actually filed.

Analysis module and review

[x] This analysis module uses the analysis template and has the expected directory structure.
[x] The analysis module README.md has been updated to reflect code changes in this pull request.
[x] The analytical code is documented and contains comments.
[ ] Any results and/or plots this code produces have been added to your S3 bucket for review.

Reproducibility checklist

[x] Code in this pull request has been added to the GitHub Action workflow that runs this module.
[x] The dependencies required to run the code in this pull request have been added to the analysis module Dockerfile.
[ ] If applicable, the dependencies required to run the code in this pull request have been added to the analysis module conda environment.yml file.
[x] If applicable, R package dependencies required to run the code in this pull request have been added to the analysis module renv.lock file.

maud-p commented 3 weeks ago

Thank you @sjspielman for looking into it :)

I have added the modified renv.lock, sorry, I missed it in my git add!

Regarding the for loop, I agree and will move it into the previous for loop. It was a way for me to avoid re-running notebooks 00-, 01- and 02- and get to the new results more quickly.

Hi @maud-p, thanks for filing this next PR!

I'm going to start having a careful look for review, but first there are two quick things I see off the bat that you can start working on if you want! For one, I left you a separate inline comment. Second, it doesn't look like the renv.lock file is up to date with additional packages used in this notebook. Can you please snapshot to update the lockfile? Thanks!

maud-p commented 3 weeks ago

Hi @sjspielman, thank you again for your review and advice! I just committed the changes, let me know what you think!

> For the first round of review here, I've looked things over for clarity and correctness. Overall it looks like it's in great shape!! After this first round, I'll do another round of review more focused on the science. Here are some comments in addition to the others I left inline:

> Can you update the results/README.md to include this notebook?

I haven't updated the results/README.md for now because the notebook 03_clustering_exploration.Rmd is not saving any output in results. I am just generating a report in notebook. So I updated the README.md file in the analysis module. Should I create one for the notebook directory?

> The functions you added & docs about them look great, thanks for doing that! It really makes the notebook easier to read and work with :) Let's just do a bit more reorganization:

> Can you scoot up the "Functions" section (which looks great, by the way!) to be above "Analysis" but after "Introduction"?

> Can you order the functions in the same order that they are used in the actual notebook?

> I'm not sure the alluvial plots (while very cool!) are the easiest to read, because of how long the cell type labels are. Is it possible to make that font small enough to be able to read the labels clearly? If not (or alternatively), I wonder if a heatmap might be a clearer plot to make here that shows counts of cells in each combination of groups? There are more complicated statistics that one could show in a heatmap comparing these groupings, but for an exploratory notebook like this I think just counts are probably sufficient. One way to make this plot would be (vs using existing heatmap packages) to use ggplot2::geom_rect(). You can create a new data frame that counts the combinations of cluster/annotation (dplyr::count() can help with this!) and then plot cluster & annotation against each other, with a fill aesthetic of the actual counts. Let me know if this makes sense or how I can further explain!

I tried to go for both:

I tried to improve the alluvial plots by switching to a Sankey plot, which should be basically the same, just with some space between categories. Unfortunately, SCpubr is no longer maintaining its nice do_SankeyPlot function, so I copy/pasted some of its old source code.
I also went for the heatmap of counts for each of the categories, as you suggested (if I understood correctly!). I was not sure how to write the documentation for the function, so I went for the shortest version!
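
To make the counting step behind such a heatmap concrete, here is a minimal, language-agnostic sketch in plain Python (the notebook itself is R, where dplyr::count() plays this role). The cell records and the helper names `count_combinations` and `to_matrix` are hypothetical and only illustrate the idea: tally cells per cluster/annotation pair, then lay the tallies out as a grid that a heatmap could fill by count.

```python
from collections import Counter

# Hypothetical per-cell (cluster, annotation) pairs; not real data from this module
cells = [
    ("0", "immune"),
    ("0", "immune"),
    ("1", "endothelial"),
    ("1", "stroma"),
    ("2", "stroma"),
]


def count_combinations(pairs):
    """Count cells for every observed (cluster, annotation) combination."""
    return Counter(pairs)


def to_matrix(counts):
    """Arrange counts with clusters as rows and annotations as columns.

    Combinations that were never observed are filled with 0.
    """
    clusters = sorted({cluster for cluster, _ in counts})
    annotations = sorted({annotation for _, annotation in counts})
    matrix = [
        [counts.get((cluster, annotation), 0) for annotation in annotations]
        for cluster in clusters
    ]
    return clusters, annotations, matrix


counts = count_combinations(cells)
clusters, annotations, matrix = to_matrix(counts)
```

Each entry of `matrix` would then map to one tile, with the count as the fill value.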

The aim of these two approaches is to show that whichever method we choose for labelling the cells (full or kidney fetal reference), they seem to converge on the identification of endothelial and immune cells.

This is important as I would like to use it as the next step for running inferCNV.

Would this make sense?

thank you!!

sjspielman commented 3 weeks ago

@maud-p just a quick heads up that I'm out of the office now at the AACR Pediatric conference, so I will be back to review this and chat about inferCNV next week. Have a good weekend in the meantime!

maud-p commented 3 weeks ago

> @maud-p just a quick heads up that I'm out of the office now at the AACR Pediatric conference, so I will be back to review this and chat about inferCNV next week. Have a good weekend in the meantime!

Hi @sjspielman, thanks for letting me know, hope you enjoy the conference! I'll also be at a conference next week, 16-18 September, FYI :)

maud-p commented 2 weeks ago

Dear @sjspielman, thank you very much for the review and detailed explanations! I'll work on it hopefully tomorrow or Thursday at the latest ;)

I agree with adding more heatmaps/comparisons! Regarding inferCNV/copyKAT, you have a great point here. I was thinking of using inferCNV as you suggested for copyKAT, but it might be better to do it in two steps then: 1) copyKAT to help annotate malignant versus normal cells and 2) inferCNV to confirm the CNVs from copyKAT in the malignant cells?

Thank you!!

sjspielman commented 2 weeks ago

> I was thinking of using inferCNV as you suggested for copyKAT, but it might be better to do it in two steps then

Yes, I think this is probably the way to go - copyKAT can (maybe!) help us identify tumor vs. normal, and inferCNV can potentially be used to validate some of those calls. When we get there, we'll want to do this one sample at a time since results will probably be really different among samples!

maud-p commented 2 weeks ago

Hi @sjspielman, I think I addressed your comments/suggestions :)

I looked at the comparison of the 2 fetal reference annotations for ~10 samples, and they seem to agree quite well for the endothelial and immune cells.

I like the fetal kidney reference the most, as the annotations are quite simple, but also detailed enough for our purpose.

My thought would be to go with the fetal_kidney_predicted_compartment and

[ ] run copyKAT using endothelial and immune as healthy cells
[ ] subset and re-cluster the fetal_nephron to see if we can achieve better segregation of cells (healthy versus cancer, epithelial versus blastema)
[ ] subset and re-cluster stroma to see if we can achieve better segregation of cells (healthy versus cancer)

Another option would be to use for copyKAT only the cells that are annotated as endothelial or immune by the label transfer from both fetal references. A bit more complex to write maybe, but it might be the safest way to identify true endothelial and immune cells?

Let me know what you think!
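
As a rough illustration of this "both references agree" option, here is a minimal sketch in plain Python (not the module's R code). It assumes the labels from the two references have already been harmonized to the same compartment-level vocabulary, and it requires both references to give the same label; the barcodes, the label dictionaries and the function name `consensus_normal_cells` are all hypothetical.

```python
# Compartment labels treated as healthy reference cells in this sketch
NORMAL_LABELS = {"endothelial", "immune"}


def consensus_normal_cells(labels_ref_a, labels_ref_b, normal_labels=NORMAL_LABELS):
    """Return the sorted barcodes that both references assign to the same normal label.

    labels_ref_a and labels_ref_b map cell barcode -> predicted compartment.
    Cells missing from either reference are skipped.
    """
    shared_barcodes = labels_ref_a.keys() & labels_ref_b.keys()
    return sorted(
        barcode
        for barcode in shared_barcodes
        if labels_ref_a[barcode] == labels_ref_b[barcode]
        and labels_ref_a[barcode] in normal_labels
    )


# Hypothetical predictions from the full fetal and the fetal kidney references
full_reference = {"c1": "immune", "c2": "endothelial", "c3": "stroma", "c4": "immune"}
kidney_reference = {
    "c1": "immune",
    "c2": "stroma",
    "c3": "stroma",
    "c4": "endothelial",
    "c5": "immune",
}

normal_cells = consensus_normal_cells(full_reference, kidney_reference)
```

Only cells on which both references agree are kept, so disagreements such as `c2` and `c4` are excluded from the healthy set rather than resolved in favour of one reference.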

Thank you!!

maud-p commented 2 weeks ago

Dear @sjspielman, I should have made the few changes and added the last html notebooks :)

For some reason, I got an error for one of the samples, SCPCS000197. I will have a closer look at why and update you on this.

But I wanted to share the notebooks with you already. I had a look at a few of the reports, and it seems that the different annotation strategies converge on the identification of endothelial and immune cells. I especially like the fetal kidney reference, fetal_kidney_predicted.compartment, which also seems to perform quite well judging by the dotplots of marker genes.

FYI, I will be away from tomorrow until next Thursday, I'll be at the SIOP RTSG. I hope to hear & learn new relevant insights for Wilms tumor!

Thank you!

sjspielman commented 1 week ago

> For some reason, I got an error for one of the samples, SCPCS000197. I will have a closer look at why and update you on this.

I'll have a look at this sample and see if I can track down the problem.

> But I wanted to share the notebooks with you already. I had a look at a few of the reports, and it seems that the different annotation strategies converge on the identification of endothelial and immune cells. I especially like the fetal kidney reference, fetal_kidney_predicted.compartment, which also seems to perform quite well judging by the dotplots of marker genes.

Thanks again for sharing all of these! I'll look through them all and see if we come to the same conclusions.

maud-p commented 1 week ago

It worked, thank you so much @sjspielman ! I added the last notebook 🎉

maud-p commented 1 week ago

Dear @sjspielman ,

Thank you very much, this all makes a lot of sense. I will re-run the analysis and should upload the notebooks by Thursday (I will be travelling tomorrow and am not sure whether I'll have access to our server).

I like the idea to look at all samples, I'll work on a notebook and start a new PR :)

Thank you!

maud-p commented 1 week ago

Dear @sjspielman, thank you very much !!! I am working on the next PR, I'll get back to you soon!

Thank you very much for your help and effort to make it work!!!

AlexsLemonade / OpenScPCA-analysis