01 clustering wilms 06

maud-p commented 3 months ago

Purpose/implementation Section

Please link to the GitHub issue that this pull request addresses.

Issue #679

What is the goal of this pull request?

In this pull request, I would like to introduce a notebook that I use to get a first impression of a sample. I went from the basic approach (looking at known markers) to more advanced label transfer and enrichment analysis of marker genes. Across the 3 samples I have started to look into, I realized that some of the steps are crucial and others less so, but I would be grateful for your input.

Briefly describe the general approach you took to achieve this goal.

For this analysis, we worked with the _processed.rds data. We built a Seurat object from the counts data and re-performed the analysis [normalization –> reduction –> clustering] following the Seurat workflow.

We transferred meta.data to keep:

QC data computed by the DataLab
annotation data computed by the DataLab
raw annotation and gene_symbol conversion

We perform the following analyses to assess the quality of the clustering and get a first impression of the sample:

[1] We perform some quality checks to detect any QC-induced clustering (nFeature, nCount, percent.mito).

[2] We add cell cycle information, as we know that in a specific cell cycle state, the transcriptional program is mostly/exclusively related to cell cycle genes and the identity of cells is difficult to determine. We expect these cells to cluster together in a cluster of proliferating cells.

[3] We look at specific marker genes that we reported in the table marker.sets/CellType_metadata.csv to check the relevance of the clustering.

[4] We look at specific pathways that we reported in the table marker.sets/Pathways_metadata.csv to check the relevance of the clustering.

[5] We run DElegate::FindAllMarkers2 to find markers of the different clusters and manually check whether they make sense. DElegate::FindAllMarkers2 is an improved version of Seurat::FindAllMarkers based on a pseudobulk differential expression method.

[6] We perform enrichment analysis of the marker genes for each Seurat cluster. We defined all the genes from the Seurat object as the universe and used the MSigDB gene sets.

[7] We plot PCA/UMAP reductions grouped by the available annotations (SingleR, CellAssign). We expect at least immune cells to be correctly labeled and to fall into a small set of clusters.

[8] We run label transfer (Azimuth) to transfer annotations from the human fetal kidney atlas reference. We plot PCA/UMAP reductions grouped by the transferred labels. We expect these labels to be the most representative of the cell types in the sample.
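To make the role of the universe in step [6] concrete, here is a minimal sketch of a hypergeometric over-representation test. It is not the code from the notebook (which is in R); the function name `enrichment_pvalue` and the toy gene names are hypothetical, and the choice of a one-sided hypergeometric test is an assumption about how such an enrichment is typically computed.

```python
from math import comb


def enrichment_pvalue(markers, gene_set, universe):
    """One-sided hypergeometric test for over-representation.

    Returns (k, p) where k is the number of marker genes found in the
    gene set and p is P(X >= k) when drawing len(markers) genes at
    random from the universe. Genes outside the universe are ignored.
    """
    universe = set(universe)
    markers = set(markers) & universe
    gene_set = set(gene_set) & universe

    n_universe = len(universe)
    n_set = len(gene_set)
    n_markers = len(markers)
    k = len(markers & gene_set)

    # Sum the upper tail of the hypergeometric distribution.
    tail = sum(
        comb(n_set, i) * comb(n_universe - n_set, n_markers - i)
        for i in range(k, min(n_set, n_markers) + 1)
    )
    return k, tail / comb(n_universe, n_markers)


# Toy example with hypothetical gene names: 10 genes in the universe,
# a 3-gene set, and 3 cluster markers of which 2 fall in the set.
toy_universe = [f"g{i}" for i in range(10)]
overlap, pvalue = enrichment_pvalue(
    markers=["g0", "g1", "g5"],
    gene_set=["g0", "g1", "g2"],
    universe=toy_universe,
)
print(overlap, round(pvalue, 4))
```

Because the p-value depends on the size of the universe, restricting the universe to the genes present in the Seurat object (rather than all annotated genes) changes the result, which is why the universe is stated explicitly above.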

If known, do you anticipate filing additional pull requests to complete this analysis module?

Yes! Based on the present PR, I would like to refine the approach to annotate the cells in each sample. More PRs are coming!

Results

I haven't sent anything to S3, but I am wondering whether I should upload the fetal kidney reference so it is easy to use. Otherwise, I can upload the different steps I used to make the fetal kidney atlas compatible with Azimuth. I did this 2 years ago, with some steps in Python and some in R. I could try to clean up the Docker container used for it and publish it in the module. To give you a feeling for what to expect, I added the script I used to generate the fetal kidney reference: 01_fetal_reference_kidney.Rmd

Let me know what you think!

What is the name of your results bucket on S3?

What types of results does your code produce (e.g., table, figure)?

RMarkdown - HTML notebooks

What is your summary of the results?

I tested different approaches/tools to get a feeling for the cells composing the samples.

I think the first clustering is not too bad and could be used for visualization. This was the first aim of the PR :)

I found the most helpful approach is to transfer labels from the healthy fetal kidney reference. This allows identifying with high confidence (i) immune cells (normal), (ii) endothelial cells (normal), (iii) stromal cells (cancer + normal) and (iv) nephron cells (blastema + epithelial cancer + normal cells).

The use of known marker genes is useful to check the clustering but cannot be used, imho, to annotate cells.

Pathway enrichment analysis using the MSigDB C8 gene sets can be used, imho, to annotate cells, as cancer cells, blastema or primitive epithelium should be enriched in fetal kidney pathways, while healthy/normal kidney should be enriched in mature/adult differentiated kidney pathways.

Provide directions for reviewers

To annotate the cells, I would re-evaluate my strategy as follows:

1) Run label transfer from the fetal kidney reference and label cells with the predicted compartments, i.e. immune (normal), endothelial (normal), stroma (normal + cancer) and fetal nephron (normal + cancer).

2) Subset the stroma and fetal nephron cells separately and re-perform the clustering step. I expect the normal/cancer subtypes to be better segregated in a dataset of stromal cells or epithelial cells only. To check the new clustering, I will use a similar approach including marker genes, pathways, and label transfer from the fetal kidney atlas.

3) Run inferCNV to get additional information that can help conclude normal versus cancer based on a low versus high number of rearrangements.
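As a rough illustration of how steps 1) and 2) fit together, the sketch below groups cells by predicted compartment and keeps only the two compartments that mix normal and cancer cells for re-clustering. It is a standalone toy, not the module's code: the function name `split_for_reclustering`, the compartment strings and the cell barcodes are all hypothetical.

```python
# Compartments expected to contain both normal and cancer cells,
# and therefore re-clustered on their own (see steps 1 and 2).
NEEDS_RECLUSTERING = ("stroma", "fetal_nephron")


def split_for_reclustering(predicted_compartment):
    """Group cell barcodes by compartment, keeping only mixed compartments.

    `predicted_compartment` maps a cell barcode to the compartment
    predicted by label transfer. Immune and endothelial cells are left
    out because they are treated as normal and need no further split.
    """
    subsets = {name: [] for name in NEEDS_RECLUSTERING}
    for barcode, compartment in predicted_compartment.items():
        if compartment in subsets:
            subsets[compartment].append(barcode)
    # Sort barcodes so the output does not depend on input order.
    return {name: sorted(cells) for name, cells in subsets.items()}


# Hypothetical predictions for five cells.
predictions = {
    "cell_1": "immune",
    "cell_2": "stroma",
    "cell_3": "fetal_nephron",
    "cell_4": "stroma",
    "cell_5": "endothelial",
}
subsets = split_for_reclustering(predictions)
print(subsets)
```

Each returned list would then be the input to a separate normalization, reduction and clustering run, with the inferCNV result from step 3) added afterwards as an extra per-cell annotation.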

What are the software and computational requirements needed to be able to run the code in this PR?

I used RStudio and renv. The renv.lock file has been updated!

Are there particular areas you'd like reviewers to have a close look at?

Is there anything that you want to discuss further?

The strategy for the next step, if you think it makes sense!

Author checklists

Check all those that apply. Note that you may find it easier to check off these items after the pull request is actually filed.

Analysis module and review

[x] This analysis module uses the analysis template and has the expected directory structure.
[ ] The analysis module README.md has been updated to reflect code changes in this pull request.
[x] The analytical code is documented and contains comments.
[ ] Any results and/or plots this code produces have been added to your S3 bucket for review.

Reproducibility checklist

[ ] Code in this pull request has been added to the GitHub Action workflow that runs this module.
[x] The dependencies required to run the code in this pull request have been added to the analysis module Dockerfile.
[ ] If applicable, the dependencies required to run the code in this pull request have been added to the analysis module conda environment.yml file.
[x] If applicable, R package dependencies required to run the code in this pull request have been added to the analysis module renv.lock file.

jaclyn-taroni commented 3 months ago

Hi @maud-p - thank you for filing this! @sjspielman is going to be your initial reviewer.

maud-p commented 3 months ago

Thank you!

FYI, I just opened a new issue (https://github.com/AlexsLemonade/OpenScPCA-analysis/issues/703) and plan to add a few steps to allow you and the community to rebuild the Azimuth-compatible kidney reference. This is missing from the present PR, as I wasn't able to upload the fetal kidney reference (the files are too large).

maud-p commented 3 months ago

Thank you @sjspielman for your detailed comments and advice! It is all clear and I expect to file the discussed PR either today or tomorrow :)

AlexsLemonade / OpenScPCA-analysis

01 clustering wilms 06 #699
