01_clustering - Githubissues

Purpose/implementation Section

Hi DataLab team,

I apologize, I messed something up. I created a new branch, maud-p-01-clustering, that I wanted to be associated with the new changes in this pull request. But for some reason, the changes are in the main branch (https://github.com/AlexsLemonade/OpenScPCA-analysis/pull/672/commits), and I don't know how to open a new pull request from the main branch as there is already one open. I am sorry for that. If you could let me know how best to do this next time, that would be great, thank you!

For now, in order to show a bit of progress, I describe below the changes I made to my module cell-type-wilms-tumor-06. You can track the commits here (https://github.com/AlexsLemonade/OpenScPCA-analysis/pull/672/commits).

Please link to the GitHub issue that this pull request addresses.

https://github.com/AlexsLemonade/OpenScPCA-analysis/issues/679 https://github.com/AlexsLemonade/OpenScPCA-analysis/discussions/635

What is the goal of this pull request?

The main addition to the module is an RMarkdown report for one Wilms tumor sample (dataset SCPCP000006, sample SCPCS000169). The aim is to discuss the report and possible improvements before adapting it and rendering it for all samples in the dataset.

What does this pull request contain?

The pull request contains the Rmd file 01-clustering_SCPCS000169.Rmd in the cell-type-wilms-tumor-06 folder and the HTML report in the notebook folder.

The Dockerfile required to build the Docker image and start RStudio has been updated with the required packages.

We added the clinical data for the dataset (SCPCP000006_metadata.tsv) to better track and understand the sample if needed.

Briefly describe the general approach you took to achieve this goal.

The analysis was performed as follows:

[0] We built a Seurat object from the counts data and followed the standard Seurat workflow [normalization --> reduction --> clustering].

[1] We performed some quality checks to assess any QC-induced clustering (nFeature, nCount, percent.mito).

[2] We added cell cycle information, because in certain cell cycle states the transcriptional program is mostly or exclusively driven by cell cycle genes, which makes the identity of the cells difficult to determine. We expect these cells to group together in a cluster of proliferating cells.

[3] We ran DElegate::FindAllMarkers2 to find markers of the different clusters and manually checked whether they make sense. DElegate::FindAllMarkers2 is an improved version of Seurat::FindAllMarkers based on a pseudobulk differential expression method.

[4] We looked at specific marker genes, which we reported in the table marker.sets/CellType_metadata.csv, to check the relevance of the clustering.

[5] We plotted the PCA/UMAP reductions grouped by the annotations available from the DataLab (SingleR, CellAssign). We expect at least the immune cells to be correctly labeled and to fall into a small set of clusters.

[6] We ran label transfer (Azimuth) to transfer annotations from the human fetal kidney atlas reference. We plotted the PCA/UMAP reductions grouped by the transferred labels. We expect these labels to be the most representative of the cell types in the sample.
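Steps [0] to [2] could look roughly like the sketch below. It is not the code from 01-clustering_SCPCS000169.Rmd: the function name `cluster_sample`, the number of principal components, the clustering resolution, and the `^MT-` pattern (which assumes gene symbols as row names) are all assumptions made for illustration.

```r
# Hypothetical helper, not taken from the module: a minimal Seurat workflow
# (normalization --> reduction --> clustering) plus QC and cell cycle metadata.
# `counts` is assumed to be a gene x cell count matrix with gene symbols as row names.
cluster_sample <- function(counts, n_pcs = 30, resolution = 0.8) {
  stopifnot(requireNamespace("Seurat", quietly = TRUE))

  obj <- Seurat::CreateSeuratObject(counts = counts)

  # QC metric: percentage of counts from mitochondrial genes (assumed "MT-" prefix)
  obj[["percent.mito"]] <- Seurat::PercentageFeatureSet(obj, pattern = "^MT-")

  # Normalization, dimensionality reduction, clustering (assumed parameter values)
  obj <- Seurat::NormalizeData(obj)
  obj <- Seurat::FindVariableFeatures(obj)
  obj <- Seurat::ScaleData(obj)
  obj <- Seurat::RunPCA(obj, npcs = n_pcs)
  obj <- Seurat::FindNeighbors(obj, dims = seq_len(n_pcs))
  obj <- Seurat::FindClusters(obj, resolution = resolution)
  obj <- Seurat::RunUMAP(obj, dims = seq_len(n_pcs))

  # Cell cycle scores and phase, using the gene lists shipped with Seurat
  cc <- Seurat::cc.genes.updated.2019
  obj <- Seurat::CellCycleScoring(
    obj,
    s.features = cc$s.genes,
    g2m.features = cc$g2m.genes
  )

  obj
}
```

The QC metrics can then be inspected per cluster, for example with `Seurat::VlnPlot(obj, features = c("nFeature_RNA", "nCount_RNA", "percent.mito"))`, to see whether any cluster is driven by quality rather than biology.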

If known, do you anticipate filing additional pull requests to complete this analysis module?

[ ] After discussion with you, I will adapt the script and render it for all 40 samples of the dataset.

[ ] We will save the rds file for each sample.

[ ] We will run inferCNV for each sample to decide the malignant/normal status of some stromal and epithelial clusters, and to confirm the blastema annotation.

The next steps will give us a better understanding of the entire cohort. We will then have to set up a strategy to annotate each sample. Open questions are:

[ ] Should we annotate single cells, or [ ] consider applying similar annotations to all cells in a cluster?

[ ] Manual annotation of each cluster / each patient, or [ ] automated annotation using some threshold?
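One possible way to render the report for every sample is a parameterized RMarkdown template driven by a small loop. The sketch below is only an illustration: the template name `01-clustering.Rmd`, the `sample_id` parameter, and the helper names are hypothetical and do not exist in this pull request.

```r
# Hypothetical sketch: build one render job per sample, then render a
# parameterized report. The template name and the `sample_id` parameter
# are assumptions, not files or parameters that exist in this PR.
make_render_plan <- function(sample_ids, output_dir = "notebook") {
  data.frame(
    sample_id = sample_ids,
    output_file = file.path(output_dir, paste0("01-clustering_", sample_ids, ".html")),
    stringsAsFactors = FALSE
  )
}

render_all <- function(plan, template = "01-clustering.Rmd") {
  stopifnot(requireNamespace("rmarkdown", quietly = TRUE))
  for (i in seq_len(nrow(plan))) {
    rmarkdown::render(
      input = template,
      params = list(sample_id = plan$sample_id[i]),
      output_file = basename(plan$output_file[i]),
      output_dir = dirname(plan$output_file[i]),
      envir = new.env() # fresh environment so samples do not share state
    )
  }
  invisible(plan)
}

# Build the plan for the one sample analyzed so far; rendering is a separate call.
plan <- make_render_plan(c("SCPCS000169"))
```

Keeping the plan separate from the rendering step makes it easy to check the output paths before launching all 40 renders with `render_all(plan)`.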

Results

What is the name of your results bucket on S3?

I haven't uploaded anything on S3 yet.

What types of results does your code produce (e.g., table, figure)?

HTML report/notebook.

What is your summary of the results?

The primary aim of this script was to assess the quality of the clustering and to explore the clustering parameters. On this sample, we are happy with the clustering, as we could identify:

one cluster of immune cells (cell marker PTPRC, SingleR, Azimuth label transfer from the fetal kidney)
one cluster of endothelial cells (cell marker VWF, SingleR, Azimuth label transfer from the fetal kidney)
one set of clusters of stromal cells
one set of clusters of epithelial cells
one set of clusters of blastema cancer cells.

See selected plots to summarize the analysis.

Additionally, we showed that running label transfer with RunAzimuth is feasible (part of step 3 of the proposed analysis: https://github.com/maud-p/OpenScPCA-analysis/issues/1).

We also showed that normal cells (endothelial and immune cells) can be identified easily and robustly; these would serve as the reference for the CNV inference (inferCNV, step 4 of the analysis).

Provide directions for reviewers

I would like to have your opinion on the report before rendering it for all WT samples.

[ ] Is the analysis workflow OK? Can we pursue this strategy?

[ ] Suggestions for improvement.

What are the software and computational requirements needed to be able to run the code in this PR?

Copy/pasted from the README.md file in OpenScPCA-analysis/analyses/cell-type-wilms-tumor-06/

To perform the analysis, run the RMarkdown script in R (version 4.4.1). The main packages used are:

Seurat version 5
Azimuth version 5
inferCNV
SCpubr for visualization
DT for table visualization
DElegate for differential expression analysis

For complete reproducibility of the results, you can build and run the Docker image using the Dockerfile. This will allow you to work in RStudio (R version 4.4.1), starting from the base image bioconductor/tidyverse:3.19.

In the config.yaml file, define your system-specific parameters and paths (e.g. to the data). Execute the run.sh file and open RStudio in your browser (http://localhost:8080/). By default, username = rstudio, password = wordpass.

Are there particular areas you'd like reviewers to have a close look at?

Is there anything that you want to discuss further?

Author checklists

Analysis module and review

[x] This analysis module uses the analysis template and has the expected directory structure.
[x] The analysis module README.md has been updated to reflect code changes in this pull request.
[x] The analytical code is documented and contains comments.
[ ] Any results and/or plots this code produces have been added to your S3 bucket for review.

Reproducibility checklist

[ ] Code in this pull request has been added to the GitHub Action workflow that runs this module.
[x] The dependencies required to run the code in this pull request have been added to the analysis module Dockerfile.
[ ] If applicable, the dependencies required to run the code in this pull request have been added to the analysis module conda environment.yml file.
[ ] If applicable, R package dependencies required to run the code in this pull request have been added to the analysis module renv.lock file.

AlexsLemonade / OpenScPCA-analysis

01_clustering #680