AlexsLemonade / OpenScPCA-analysis

An open, collaborative project to analyze data from the Single-cell Pediatric Cancer Atlas (ScPCA) Portal

Start an analysis for Wilms tumor annotation (SCPCP000014) #764

Closed JingxuanChen7 closed 2 months ago

JingxuanChen7 commented 2 months ago

Purpose/implementation Section

In this PR, I'm initializing an analysis module skeleton for the Wilms tumor dataset SCPCP000014 and checking in scripts for pre-processing the provided SCE objects.

Please link to the GitHub issue that this pull request addresses.

What is the goal of this pull request?

This PR adds scripts that pre-process the provided SCE objects with a standard Seurat workflow. The resulting objects will be used for the analysis in my next PR.

Briefly describe the general approach you took to achieve this goal.

A standard Seurat workflow (normalization, feature selection, PCA, clustering, dimensionality reduction, etc.) was applied to all 10 samples in this dataset.
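For illustration, the workflow steps listed above can be sketched as follows. This is a minimal sketch on simulated counts, not the code checked in with this PR: the toy matrix, the object names, and every parameter value (`nfeatures`, `npcs`, `dims`, `resolution`) are assumptions chosen only so the example runs quickly.

```r
# Illustrative only: simulated data and parameter values are assumptions,
# not the settings used in this PR.
library(Seurat)

set.seed(2024)

# Simulate a small genes x cells count matrix in place of a real sample.
n_genes <- 300
n_cells <- 120
gene_means <- runif(n_genes, min = 0.5, max = 5)
counts <- matrix(
  rpois(n_genes * n_cells, lambda = rep(gene_means, times = n_cells)),
  nrow = n_genes,
  dimnames = list(
    paste0("gene-", seq_len(n_genes)),
    paste0("cell-", seq_len(n_cells))
  )
)
counts <- Matrix::Matrix(counts, sparse = TRUE)

seurat_obj <- CreateSeuratObject(counts = counts)

# Normalization, feature selection, and scaling
seurat_obj <- NormalizeData(seurat_obj, verbose = FALSE)
seurat_obj <- FindVariableFeatures(seurat_obj, nfeatures = 100, verbose = FALSE)
seurat_obj <- ScaleData(seurat_obj, verbose = FALSE)

# PCA, graph-based clustering, and UMAP
seurat_obj <- RunPCA(seurat_obj, npcs = 20, verbose = FALSE)
seurat_obj <- FindNeighbors(seurat_obj, dims = 1:10, verbose = FALSE)
seurat_obj <- FindClusters(seurat_obj, resolution = 0.8, verbose = FALSE)
seurat_obj <- RunUMAP(seurat_obj, dims = 1:10, verbose = FALSE)

# Save the processed object as an .rds file (the path is a placeholder).
output_file <- file.path(tempdir(), "toy_sample.rds")
saveRDS(seurat_obj, output_file)
```

In a real module, the same steps would be applied to each sample in turn, with one `.rds` file written per sample.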

If known, do you anticipate filing additional pull requests to complete this analysis module?

Results

What is the name of your results bucket on S3?

s3://researcher-009160072044-us-east-2/cell-type-wilms-tumor-14/results/00_preprocessing_rds/

What types of results does your code produce (e.g., table, figure)?

Intermediate RDS files containing pre-processed Seurat objects.

What is your summary of the results?

The results consist of 10 .rds files, each containing a Seurat object for further analysis.

Provide directions for reviewers

In this section, tell reviewers what kind of feedback you are looking for. This information will help guide their review.

What are the software and computational requirements needed to be able to run the code in this PR?

Are there particular areas you'd like reviewers to have a close look at?

For this first PR, I would like to make sure that the way I set up the module skeleton, computing environment, and documentation meets the project's needs.

Is there anything that you want to discuss further?

I have some questions for this PR:

  1. I'm a bit confused about the scope and purpose of GitHub issues. Should we open a new issue for each PR, or can multiple PRs be used to resolve one issue?
  2. So far I'm using a conda environment to install all the R packages I need. Should I add the code that sets up the conda environment to the Dockerfile, or is it OK to leave it as is? When I ran the code on the virtual computer, I generally didn't use Docker.

Thanks for reviewing!

Author checklists

Check all those that apply. Note that you may find it easier to check off these items after the pull request is actually filed.

Analysis module and review

Reproducibility checklist

JingxuanChen7 commented 2 months ago

Hello @jashapiro , thanks for all the help!

> As for the module setup itself, I very much appreciate you setting up a distinct environment for your work with all of your package dependencies included. It seems like your project is mostly based in R? If that is the case, we would in general prefer that you set up your environment with renv, as that allows us to more easily track the versions of specific packages and dependencies. The versions available on conda are often a bit out of date as well, so this allows us to keep up better with current versions of Bioconductor, etc.
>
> If you would like any help setting that up, please let me know, or if you have a compelling reason that you would prefer to stick with conda for this purpose, we can also consider that for this project.

Thanks so much for the suggestion. I understand that renv makes it much easier to manage R packages. I have switched to recording package versions with renv in my latest commit (https://github.com/AlexsLemonade/OpenScPCA-analysis/pull/764/commits/8e696213eb9b8d9c712fdf42a7fca0ce9e1489d6).

> Since I think it makes sense to start with the project as a whole, I did not spend too much time looking at your processing code. I did note, however, that you are performing normalization and dimension reduction within Seurat. This may be required by your downstream analysis, but I will note that we do include default normalized and dimension reduction matrices, and it may be more efficient to maintain the pre-computed values in the SingleCellExperiment object rather than recalculating them. If this is something you would like to pursue, I am happy to answer questions about where those values are located. The objects and processing steps are generally described in the portal documentation at https://scpca.readthedocs.io/en/stable/sce_file_contents.html

Thanks for the information. In my preprocessing code, I modified some parameters in the Seurat workflow (e.g., the number of features used in feature selection, and whether to run Harmony, since I may merge samples in the following analysis). In the future, I may want to try different feature selection or clustering algorithms, since cell types are not well separated in some samples (preliminary results, not shown in this PR). Therefore, I would like to keep my pre-processing code for now.

I think all the comments have been resolved in my last two commits. Please let me know if I should make any other changes. Thanks again for all the help!

JingxuanChen7 commented 2 months ago

Hi @jashapiro , thank you so much for all the suggestions on coding robustness and style. I really appreciate it!

Regarding gene symbols vs. Ensembl IDs, I am keeping Ensembl IDs at this stage, so I have also updated the resulting .rds Seurat files in the S3 bucket. Regarding doublets, I added a column to the metadata instead of removing them. Other minor coding style changes are included in my latest commit: https://github.com/AlexsLemonade/OpenScPCA-analysis/pull/764/commits/9be5d74171dfd607f8581ad126e60715023e05ba
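As a rough illustration of flagging doublets rather than dropping them, here is a minimal base R sketch. The barcodes, the column names `doublet_score` and `is_doublet`, and the 0.5 threshold are all hypothetical and are not taken from the PR code.

```r
# Hypothetical per-cell metadata; barcodes, column names, and the score
# threshold are assumptions for illustration only.
cell_metadata <- data.frame(
  barcode = c("AAAC", "AAAG", "AACT", "AAGT"),
  doublet_score = c(0.05, 0.92, 0.40, 0.71),
  stringsAsFactors = FALSE
)

doublet_threshold <- 0.5

# Flag likely doublets in a new column; every cell stays in the table.
cell_metadata$is_doublet <- cell_metadata$doublet_score > doublet_threshold

# Downstream steps can still choose to exclude flagged cells when needed.
singlets <- cell_metadata[!cell_metadata$is_doublet, ]
```

Keeping the flag as a column leaves the decision about excluding doublets to each downstream step. For a Seurat object, a comparable column could be attached to the cell metadata with `AddMetaData()`.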

Thanks again for the careful code review. Let me know if anything else needs to be modified!

jashapiro commented 2 months ago

Oh, I did have one more comment, which was that you might want to store the .rds Seurat files just in the scratch/ directory. I am not sure you will need to sync them to S3, as at this stage they are not really results files so much as reformatted data files. I was able to verify that you are successfully syncing though, which has some value for the future!

JingxuanChen7 commented 2 months ago

> Oh, I did have one more comment, which was that you might want to store the .rds Seurat files just in the scratch/ directory. I am not sure you will need to sync them to S3, as at this stage they are not really results files so much as reformatted data files. I was able to verify that you are successfully syncing though, which has some value for the future!

Hi @jashapiro, thank you for reminding me about the scratch/ folder! It's more appropriate to put the intermediate files in scratch/ instead of results/. I will re-sync the results in my next PR.

In addition, I applied the suggested changes in commit https://github.com/AlexsLemonade/OpenScPCA-analysis/pull/764/commits/1731730a028e1a6a044cfb2b460a21deca3ee074. Again, I appreciate all the suggestions, which are really helpful!