Metadata file: compilation of a metadata file of marker genes for expected cell types that will be used for validation at a later step

maud-p commented 3 months ago

Purpose/implementation Section

In this first module, I compile two metadata tables from the literature: one of marker genes and one of known genetic alterations. Both will be used later to validate annotations of the Wilms tumor dataset.

Please link to the GitHub issue that this pull request addresses.

https://github.com/AlexsLemonade/OpenScPCA-analysis/issues/671 https://github.com/AlexsLemonade/OpenScPCA-analysis/discussions/635#discussioncomment-10140478

What is the goal of this pull request?

Wilms tumor (WT) is the most common pediatric kidney cancer and is characterized by pronounced intra- and inter-tumor heterogeneity. The genetic landscape of WT is very diverse within each histological component. The Children's Oncology Group (COG) classifies WT patients into two groups: favorable histology and diffuse anaplasia. Each of these groups is composed of blastemal, epithelial, and stromal populations of cancer cells in different proportions, as well as cells from the normal kidney, mostly kidney epithelial cells, endothelial cells, immune cells, and normal stromal cells (fibroblasts).

In this module, we reviewed the literature to compile a table of marker genes for each of the expected cell types in the dataset. Additionally, we provide a table of known genetic alterations in Wilms tumor that can be used to validate the CNV profiles obtained after running inferCNV.

Briefly describe the general approach you took to achieve this goal.

The table CellType_metadata.csv contains the following columns:

gene_symbol contains the symbol of the described gene, following the HUGO Gene Nomenclature
ENSEMBL_ID contains the stable identifier from the ENSEMBL database
cell_class is "malignant" for marker genes specific to the malignant population, "non-malignant" for marker genes specific to non-malignant tissue, or "both" for marker genes that can be found in malignant as well as non-malignant tissue but are still informative with respect to the cell type
cell_type contains the list of the cell types that are attributed to the marker gene
DOI contains the list of main publication identifiers supporting the choice of the marker gene
comment can be empty or contain any additional information
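
As a possible sanity check on this table, the sketch below verifies the header and the cell_class values described above. It is only an illustration: the function name check_cell_type_table is hypothetical, the example rows are placeholders (not real marker genes or DOIs), and treating gene_symbol, ENSEMBL_ID, and cell_type as required is an assumption.

```python
import csv
import io

# Column names and cell_class values as listed in the description above.
EXPECTED_COLUMNS = ["gene_symbol", "ENSEMBL_ID", "cell_class", "cell_type", "DOI", "comment"]
ALLOWED_CELL_CLASSES = {"malignant", "non-malignant", "both"}
# Assumption: these columns must never be empty.
REQUIRED_COLUMNS = ("gene_symbol", "ENSEMBL_ID", "cell_type")


def check_cell_type_table(handle):
    """Return a list of human-readable problems found in the marker gene table."""
    reader = csv.DictReader(handle)
    if reader.fieldnames != EXPECTED_COLUMNS:
        return [f"unexpected header: {reader.fieldnames}"]
    problems = []
    for line_number, row in enumerate(reader, start=2):  # line 1 is the header
        if row["cell_class"] not in ALLOWED_CELL_CLASSES:
            problems.append(f"line {line_number}: cell_class {row['cell_class']!r} not allowed")
        for column in REQUIRED_COLUMNS:
            if not (row[column] or "").strip():
                problems.append(f"line {line_number}: {column} is empty")
    return problems


# Placeholder rows for illustration only; not real marker genes or DOIs.
example = io.StringIO(
    "gene_symbol,ENSEMBL_ID,cell_class,cell_type,DOI,comment\n"
    "GENE_A,ENSG_PLACEHOLDER_1,malignant,blastemal,10.0000/placeholder,\n"
    "GENE_B,ENSG_PLACEHOLDER_2,unknown,endothelial,10.0000/placeholder,\n"
)
problems = check_cell_type_table(example)
print(problems)
```

To check the real file, the same function could be called with open("CellType_metadata.csv", newline="") in place of the in-memory example.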

The table GeneticAlterations_metadata.csv contains the following columns:

alteration contains the number and portion of the affected chromosome
gain_loss indicates whether the corresponding genetic alteration is a gain or a loss
cell_class is "malignant"
cell_type contains the list of the malignant cell types that are attributed to the genetic alteration, either blastemal, stromal, or epithelial, or NA if none of the three histologies is more prone to the described genetic alteration
DOI contains the list of main publication identifiers supporting the choice of the genetic alteration
comment can be empty or contain any additional information
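
One way this table might feed into the later inferCNV validation is to group the alterations by direction, so that expected gains and losses can be compared with the inferred CNV profiles. The sketch below assumes that gain_loss is spelled "gain" or "loss" and that alteration looks like "1q"; both are guesses about the file format. The function name expected_cnv_events is hypothetical, and the rows are illustrative, not copied from the table.

```python
import csv
import io
from collections import defaultdict


def expected_cnv_events(handle):
    """Group alterations by direction, e.g. {"gain": [...], "loss": [...]}."""
    events = defaultdict(list)
    for line_number, row in enumerate(csv.DictReader(handle), start=2):
        # Assumption: the direction is spelled "gain" or "loss" (case-insensitive).
        direction = (row.get("gain_loss") or "").strip().lower()
        if direction not in {"gain", "loss"}:
            raise ValueError(f"line {line_number}: unexpected gain_loss value {direction!r}")
        events[direction].append((row.get("alteration") or "").strip())
    return dict(events)


# Illustrative rows only; the DOIs are placeholders.
example = io.StringIO(
    "alteration,gain_loss,cell_class,cell_type,DOI,comment\n"
    "1q,gain,malignant,NA,10.0000/placeholder,\n"
    "16q,loss,malignant,NA,10.0000/placeholder,\n"
)
events = expected_cnv_events(example)
print(events)
```

Raising on an unexpected gain_loss value is a design choice for the sketch: a misspelled direction would otherwise silently drop an alteration from the validation.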

If known, do you anticipate filing additional pull requests to complete this analysis module?

This module will be used for later validation of the annotations and results from inferCNV.

What is the name of your results bucket on S3?

Results should be uploaded to your bucket so they are available during review. See here for instructions on how to upload to your bucket: https://openscpca.readthedocs.io/en/latest/software-platforms/aws/working-with-s3-buckets/

What types of results does your code produce (e.g., table, figure)?

2 tables

Provide directions for reviewers

This section had 2 aims:

learn how to set up the GitHub repository, file an issue, and open a pull request
gather literature information into a metadata file for later use in validating the annotations

What are the software and computational requirements needed to be able to run the code in this PR?

Are there particular areas you'd like reviewers to have a close look at?

Is there anything that you want to discuss further?

Author checklists

Check all those that apply. Note that you may find it easier to check off these items after the pull request is actually filed.

Analysis module and review

[ ] This analysis module uses the analysis template and has the expected directory structure.
[x] The analysis module README.md has been updated to reflect code changes in this pull request.
[ ] The analytical code is documented and contains comments.
[ ] Any results and/or plots this code produces have been added to your S3 bucket for review.

Reproducibility checklist

[ ] Code in this pull request has been added to the GitHub Action workflow that runs this module.
[ ] The dependencies required to run the code in this pull request have been added to the analysis module Dockerfile.
[ ] If applicable, the dependencies required to run the code in this pull request have been added to the analysis module conda environment.yml file.
[ ] If applicable, R package dependencies required to run the code in this pull request have been added to the analysis module renv.lock file.

maud-p commented 3 months ago

Dear Jaclyn,

Thank you very much for your encouraging comment and all the information. It is really useful to understand the expected structure of the final folder/module. I will commit the expected changes asap and pursue the clustering analysis.

My understanding of the review process so far is:

1) I commit the changes that you requested and re-request a review until we are both fine with the changes, and then I can close the pull request / merge the commit into AlexsLemonade:main.
2) Before starting a new part of the analysis (like step 2, clustering), I open a new issue to describe the plan. Once I am done with the analysis, I submit another pull request linked to this new issue. And back to 1).

Did I get it right?

Regarding the maintenance of the Dockerfile, thank you very much for your offer. I would like to try to do it, but if checking and advising on it becomes more work for you than maintaining it yourself, please just let me know!

Thank you again.

jaclyn-taroni commented 3 months ago

I've looked at the commit history locally. One way to "remove" the clustering changes from this pull request would be to have you create a new branch and then refile the pull request (i.e., you close this one, and we start a new one).

The way you could do that is with the following steps.

First, you'd make sure you're on your main branch:

git checkout main

Then you're going to create a new branch (here I've called it start-wilms-analysis) at a place in the Git history before you added the clustering analysis:

git checkout -b start-wilms-analysis b754e5de88d7ec9d99be0b50db00e34d0b183a4b

Then you can push the new branch to GitHub with:

git push -u origin start-wilms-analysis

Then, you can file a new pull request using the new branch (start-wilms-analysis) from the GitHub UI.

You could largely copy and paste your initial comment when you file, and this closed PR here would retain the record of our conversation.

For now, I think we could plan to leave #680 as is and just make sure we don't merge it until the new PR goes into AlexsLemonade/main.

What do you think of this plan, @maud-p?

maud-p commented 3 months ago

Sounds good thank you @jaclyn-taroni for the precise steps :) I'll do it in a minute.

Regarding #680, I think I found a way to add my commit to the maud-p-01-clustering branch now!

maud-p commented 3 months ago

One question, @jaclyn-taroni: for the next step, each time I start a new step in the analysis, I should:

generate a new issue
generate a new branch and work on it

Correct?

jaclyn-taroni commented 3 months ago

one question, for the next step, each time I start a new step in the analysis, I should:
* generate a new issue
* generate a new branch and work on it

Correct?

Yes, that's right! More completely:

New issue
New branch
New pull request from the branch

So we're aiming for 1 issue:1 pull request, but it does not always work out that cleanly. If an issue is particularly "big" (like it will require two+ scripts or notebooks to accomplish), you might end up with something that looks like:

New issue
First branch
First pull request from the first branch
Second branch is created from the first branch
Second pull request from the second branch

And that's totally okay!

I think the most important takeaways are that your reviewers have enough information to do a good job with their review (e.g., context about your scientific goals), and the pull requests are a manageable size. ~400 lines that need to be reviewed or one script/notebook are some rules of thumb you can use for what is a "manageable size."

jaclyn-taroni commented 3 months ago

Closing in favor of #681.
