AlexsLemonade / OpenScPCA-analysis

An open, collaborative project to analyze data from the Single-cell Pediatric Cancer Atlas (ScPCA) Portal

04_annotation_Across_Samples_exploration #776

Closed maud-p closed 1 month ago

maud-p commented 1 month ago

Purpose/implementation Section

Please link to the GitHub issue that this pull request addresses.

https://github.com/AlexsLemonade/OpenScPCA-analysis/issues/774

This PR follows the discussion in PR #750, especially: https://github.com/AlexsLemonade/OpenScPCA-analysis/pull/750#pullrequestreview-2310191830

What is the goal of this pull request?

Briefly describe the general approach you took to achieve this goal.

In this PR, I am adding one notebook in /notebook/04_annotation_Across_Samples_exploration.Rmd to explore the annotations and label transfers for all of the samples in SCPCP000006.

We integrated all the samples from SCPCP000006 to get a rapid, global view of the label transfer. Please note that the integration is not the aim of this PR; it is just a way to better display genes and features.

To explore the label transfer results, we look at some marker genes, and at tables and percentages of cells in each annotation group (from label transfer).

If known, do you anticipate filing additional pull requests to complete this analysis module?

Yes, the next step would be to run copyKAT.

Results

This PR does not contain any results, only a single notebook.

What types of results does your code produce (e.g., table, figure)?

One notebook that explores clustering and label transfer results for all samples at once.

What is your summary of the results?

Provide directions for reviewers

What are the software and computational requirements needed to be able to run the code in this PR?

I updated the renv.lock file; otherwise there are no specific changes since the last PR :)

Are there particular areas you'd like reviewers to have a close look at?

Is there anything that you want to discuss further?

What about the next step: how do you think we should run copyKAT/inferCNV? I think I should start with copyKAT, running it with and without a reference of normal cells, and evaluate the impact of the normal-cell annotation on the inferred CNVs.

Author checklists

Analysis module and review

Reproducibility checklist

maud-p commented 1 month ago

Hi @sjspielman, thank you for the rapid feedback 🏎️! I just re-uploaded the results to the S3 bucket; there might have been a bug during the transfer to S3, because my local results directory does contain the 3 rds objects per sample. I'll plot the different tables and come back to you. Thanks for the few lines of code to extract the metadata :)

maud-p commented 1 month ago

@sjspielman I modified the notebook based on your suggestion. It is quite simple now, but let me know if you think of additional plots and checks that would be useful :)

Thank you again for your help!

maud-p commented 1 month ago

Dear @sjspielman, thank you so much for the review and the suggestions, including the code! It was really useful, as usual!

I should have made all suggested changes :D

One additional table that I would like to discuss is the summary, for each compartment, of the percentage of cells that match a kidney annotation or not. [image]

The majority of fetal nephron cells (92%) have been predicted as kidney. However, the other compartments (stroma, endothelial and immune) do not really match kidney cells. In my opinion, this is not a concern and shouldn't be interpreted as a poor label transfer or annotation! I'll try to argue a bit why.
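A per-compartment summary like the one described above boils down to counting, within each compartment, how many cells received a kidney annotation. The sketch below is only an illustration in Python with toy data; the notebook itself is an R Markdown file, and the function name, the input format, and the counts are all assumptions rather than the notebook's actual code or results.

```python
from collections import Counter, defaultdict


def kidney_match_percentages(cells):
    """Return {compartment: percent of cells annotated as kidney}.

    `cells` is an iterable of (compartment, is_kidney) pairs, one per cell.
    This input format is hypothetical, chosen only for the illustration.
    """
    counts = defaultdict(Counter)
    for compartment, is_kidney in cells:
        counts[compartment][bool(is_kidney)] += 1

    summary = {}
    for compartment, tally in counts.items():
        total = tally[True] + tally[False]
        summary[compartment] = round(100 * tally[True] / total, 1)
    return summary


# Toy data, NOT the real SCPCP000006 counts.
toy_cells = (
    [("fetal nephron", True)] * 9
    + [("fetal nephron", False)] * 1
    + [("stroma", True)] * 1
    + [("stroma", False)] * 3
)
toy_summary = kidney_match_percentages(toy_cells)
print(toy_summary)
```

Reporting the percentage per compartment, rather than over all cells, keeps a large compartment from masking a small one.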

Thank you again for your help!

sjspielman commented 1 month ago

Ah, one more thing I forgot!

We should add rendering this notebook to the 00_run_workflow.R script. This step should be after/outside the for loop (since we only knit this once, not for each sample), and only run if we are not testing (since the input files for this notebook aren't generated in testing).
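The ordering described above (per-sample steps inside the loop, then a single render step outside it that is skipped when testing) can be sketched as follows. The real workflow is the R script 00_run_workflow.R, where the render step would typically be a call such as `rmarkdown::render()`; this Python version only illustrates the control flow, and the function and step names are hypothetical.

```python
def plan_workflow(sample_ids, testing):
    """Return the ordered list of steps the workflow would run.

    Step names are placeholders for illustration, not the actual
    contents of 00_run_workflow.R.
    """
    steps = []

    # Per-sample steps run once for each sample, inside the loop.
    for sample_id in sample_ids:
        steps.append(("per_sample_steps", sample_id))

    # The across-samples notebook is rendered once, after the loop,
    # and only when not testing, because its input files are not
    # generated in testing.
    if not testing:
        steps.append(
            ("render_notebook", "04_annotation_Across_Samples_exploration.Rmd")
        )
    return steps


full_run = plan_workflow(["sample_A", "sample_B"], testing=False)
test_run = plan_workflow(["sample_A", "sample_B"], testing=True)
```

Here `full_run` ends with the single render step, while `test_run` contains only the per-sample steps.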

maud-p commented 1 month ago

I ended up selecting the following samples:

I tried to enrich for samples with >100 normal cells, without decreasing the prediction of kidney cells too much. Would this be OK?

Thank you again for your help!!

maud-p commented 1 month ago

Great, thank you very much!

Thank you very much for letting me know about your days off. Then I might take the time to compare a few different methods (copyKAT +/- reference, inferCNV +/- reference) on these few samples!

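For the workflow, did I understand correctly that it would be best to have:

  • R scripts for steps that are building the final object
  • notebooks for reports and results explorations?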

Should I try in another PR to split the first 3 notebooks into scripts + notebook?

Thank you!

sjspielman commented 1 month ago

For the workflow, did I understand correctly that it would be best to have:

  • R scripts for steps that are building the final object
  • notebooks for reports and results explorations?

Yes, this is the idea. Notebooks are generally the better option for steps that are exploratory or interactive in some way - e.g. making tables and plots. Scripts are often the better option for running an analysis that you plan to explore in a notebook. For example, in the doublet-detection module I wrote, you can see that (for initial benchmarking steps) I used a script to detect doublets, and then a notebook to explore the doublet results. The parallel here would be to run copyKAT in a script which would save the copyKAT results as TSV files (this output doesn't need to be the whole Seurat object - we can save a little storage space with TSV instead!), and then explore those results in a notebook, which might also be a template notebook that looks at one sample at a time with params.

That said, this is not a strict rule - you can still use a notebook to run copyKAT if you feel more comfortable with that approach!
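The script-then-notebook split described above amounts to one step that writes a small per-cell table as TSV and a later step that reads it back for exploration. The sketch below shows that hand-off in Python with the standard `csv` module. The module itself is written in R, and the column names and toy rows here are assumptions for illustration, not copyKAT's actual output format.

```python
import csv
import io


def write_results_tsv(rows, handle):
    """'Script' side: save per-cell results as tab-separated text."""
    fieldnames = ["barcode", "prediction"]  # hypothetical columns
    writer = csv.DictWriter(
        handle, fieldnames=fieldnames, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)


def read_results_tsv(handle):
    """'Notebook' side: load the saved table back for exploration."""
    return list(csv.DictReader(handle, delimiter="\t"))


# Toy rows, not real results.
toy_rows = [
    {"barcode": "cell_1", "prediction": "aneuploid"},
    {"barcode": "cell_2", "prediction": "diploid"},
]

# An in-memory buffer stands in for a results file on disk.
buffer = io.StringIO()
write_results_tsv(toy_rows, buffer)
buffer.seek(0)
loaded = read_results_tsv(buffer)
```

Saving only the columns the notebook needs, rather than the whole object, is what keeps the stored output small.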

sjspielman commented 1 month ago

Should I try in another PR to split the first 3 notebooks into scripts + notebook?

Don't worry about this at all! The code you have written so far is completely fine. Again, not a strict rule :)