AlexsLemonade / OpenScPCA-analysis

An open, collaborative project to analyze data from the Single-cell Pediatric Cancer Atlas (ScPCA) Portal
Other
9 stars 17 forks source link

Regenerate notebooks for cell-type-wilms-tumor-06 #906

Open sjspielman opened 1 week ago

sjspielman commented 1 week ago

This issue is consolidating #875 as well as new todo items for cell-type-wilms-tumor-06. Currently, notebooks in this module are out-of-date.

Why are notebooks out of date?

Issue #875

This issue tracks updating the copyKAT and inferCNV exploratory notebooks in the cell-type-wilms-tumor-06 module, once code being developed towards #810 has been completed and merged into main.

See here for some additional context: https://github.com/AlexsLemonade/OpenScPCA-analysis/pull/873#issue-2649770493

I cannot fully render these notebooks (though I confirmed they run with test data!) due to compute limitations, so I see these as the best two ways to handle them:

  • We can actually potentially remove these notebooks from the repo and send them to results instead, since they are not part of the actual annotation pipeline. In this case, I would take those steps in the third PR.
    • We can probably ask Maud to regenerate these notebooks once we merge all this code into main. In this case, I would file an issue for followup.

New items

Additional notebooks are/will be out of date due to these currently-in-progress changes:

We'll want to re-run the module in full to regenerate notebooks, including all exploratory steps. It makes the most sense to wait until the next data release (currently planned for late November: https://github.com/AlexsLemonade/OpenScPCA-admin/issues/286) is out.

That said, these diffs make reviewing hard, because GitHub UI does not even want to tell how which/how many files changed. When committing these, we'll therefore either have to do it in chunks with several PRs, or the reviewer can use GitKraken to review instead. In my experience, GitKraken will give a full diff view that GitHub will not render.

allyhawkins commented 1 week ago

We can actually potentially remove these notebooks from the repo and send them to results instead, since they are not part of the actual annotation pipeline. In this case, I would take those steps in the third PR.

Are you referring to the content of the notebook folder? From what I can tell, the notebook folder only has HTML files which tells me they were rendered using template notebooks that live somewhere (presumably notebook_template?) I think when we have a collection of rendered notebooks like this that are a result of rendering a template notebook across multiple samples they should live in the results directory and not on Github. Unless there was a specific reason they were included here in the first place, I think I would favor the option of moving them all together.

sjspielman commented 1 week ago

We can actually potentially remove these notebooks from the repo and send them to results instead, since they are not part of the actual annotation pipeline. In this case, I would take those steps in the third PR.

This is specifically referring to the notebooks notebook_template/05_copykat_exploration.Rmd and notebook_template/06_cnv_infercnv_exploration.Rmd. These produce output for 5 samples each, stored in notebook/SCPCS000*** for the given sample. These are exploratory steps which do not directly contribute to annotation, and the inferCNV one in particular takes a lot of computational resources/time to regenerate.

I think when we have a collection of rendered notebooks like this that are a result of rendering a template notebook across multiple samples they should live in the results directory and not on Github.

In fact, this situation describes all notebooks in notebook_template, and all of their outputs get stored in notebook. I am not going to argue at all with sending their output to results instead...! Edit (pressed enter too soon): In this case, we'd update code to change output paths, and remove all those HTMLs from the repo.

allyhawkins commented 1 week ago

I think as long as we have a copy of the code that was used to generate the rendered copies of the notebook (as in the Rmd files that live in notebook_template and then the code used to actually do the rendering) then the actual HTML files do not need to live in Github and should be moved to results.

In this case, we'd update code to change output paths, and remove all those HTMLs from the repo.

This seems like the move to me. I would just update any paths to save to results instead of notebook.

sjspielman commented 1 week ago

I love this move!

jashapiro commented 1 week ago

I actually disagree here. I think having out-of-date notebooks in an exploratory folder is generally fine, as long as there is documentation to say they are out of date. They are a record of analysis that was done, and having them available in the repository seems important for outside transparency. I would expect items in the results folder to stay up to date, so I would not move them there.

I am not sure we need to rerun the notebooks with all exploratory steps at all? I would simply remove them from the workflow. If they really do need to be rerun, then we should be able to spin up enough compute to make it work, even if we have to do it outside Lightsail.

allyhawkins commented 6 days ago

I think having out-of-date notebooks in an exploratory folder is generally fine, as long as there is documentation to say they are out of date. They are a record of analysis that was done, and having them available in the repository seems important for outside transparency. I would expect items in the results folder to stay up to date, so I would not move them there.

I would agree that if they are actually exploratory analysis then they don't need to be up to date, but it looks like they are rendered using results from the workflow. At least that's what steps are listed in the README:

Screenshot 2024-11-21 at 9 26 45 AM

I see an argument for being exploratory analysis and results, but I will say in the Ewing module I don't store any copies of notebooks where we "explore" results across multiple samples in the actual repo. I only have the original copy and a rendered copy for one sample.

If we are going to label these as exploratory notebooks then I would agree they need to be removed from the workflow and don't need to be up to date if they are in the repo. But if that's the case then I would create a separate folder for exploratory_analysis or exploratory_notebook to make it clear which notebooks are which. I see at the bottom of the README for the notebook folder we have a section for Exploratory analysis, so I might just pull that into it's own folder.

sjspielman commented 6 days ago

but it looks like they are rendered using results from the workflow.

Yes, they are rendered with results from the workflow, but they are part of an optional step in the workflow that performs exploratory analyses that do not directly contribute to the results. There are a few other notebooks like this: 03 which explores clusterings, 04 which explores label transfer results, and 07 which does the annotation. At a minimum, we should be keeping 07 and its HTML output in the repo since it's doing the annotation.

Many notebooks in this module end up just inflating the repo size excessively without contributing much information, too:

jashapiro commented 6 days ago

Many notebooks in this module end up just inflating the repo size excessively without contributing much information, too

Unfortunately, it is too late now for the size question: the files are still there unless we go through and prune them from the whole history. I agree that the correct move might been to only include a representative set, and not all exploratory notebooks, but we can't really solve the size issue retrospectively.

sjspielman commented 2 days ago

Here's how I'm going to approach this: