Open sjspielman opened 1 week ago
We can actually potentially remove these notebooks from the repo and send them to results instead, since they are not part of the actual annotation pipeline. In this case, I would take those steps in the third PR.
Are you referring to the content of the notebook
folder? From what I can tell, the notebook
folder only has HTML files which tells me they were rendered using template notebooks that live somewhere (presumably notebook_template
?) I think when we have a collection of rendered notebooks like this that are a result of rendering a template notebook across multiple samples they should live in the results
directory and not on Github. Unless there was a specific reason they were included here in the first place, I think I would favor the option of moving them all together.
We can actually potentially remove these notebooks from the repo and send them to results instead, since they are not part of the actual annotation pipeline. In this case, I would take those steps in the third PR.
This is specifically referring to the notebooks notebook_template/05_copykat_exploration.Rmd
and notebook_template/06_cnv_infercnv_exploration.Rmd
. These produce output for 5 samples each, stored in notebook/SCPCS000***
for the given sample. These are exploratory steps which do not directly contribute to annotation, and the inferCNV one in particular takes a lot of computational resources/time to regenerate.
I think when we have a collection of rendered notebooks like this that are a result of rendering a template notebook across multiple samples they should live in the results directory and not on Github.
In fact, this situation describes all notebooks in notebook_template
, and all of their outputs get stored in notebook
. I am not going to argue at all with sending their output to results
instead...! Edit (pressed enter too soon): In this case, we'd update code to change output paths, and remove all those HTMLs from the repo.
I think as long as we have a copy of the code that was used to generate the rendered copies of the notebook (as in the Rmd files that live in notebook_template
and then the code used to actually do the rendering) then the actual HTML files do not need to live in Github and should be moved to results.
In this case, we'd update code to change output paths, and remove all those HTMLs from the repo.
This seems like the move to me. I would just update any paths to save to results
instead of notebook
.
I love this move!
I actually disagree here. I think having out-of-date notebooks in an exploratory folder is generally fine, as long as there is documentation to say they are out of date. They are a record of analysis that was done, and having them available in the repository seems important for outside transparency. I would expect items in the results
folder to stay up to date, so I would not move them there.
I am not sure we need to rerun the notebooks with all exploratory steps at all? I would simply remove them from the workflow. If they really do need to be rerun, then we should be able to spin up enough compute to make it work, even if we have to do it outside Lightsail.
I think having out-of-date notebooks in an exploratory folder is generally fine, as long as there is documentation to say they are out of date. They are a record of analysis that was done, and having them available in the repository seems important for outside transparency. I would expect items in the results folder to stay up to date, so I would not move them there.
I would agree that if they are actually exploratory analysis then they don't need to be up to date, but it looks like they are rendered using results from the workflow. At least that's what steps are listed in the README:
I see an argument for being exploratory analysis and results, but I will say in the Ewing module I don't store any copies of notebooks where we "explore" results across multiple samples in the actual repo. I only have the original copy and a rendered copy for one sample.
If we are going to label these as exploratory notebooks then I would agree they need to be removed from the workflow and don't need to be up to date if they are in the repo. But if that's the case then I would create a separate folder for exploratory_analysis
or exploratory_notebook
to make it clear which notebooks are which. I see at the bottom of the README for the notebook
folder we have a section for Exploratory analysis, so I might just pull that into it's own folder.
but it looks like they are rendered using results from the workflow.
Yes, they are rendered with results from the workflow, but they are part of an optional step in the workflow that performs exploratory analyses that do not directly contribute to the results. There are a few other notebooks like this: 03 which explores clusterings, 04 which explores label transfer results, and 07 which does the annotation. At a minimum, we should be keeping 07 and its HTML output in the repo since it's doing the annotation.
Many notebooks in this module end up just inflating the repo size excessively without contributing much information, too:
Many notebooks in this module end up just inflating the repo size excessively without contributing much information, too
Unfortunately, it is too late now for the size question: the files are still there unless we go through and prune them from the whole history. I agree that the correct move might been to only include a representative set, and not all exploratory notebooks, but we can't really solve the size issue retrospectively.
Here's how I'm going to approach this:
This issue is consolidating #875 as well as new todo items for
cell-type-wilms-tumor-06
. Currently, notebooks in this module are out-of-date.Why are notebooks out of date?
Issue #875
New items
Additional notebooks are/will be out of date due to these currently-in-progress changes:
03_clustering_exploration.Rmd
00b_characterize_fetal_kidney_reference_Stewart.Rmd
,02a_label-transfer_fetal_full_reference_Cao.Rmd
, and02b_label-transfer_fetal_kidney_reference_Stewart.Rmd
Agenda
We'll want to re-run the module in full to regenerate notebooks, including all exploratory steps. It makes the most sense to wait until the next data release (currently planned for late November: https://github.com/AlexsLemonade/OpenScPCA-admin/issues/286) is out.
That said, these diffs make reviewing hard, because GitHub UI does not even want to tell how which/how many files changed. When committing these, we'll therefore either have to do it in chunks with several PRs, or the reviewer can use GitKraken to review instead. In my experience, GitKraken will give a full diff view that GitHub will not render.