Reproducibility Brainstorm

gwaybio commented 5 years ago

@shntnu @jccaicedo @MarziehHaghighi

I have been thinking more about our recent discussion on GitHub reproducibility. I am wondering about different potential workflows, and I can think of a couple of setups that might help.

First Setup

The first setup is as I described on Friday:

All code and discussion (experiments, tasks, etc.) live in the repo
The code includes a numbered folder (e.g. 0.generate-profiles) that stores code, QC, and profile results.
The rest of the repository is structured into modules that represent other experiments and/or common themes (e.g. Correlation of the present dataset with alternatives for MOA discovery). Basically, any downstream analysis that uses the profiles generated in 0.generate-profiles lives in one of these modules.

Potential Alternative?

Perhaps a second setup could separate the processing code and downstream analysis into two distinct repositories. This setup could work well for a couple of reasons.

Can easily separate generating profiles from the profile analysis (we would add a link to the README in each repo cross-referencing the other).
This frees up the naming convention of the analysis repos (no more dates in the name 😉)
This could potentially aid in automation. Based on my (albeit limited) experience, it seems like a lot of the profile generation is relatively consistent, and the differences are mainly nuance. I wonder if we could set up something that would create a profiling template (much like the handbook), but that is ready to run once initiated. The workflow could be something like "New Repo" --> git clone --> profiling init (and then bash scripts would be auto-populated).
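To make the "profiling init" idea a bit more concrete, here is a minimal sketch of what such a step might do after a fresh git clone. Everything in it is an assumption for illustration: there is no existing `profiling init` command, and apart from the 0.generate-profiles folder name, the file names, the `run.sh` placeholder, and the `init_profiling_repo` function are hypothetical.

```python
from pathlib import Path

# Hypothetical template: only the 0.generate-profiles folder name comes from
# the discussion above; the file names and contents are placeholders.
TEMPLATE_FILES = {
    "README.md": (
        "# {project}\n\n"
        "Profile generation for {project}.\n"
        "Downstream analysis lives in: {analysis_repo}\n"
    ),
    "0.generate-profiles/README.md": "Code, QC, and profile results for {project}.\n",
    "0.generate-profiles/run.sh": (
        "#!/bin/bash\n"
        "set -euo pipefail\n"
        "# TODO: fill in the profiling steps for {project}\n"
    ),
}


def init_profiling_repo(root, project, analysis_repo):
    """Populate a freshly cloned repo with template files.

    Existing files are left untouched, so the step is safe to re-run.
    Returns the list of relative paths that were created.
    """
    root = Path(root)
    created = []
    for relative_path, template in TEMPLATE_FILES.items():
        target = root / relative_path
        if target.exists():
            continue  # never overwrite project-specific edits
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(template.format(project=project, analysis_repo=analysis_repo))
        created.append(relative_path)
    return created
```

Calling `init_profiling_repo(".", "my-project", "<link to analysis repo>")` from the clone would write the cross-referencing README and the placeholder bash script. Skipping files that already exist is a deliberate choice here, since every project will need its own edits on top of the template.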

Of course, every project is different, and individual decisions are required. (The same goes for whether to store the profiles in the actual repo, and for the public/private repo debate too!)

gwaybio commented 5 years ago

Also note that I am brainstorming this general idea in this specific repository relating to the STARR grant b/c it is open source

shntnu commented 5 years ago

Thanks for leading this discussion!

I like this idea because:

There are many different analyses one could do, given the same dataset.

So the profile-generation repo would capture everything that we do in the profiling handbook, and nothing else. Ideally, at some point, this repo would only contain the WDL workflow (or equivalent) used to process the data.

The automation question merits a separate discussion and is out of scope right now. Profile generation certainly is relatively consistent, so automation is possible, but it needs a lot more work to fully automate. Still, that's another reason to consider this option.

What's next? Do you want to try this out on this project? @gwaygenomics

gwaybio commented 5 years ago

There are many different analyses one could do, given the same dataset.

Yeah, definitely! Also, depending on the size of the profiles specifically, GitHub can handle data versioning. BBBC will store the raw images?

Ideally, at some point, this repo would only contain the WDL workflow (or equivalent) used to process the data.

Depending on the size of the data, I think it could also store processed profiles. Data versioning FTW 🎉
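A quick way to sanity-check the "depending on the size" part before committing profiles is sketched below. It assumes GitHub's documented per-file limits (a warning above 50 MiB, a hard block above 100 MiB for files not tracked with Git LFS); the `classify_profile_files` helper and its report keys are hypothetical names, not part of any existing tooling.

```python
from pathlib import Path

# Assumed thresholds, taken from GitHub's documented per-file limits
# for regular (non-LFS) files.
WARN_BYTES = 50 * 1024 * 1024
BLOCK_BYTES = 100 * 1024 * 1024


def classify_profile_files(profile_dir, warn_bytes=WARN_BYTES, block_bytes=BLOCK_BYTES):
    """Group every file under profile_dir by whether it is safe to commit.

    Returns a dict with three lists of paths (relative to profile_dir):
    "ok", "warn" (above warn_bytes), and "blocked" (above block_bytes).
    """
    profile_dir = Path(profile_dir)
    report = {"ok": [], "warn": [], "blocked": []}
    for path in sorted(profile_dir.rglob("*")):
        if not path.is_file():
            continue
        size = path.stat().st_size
        if size > block_bytes:
            key = "blocked"
        elif size > warn_bytes:
            key = "warn"
        else:
            key = "ok"
        report[key].append(path.relative_to(profile_dir).as_posix())
    return report
```

Anything that lands in "blocked" would need compression, splitting, or a different storage location before it could live in the repo alongside the processing code.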

What's next? Do you want to try this out on this project?

Yes, let's try it out! Currently, I don't think the profile processing lives here (do we know where it lives?), so it will be natural to use this strategy here.

Another thing to consider is whether the analysis should live in the broadinstitute org or the carpenterlab org. I am thinking carpenterlab since (presumably) we have more control over it and it gives the lab more visibility. (There is also a new and nifty transfer issue feature on GitHub, so an ownership transfer should be relatively painless.)

shntnu commented 5 years ago

BBBC will store the raw images?

Not sure yet; ideally IDR, but it isn't easy to access images directly there.

Yes, let's try it out! Currently, I don't think the profile processing lives here (do we know where it lives?), so it will be natural to use this strategy here.

Indeed, I don't see any profile processing notes; you'd need to check with Beth.

Another thing to consider is if the analysis should live

broadinstitute works well, especially for collaborative projects.

gwaybio commented 4 years ago

Note that I transferred this issue over from https://github.com/broadinstitute/profiling-resistance-mechanisms

This repo currently has the closest workflow to what is described above

broadinstitute / image-profiling-workflow-template

Reproducibility Brainstorm #4
