RMI-PACTA / practices

https://rmi-pacta.github.io/practices/

The `workflow` and `package` paradigm #2

Open jdhoffa opened 1 year ago

jdhoffa commented 1 year ago

Packages are used to keep track of modular code that we intend to re-use multiple times, and want to have unit-tested and portable.

A template PACTA R package repository can be found here: https://github.com/RMI-PACTA/pacta.r.package

Workflows are much less structured, and are used to keep track of any kind of data analysis or data science type pipelines.

A template PACTA workflow repository can be found here (with some useful guidance on how to conduct reproducible data science): https://github.com/RMI-PACTA/workflow.template.pacta

I made both of these template repos with little to NO input from anyone, and they include all of my bull-headed assumptions.

It would be cool to align on and document our ideas of useful practices for either type of repo.

AlexAxthelm commented 1 year ago

is the general idea that most of our repos would fall into one of these categories?

jdhoffa commented 1 year ago

It is! Or at least, most repos that are directly or tangentially related to "data analysis" and "tools built on data analysis"

As usual, we should not feel restricted to only ever using these categories, or try to force things into them when they don't fit, but rather utilize them as a common basis for the styles of tasks that we find ourselves repeating.

jdhoffa commented 7 months ago

Consider adding a build.* type repository. FYI. @cjyetman

AlexAxthelm commented 7 months ago

What would be in the build.* repos?

cjyetman commented 7 months ago

my thoughts on this were:

- pacta.* - primary output is an R package
- workflow.* - primary output is files, plots, data, etc. (irrelevant if an ephemeral Docker container is used to generate that output)
- build.* - primary output is a Docker image (obviously, that Docker image will be used somewhere else at some time by someone, probably to output some other files, but that's not the primary goal of the repo)

edit: build.* could probably also contain anything that builds a self-contained app or software

AlexAxthelm commented 7 months ago

At first thought, this seems like an anti-pattern, since any build steps should live close to the code they're building (in-repo, implemented by a CI/CD pipeline). The workflow.* repos seem to be the right place to collect multiple dependent repos into a "thing that does something" which would be built in-repo (https://github.com/RMI-PACTA/workflow.factset being an example of that), but I may be missing a use case.

jdhoffa commented 7 months ago

Tech Review and/ or Dev Hangout topic.

I had a similar "I'm not sure about this" feeling initially, but CJ and I discussed it and I tend to keep wobbling back and forth on the topic to be honest. Would be good to have a focused discussion on it I think.

cjyetman commented 7 months ago

I'd like to align on this soon because the two "workflows" I maintain, workflow.transition.monitor and workflow.data.preparation, are very different animals with very different goals.

jdhoffa commented 7 months ago

I would suggest that we follow a similar process to how we collectively brainstormed this document: https://github.com/RMI-PACTA/practices/pull/12

That is:

jdhoffa commented 7 months ago

If we are happy with this approach, and we don't have a topic for this week's Dev Hangouts, I would be happy to lead it already?

I was waffling over whether this should be a Dev Hangout thing or a Tech Review thing, but it feels more "Dev Hangouts" oriented, since it's more about process and not really a technical implementation decision/ doesn't really touch actual code yet.

jdhoffa commented 7 months ago

Oh jeez how am I ever going to summarize that DevHangouts discussion XD

jdhoffa commented 7 months ago

Well here goes:

First Draft

Prefixes for Repositories

There seemed to be some agreement that using the pacta.* prefix for R packages that generally export functions or data-frames and have final outputs that are generally data-frames and/ or plots was useful. Likely pacta.* is not the BEST prefix; perhaps something like rpkg or rmipkg would be more useful. Actual prefix TBD.

However, there was much less agreement on what the workflow.* prefix defines. Broadly, it seems that they should deal more with file reading, I/O, and processes, calling package functions with various parameters. The question of whether they should/ shouldn't include a Dockerfile, a build/ directory, or other build steps (or whether this should be abstracted away) was hotly debated and not agreed on.

Clarity of Purpose

@cjyetman finds that the prefix system adds clarity to the purpose and conditions for creating repositories. It helps him more clearly understand and decide when it is a good time to split off a new repo.

Implementation Across the Team

The system of prefixes has started spreading across the team, with Jacob and others creating workflow repos. Especially as the team builds more products, it can be useful to have a prefix to quickly assess what the goals of the repository may be.

build.* Repos

@cjyetman expressed a desire for a new prefix, build, to differentiate between repos that output a collection of files and those that primarily create Docker images.

Prefix Definitions and Implications

There's debate on what the prefix implies, like file restrictions, language use, and what can or cannot be included in a repo. No total agreement here yet.

Maintainership and Scope

@cjyetman (and I imagine also @MonikaFu and @jacobvjk) are more comfortable working on R code and maintaining R packages without needing to focus on Docker or other environments. He prefers to develop in RStudio, which has implications for the repositories he maintains. The primary implication is that they MUST preserve an RStudio-centric API (this does not necessarily preclude building an additional API on top, e.g. through a Dockerfile, but crucially the RStudio API must be supported).
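
For illustration, a minimal sketch of what "preserving an RStudio-centric API" could look like (function names and file paths are hypothetical): the analysis is exposed as a plain R function that can be called interactively, and any Docker entrypoint is just a thin wrapper around that same function.

```r
# Hypothetical pacta.*-style function: the RStudio-facing API.
# A developer can call this interactively, with no Docker involved.
run_analysis <- function(portfolio, scenario = "example_scenario") {
  # ... methodology would live here; placeholder returns the input unchanged ...
  portfolio
}

# A Docker (or CI) entrypoint can then be a thin wrapper that only handles
# file I/O, so the RStudio-centric API is never bypassed:
# portfolio <- readr::read_csv("/mnt/input/portfolio.csv")
# results   <- run_analysis(portfolio, scenario = Sys.getenv("SCENARIO"))
# readr::write_csv(results, "/mnt/output/results.csv")
```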

Constraints of Docker/VM Implementation

There are challenges in maintaining workflows that are compatible with both local environments and Docker/VM environments. The main challenge is that the people working on the repo must commit to doing it! This relates to #12, but I think it is clear that the key developer responsible for deciding which API is supported is the maintainer.

Collaboration and Decision-Making

The team discussed the need for collaboration when making changes to ensure they don't disrupt other users' workflows or the integrity of the process, especially in the context of supported primary ways of engaging with a repo (be it through RStudio, Docker/ Azure, or either/ both). @jdhoffa comments here: I think the main API of the repo should be decided, wherever possible, by the user or use-case and not the developer. Our goal is to build useful products, not necessarily to build products that we like developing (but whenever possible, we should try to achieve both).

Methodological Decisions in Workflows

Workflow repos may contain significant methodological decisions that impact how data is processed or handled. However, this isn't ideal. R packages have useful documentation systems built in, which makes them an ideal place for transparently documenting methodology. Wherever possible, key decisions should be extracted FROM a workflow.* repository TO a pacta.* repository (note: final prefix names still TBD).
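
As a sketch of what that extraction could look like (function, column, and parameter names are hypothetical), a threshold that might otherwise be hard-coded in a workflow script becomes a named, documented, unit-testable package function:

```r
# In a workflow.* script, a methodological choice can hide as a magic number:
# filtered <- companies[companies$production_share > 0.05, ]

# Extracted into a pacta.* package, the same choice is named, documented via
# roxygen2, and can be covered by unit tests:

#' Filter out companies below a production-share threshold
#'
#' @param companies A data frame with a `production_share` column.
#' @param threshold Minimum production share to keep. The default of 0.05 is
#'   itself a methodological decision, and is documented here.
#' @return The filtered data frame.
#' @export
filter_minor_producers <- function(companies, threshold = 0.05) {
  companies[companies$production_share > threshold, , drop = FALSE]
}
```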

Operational Reality

The discussion recognizes the operational reality that, given the local resources available to most developers, some workflows cannot feasibly be run locally due to resource constraints and must be run on virtual machines or Azure (AKA @cjyetman has a lot of RAM).

Desired Structure for Workflow Repos

Ideally, workflow repos should handle inputs and outputs and orchestrate steps, calling on R package functions rather than embedding methodological decision-making.
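
A hedged sketch of that shape (package and function names are hypothetical): the workflow script owns the paths and file formats, and every analytical step is a call into a pacta.* package.

```r
# Hypothetical workflow.* run script: read inputs, orchestrate package
# functions, write outputs. No methodology is defined here.
library(pacta.example.analysis)  # hypothetical pacta.* package

portfolio <- readr::read_csv("input/portfolio.csv")

results <- portfolio |>
  prepare_portfolio() |>                       # pacta.* function
  run_analysis(scenario = "example_scenario")  # pacta.* function

readr::write_csv(results, "output/results.csv")
```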

Role of Docker Images

A debated topic

@RMI-PACTA/developers no urgency to this at all, but please review this comment when you get a chance, see if I've misrepresented anything, and comment as you see fit.

cjyetman commented 7 months ago

> The discussion recognizes the operational reality that some workflows cannot be feasibly run locally due to resource constraints and must be run on virtual machines or Azure (for some developers)

I think this unnecessarily implies that "running locally" somehow imposes resource constraints that don't otherwise exist, which seems to imply that committing to the ability to run locally is a unique burden, which I don't think is true. I remember a time when we didn't have VMs with adequate memory to run data.prep, which iirc eventually led to the dataprepbigmem VM for instance.

jdhoffa commented 7 months ago

> > The discussion recognizes the operational reality that some workflows cannot be feasibly run locally due to resource constraints and must be run on virtual machines or Azure (for some developers)
>
> I think this unnecessarily implies that "running locally" somehow imposes resource constraints that don't otherwise exist, which seems to imply that committing to the ability to run locally is a unique burden, which I don't think is true. I remember a time when we didn't have VMs with adequate memory to run data.prep, which iirc eventually led to the dataprepbigmem VM for instance.

Comment updated for clarity

cjyetman commented 7 months ago

One important note about prefixes for R package repo... while it's (probably) technically possible to have a repo name that is different than the R package's release name, I think that's a complication we should try to avoid, which means choosing the repo name has consequences on what the final R package's name is... so we may need to carve out a prefix for R packages just so that we can maintain consistency in the R package names that get released to CRAN, R-universe, or whatever.

cjyetman commented 7 months ago

> Ideally, workflow repos should handle inputs and outputs and orchestrate steps, calling on R package functions rather than embedding methodological decision-making.

I think this is mildly dicey... sometimes the input or parameters chosen themselves are methodological choices, e.g. a function that sets a parameterized threshold of when to include or exclude something. Not sure how to get around that.

jdhoffa commented 7 months ago

@cjyetman has brought up an important social point: at times, he has felt forced to change something fundamental about one of his workflow.* repos because someone (likely me) had strong opinions about it, especially regarding the build process. Often those changes came from strong opinions but weren't actually "required" or "urgent". This makes our supposed leniency a little ambiguous, because it can make it feel necessary to develop differently or to abandon an API that people are more comfortable with.

Personally, I @jdhoffa recognize and take responsibility for my role in that. I attribute it to two main things:

I think that defining repo maintainers and tech reviews can, to some extent, help with the "symptoms" of the above, but solving the actual foundation of the problem requires not just processes, but also improving the social atmosphere. In any case, I think it's learned behavior that can be unlearned if we all put in an effort, and something we can try to codify in this repo.

Relates to:

#4 and #5

AlexAxthelm commented 7 months ago

> pacta.* prefix for R packages that generally export functions or data-frames and have final outputs that are generally data-frames and/ or plots was useful

I think a simple way to phrase this is "R objects in, R objects out." There is space for packages that are meant to deal with a specific class/type of file (pacta.portfolio.import comes to mind), but generally I'm thinking that this:

> Broadly, it seems that [workflow.*] should deal more with file reading, I/O

is not what they should be worrying about.
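
A minimal sketch of that contract (function and column names are hypothetical): the function receives a data frame and returns a data frame, and never touches the filesystem.

```r
# "R objects in, R objects out": no file reading or writing inside the package.
summarise_exposure <- function(results) {
  stats::aggregate(exposure ~ sector, data = results, FUN = sum)
}
```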

jdhoffa commented 7 months ago

Thanks @AlexAxthelm, indeed that is a very concise description! I like it.

cjyetman commented 7 months ago

In-depth, relevant discussion was had here: https://github.com/RMI-PACTA/workflow.transition.monitor/issues/135

AlexAxthelm commented 7 months ago

From @cjyetman in that thread: https://github.com/RMI-PACTA/workflow.transition.monitor/issues/135#issuecomment-1502626835

> An additional characteristic that is becoming more clear to me now is...
>
> A "workflow" repo may create an ephemeral Docker instance in order to complete a task.
>
> A "build" repo should create a Docker image that's intended to be used in the future, and possibly in different contexts and/or with different parameters.
>
> An ephemeral Docker instance has less security concerns because it's only ever intended to run once on the machine it was built. A "workflow" repo is for a user that maybe wants to experiment with parameters, but it essentially defines a specific "workflow" for doing a specific task, and therefore it could hypothetically be used in a "build" repo which either defines the parameters (while the "workflow" defines the process) or defines an interface to specify the parameters.

I see this definition of an ephemeral Docker instance as missing the point of docker-based development. But I love seeing the distinction that a workflow can be run locally for development/experimentation.

The thing I see missing in the Docker part of the comment is a distinction between images and containers. Images (the template, defined in the Dockerfile) are not supposed to be ephemeral, or designed to run on only one machine. Containers (images that are actually being executed) are ephemeral (and are expected to be pruned frequently). The biggest value of Docker-based development is a level of certainty that executing the same image on different machines (regardless of the host's environment), with access to the same explicit properties (volume mounts, .env files, etc.), produces the same results; so if we're designing images to be run on a single machine, there's no advantage to encapsulating it, and we could just run on bare metal.


Overall the 2 traits of Docker I see our team as getting value from are:

  1. Distribution: being able to define an execution environment for PACTA, and then run that on "not our machines" (CTM, Azure runners, etc.). Containers can run and produce the same results.
  2. "Freezing our code in amber 🦟 ": Each Docker image we produce encapsulates an entire environment (ignoring access to data) that allows us to reconstruct (and diagnose) a given analysis in the future
cjyetman commented 7 months ago

I'm using "instance" and "container" interchangeably, possibly incorrectly.

So when I say ephemeral Docker instance, both "ephemeral" and "Docker" are being used as adjectives to describe "instance", or possibly in more appropriate language "container".

jdhoffa commented 7 months ago

got it.

jdhoffa commented 7 months ago

I have some further thinking on this, as per a call that I just had with @AlexAxthelm. None of it is ground-breaking, just continually refining our thinking here. (All of the below is under the assumption that pacta.* and workflow.* may not be the optimal naming convention, but I will use it since that is what we currently have.)

In an ideal world

pacta.* repositories

would ALWAYS be R packages. These R packages would only include functions that usually handle R objects, such as:

these packages would VERY RARELY (read: never unless there is an excellent reason to do so) include functions called for their side effects (including file I/O) etc.

workflow.* repositories

These would handle all the rest, including:

On dockerfiles in more detail

I agree with @AlexAxthelm that there should be no distinction between a build and workflow repository. We should rather strive for ALL Docker images to be built by whatever workflow.* repo needs them. The goal should always be that the image is re-usable (e.g. it should probably only really change if one of the pacta.* packages is updated) and that the inputs (functional arguments, file I/O, etc.) are defined by a config file and volume mount at run-time.
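
A sketch of what that could look like from the image's side (file layout, config fields, and package names are hypothetical): the entrypoint script baked into the image reads everything run-specific from a config file and data directories mounted at run-time, so changing parameters never requires rebuilding the image.

```r
# Hypothetical entrypoint script inside the image. All run-specific inputs
# come from a config file and data directories mounted at run-time.
config <- yaml::read_yaml("/mnt/config/config.yml")

portfolio <- readr::read_csv(
  file.path("/mnt/input", config$portfolio_file)
)

results <- pacta.example.analysis::run_analysis(  # hypothetical pacta.* package
  portfolio,
  scenario = config$scenario
)

readr::write_csv(results, "/mnt/output/results.csv")
```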

Relating to our maintainer discussion, the handling of HOW the config, file I/O, etc. is served should be the responsibility of the maintainer of that repo.