MAAP-Project / Community

Issue for MAAP (Zenhub)

Proposal: Relationship between Workspaces and Algorithms #522

Open gchang opened 2 years ago

gchang commented 2 years ago

Definitions:

Base Image: The building blocks of the MAAP ADE and DPS. These are minimal Docker images, persisted within MAAP, that may contain discipline-specific libraries and tools.

ADE Image: A Docker image that consists of the Base Image + UI. The most common UI is JupyterLab.

ADE Workspace: The running instantiation of the ADE Image, accessible via web browser.

Algorithm: A scientific transformation of data.

Algorithm Source Code: The implementation of the algorithm as a script, executable, or notebook.

Algorithm Registration: The record within the DPS defining how the Algorithm Source Code is run.

Algorithm or Runtime Image: The Base Image + Algorithm Source Code.

Algorithm or Runtime Container: The running instantiation of the Algorithm or Runtime Image.

DPS Job: A single instance of an Algorithm or Runtime Container with specific inputs and outputs.

Algorithm source code is stored as Git repositories within MAAP. The current design of the DPS allows only one algorithm executable per Git repo, because the DPS uses the name of the Git repository as the identifier for the algorithm. As MAAP usage grows, this assumption is no longer safe: scientists now keep multiple executables in a single Git repo.

Scientists may have multiple steps in their processing workflow, i.e. steps 1, 2, 3, etc. Each step has its own executables and possibly its own set of dependencies. One of the biggest problems they've encountered arises when these separate steps are registered within DPS. When the step 1 algorithm gets registered, an Algorithm Image is created with the Git repo's name. When the step 2 algorithm gets registered, another Algorithm Image is created. Because it comes from the same Git repo, it uses the same identifier and thus overwrites the Algorithm Image for step 1. When the scientist wants to run step 1 again, the DPS uses the image created for step 2, and the job fails.
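The overwrite described above can be reproduced with a small sketch. This is not DPS code: the dictionary stands in for the image store, and `register_algorithm`, `run_algorithm`, the repo name `my-workflow`, and the script names are all hypothetical, chosen only to illustrate what happens when the repo name is the sole key.

```python
# Hypothetical stand-in for the DPS image store; all names are illustrative only.

def register_algorithm(images, repo_name, executable):
    """Store the built image under the repo name alone, as the issue describes."""
    images[repo_name] = {"entrypoint": executable}
    return repo_name


def run_algorithm(images, algorithm_id, executable):
    """Run the executable, failing if the stored image was built for another one."""
    image = images[algorithm_id]
    if image["entrypoint"] != executable:
        raise RuntimeError(
            f"image '{algorithm_id}' was built for {image['entrypoint']}, "
            f"not {executable}"
        )
    return f"ran {executable}"


images = {}
register_algorithm(images, "my-workflow", "step1.sh")
register_algorithm(images, "my-workflow", "step2.sh")  # replaces the step 1 image

try:
    run_algorithm(images, "my-workflow", "step1.sh")
except RuntimeError as err:
    print(err)  # step 1 can no longer run against its own image
```

Both registrations map to the same key, so only the most recent image survives.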

Possible Solutions:

  1. Separate out each step as its own Git repo
    This is not ideal because scientists consider the entire workflow of multiple steps as a cohesive unit. When publishing papers, they don't want to refer people to multiple Git repos.
  2. Create a separate Algorithm Image for each executable within a Git repo
  3. Create a single Algorithm Image but have separate Algorithm Registrations
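Options 2 and 3 differ mainly in what is keyed per executable. The sketch below contrasts them under stated assumptions: the `Registry` class, both `register_*` functions, and the `repo:executable` identifier format are hypothetical and are not part of the DPS.

```python
from dataclasses import dataclass, field


@dataclass
class Registry:
    """Hypothetical stand-in for the DPS image store and registration records."""
    images: dict = field(default_factory=dict)
    registrations: dict = field(default_factory=dict)


def register_image_per_executable(reg, repo_name, executable):
    """Option 2: every executable gets its own Algorithm Image and registration."""
    algorithm_id = f"{repo_name}:{executable}"  # assumed identifier format
    reg.images[algorithm_id] = {"repo": repo_name, "entrypoint": executable}
    reg.registrations[algorithm_id] = {"image": algorithm_id, "run": executable}
    return algorithm_id


def register_shared_image(reg, repo_name, executable):
    """Option 3: one Algorithm Image per repo, one registration per executable."""
    reg.images.setdefault(repo_name, {"repo": repo_name})
    algorithm_id = f"{repo_name}:{executable}"  # assumed identifier format
    reg.registrations[algorithm_id] = {"image": repo_name, "run": executable}
    return algorithm_id


option2 = Registry()
option3 = Registry()
for script in ("step1.sh", "step2.sh"):
    register_image_per_executable(option2, "my-workflow", script)
    register_shared_image(option3, "my-workflow", script)

print(len(option2.images), len(option2.registrations))
print(len(option3.images), len(option3.registrations))
```

In this sketch, option 2 isolates each step's dependencies at the cost of more images to build and store, while option 3 builds once but requires the single image to carry the dependencies of every step.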

Other considerations:

Collisions:

  1. Authorship Collisions
    How would we handle authorship collisions, especially when users are collaborating on an algorithm in the same workspace?

  2. Identifier Collisions
    What happens when a user forks the Git repo, which would allow it to keep the same repository name? If users are allowed to provide their own identifier, how do we prevent collisions?
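One possible way to approach the identifier questions above is to namespace default identifiers by owner and to check user-provided identifiers before accepting them. The sketch below is only an illustration: `default_identifier`, `claim_identifier`, `IdentifierTaken`, the `owner/repo:name` format, and the user and repo names are all hypothetical, and it does not address authorship for collaborators sharing a workspace.

```python
class IdentifierTaken(Exception):
    """Raised when a different owner already holds the requested identifier."""


def default_identifier(owner, repo_name, algorithm_name):
    # Assumed format: the owner prefix keeps a fork with the same repo name distinct.
    return f"{owner}/{repo_name}:{algorithm_name}"


def claim_identifier(owners_by_id, owner, identifier):
    """Record the identifier for this owner; reject it if someone else holds it."""
    holder = owners_by_id.setdefault(identifier, owner)
    if holder != owner:
        raise IdentifierTaken(f"'{identifier}' is already registered by {holder}")
    return identifier


owners_by_id = {}

# A fork keeps the repo name, but the owner namespace separates the identifiers.
original = claim_identifier(
    owners_by_id, "alice", default_identifier("alice", "biomass", "step1")
)
fork = claim_identifier(
    owners_by_id, "bob", default_identifier("bob", "biomass", "step1")
)

# A free-form, user-provided identifier is checked before it is accepted.
claim_identifier(owners_by_id, "alice", "biomass-step1")
try:
    claim_identifier(owners_by_id, "bob", "biomass-step1")
except IdentifierTaken as err:
    print(err)
```

Re-registering an identifier by the same owner is allowed here, which matches the case of a scientist updating their own algorithm.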

gchang commented 2 years ago

Learn from ECOSML team(?)