dchaley / deepcell-imaging

Tools & guidance to scale DeepCell imaging on Google Cloud Batch

Add parameter for batch job configuration #284

Closed dchaley closed 3 months ago

dchaley commented 3 months ago

Runtime environments (i.e. Google Cloud projects) have a lot of configuration that "never" changes, no matter what type of job is being run.

For example: network settings, service accounts, allowed regions/zones…

We need to be able to easily configure these without requiring users to manually edit the base job JSON data.

Some options

Option 1: CLI parameters

./run-multistep-job.py --network "foo" --subnetwork "bar" --service_account "email"

While this is the easiest to implement, it's rather awkward and repetitive, and it makes even a simple job harder to run.

Option 2: environment variables

In this case, the script would inspect the environment for specific variables such as $NETWORK and $SERVICE_ACCOUNT.

This is cleaner than parameters, but still requires additional environment setup (no pun intended).
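For comparison, here is a minimal sketch of what the environment lookup could look like. `$NETWORK` and `$SERVICE_ACCOUNT` are the names mentioned above; `$SUBNETWORK` and the helper name `overrides_from_env` are assumptions made for illustration, not existing code in this repo.

```python
import os


def overrides_from_env(environ=None):
    """Build a partial Batch job dict from environment variables.

    Hypothetical helper: NETWORK and SERVICE_ACCOUNT come from the issue
    text, SUBNETWORK is an assumed companion variable.
    """
    environ = os.environ if environ is None else environ

    # Collect the network interface fields that are actually set.
    interface = {}
    for var, key in (("NETWORK", "network"), ("SUBNETWORK", "subnetwork")):
        if environ.get(var):
            interface[key] = environ[var]

    policy = {}
    if interface:
        policy["network"] = {"networkInterfaces": [interface]}
    if environ.get("SERVICE_ACCOUNT"):
        policy["serviceAccount"] = {"email": environ["SERVICE_ACCOUNT"]}

    # Return an empty dict when nothing is set, so callers can skip the merge.
    return {"allocationPolicy": policy} if policy else {}
```

Every new setting needs another variable and another line in the script, which is part of why this option scales poorly.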

Option 3: configuration file

Pass a config filename on the command line, and merge its contents into the base Batch job JSON.

./run-multistep-job.py --config "config.json"

Then we have, in config.json something like:

{
  "allocationPolicy": {
    "location": {
      "allowedLocations": [
        "regions/us-central1"
      ]
    },
    "network": {
      "networkInterfaces": [
        {
          "network": "foo",
          "subnetwork": "bar",
          "noExternalIpAddress": true
        }
      ]
    },
    "serviceAccount": {
      "email": "a@b.com"
    }
  }
}

Note that the config follows the Batch job structure exactly, so it merges directly into the base job JSON. It also means users can parameterize anything Batch supports (even if our scripts don't yet have options for it) just by editing the config.
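A minimal sketch of the merge step, assuming the base job is already loaded as a Python dict. The names `deep_merge` and `load_job` are hypothetical, and replacing lists wholesale (rather than merging them element by element) is an assumption about the desired behavior, not something settled here.

```python
import copy
import json


def deep_merge(base, overrides):
    """Return a new dict: `base` with `overrides` merged in recursively."""
    merged = copy.deepcopy(base)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are objects: descend and merge key by key.
            merged[key] = deep_merge(merged[key], value)
        else:
            # Scalars and lists from the config replace the base value wholesale.
            merged[key] = copy.deepcopy(value)
    return merged


def load_job(base_job, config_path):
    """Hypothetical helper: apply the file named by --config to the base job."""
    with open(config_path) as f:
        return deep_merge(base_job, json.load(f))
```

With list replacement, a config that sets `networkInterfaces` or `allowedLocations` overrides the whole list from the base job, which seems like the least surprising behavior for per-project settings. Keys that the config does not mention are left untouched.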

Decision

Let's go with option 3: the config file.