ga4gh / task-execution-schemas

Apache License 2.0

Proposal: Task arrays #60

Open geoffjentry opened 7 years ago

geoffjentry commented 7 years ago

There was a discussion on the mailing list about providing task array functionality, which is a common feature among job schedulers.

buchanae commented 7 years ago

One approach would be to set an environment variable holding the task's index in the task array. This variable would be available when evaluating Executor.cmd.

buchanae commented 7 years ago

Related mailing list discussions:
https://groups.google.com/a/genomicsandhealth.org/forum/#!topic/ga4gh-cloud/qFq_jgoRCvs
https://groups.google.com/a/genomicsandhealth.org/forum/#!topic/ga4gh-cloud/ccaqjysBvZY

jbingham commented 7 years ago

FYI, we're in the process of adding exactly this functionality to our dsub command-line tool, setting an env var for the task index. We're doing it client side, since our server implementation doesn't support it server side.

buchanae commented 7 years ago

Ideas...

1) Task template + repeat count + loop index env. var.

Roughly:

tpl = {
  "executors": [
    { "cmd": ["echo", "$TASK_INDEX"]},
  ],
  "inputs": [{
    "path": "/path/to/storage",
  }],
}

tes.CreateTaskBatch(tpl, repeat=1000)
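
A minimal sketch of what idea 1 could expand to, done client side. The helper name `expand_with_index_env` is hypothetical, and the `environ` field name is borrowed from idea 4 in this comment rather than from any confirmed schema. Whether a server would expand `$TASK_INDEX` inside an exec-style `cmd` list is not settled in this thread, so the sketch wraps the command in `sh -c` to avoid relying on that.

```python
import copy


def expand_with_index_env(tpl, repeat, var_name="TASK_INDEX"):
    """Return `repeat` copies of `tpl`, each with its index set as an env var.

    Hypothetical helper: the `environ` key is an assumption about the task
    message layout, taken from the other examples in this thread.
    """
    tasks = []
    for index in range(repeat):
        task = copy.deepcopy(tpl)  # never mutate the shared template
        for executor in task.get("executors", []):
            # setdefault keeps any environ entries already in the template
            executor.setdefault("environ", {})[var_name] = str(index)
        tasks.append(task)
    return tasks


tpl = {
    "executors": [
        # sh -c so the shell expands the variable inside the container
        {"cmd": ["sh", "-c", "echo $TASK_INDEX"]},
    ],
    "inputs": [{"path": "/path/to/storage"}],
}

tasks = expand_with_index_env(tpl, repeat=3)
```

Every task gets the same inputs here, so each task has to derive its own work from the index at runtime.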

2) Task template + repeat count + template variables

Roughly:

tpl = {
  "executors": [
    { "cmd": ["echo", "{% TASK_INDEX %}"]},
  ],
  "inputs": [{
    "path": "/path/to/storage/{% TASK_INDEX %}",
  }],
}

tes.CreateTaskBatch(tpl, repeat=1000)
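
A rough sketch of how the `{% NAME %}` placeholders in idea 2 might be rendered. The function names `render` and `expand_repeat` are hypothetical, and the placeholder grammar (a single word between `{%` and `%}`) is an assumption based only on the example above.

```python
import re

# Assumed placeholder grammar: "{% NAME %}" with optional inner whitespace.
_PLACEHOLDER = re.compile(r"\{%\s*(\w+)\s*%\}")


def render(node, variables):
    """Recursively substitute placeholders in every string of a task message."""
    if isinstance(node, str):
        # An unknown placeholder name raises KeyError rather than passing silently.
        return _PLACEHOLDER.sub(lambda m: str(variables[m.group(1)]), node)
    if isinstance(node, list):
        return [render(item, variables) for item in node]
    if isinstance(node, dict):
        return {key: render(value, variables) for key, value in node.items()}
    return node  # numbers, booleans, None are left untouched


def expand_repeat(tpl, repeat):
    """Render the template once per index in range(repeat)."""
    return [render(tpl, {"TASK_INDEX": index}) for index in range(repeat)]


tpl = {
    "executors": [{"cmd": ["echo", "{% TASK_INDEX %}"]}],
    "inputs": [{"path": "/path/to/storage/{% TASK_INDEX %}"}],
}

tasks = expand_repeat(tpl, repeat=3)
```

Unlike idea 1, the index can appear in any string field, including input paths, because substitution happens before the task is submitted.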

3) Task template + template variables list

Roughly:

tpl = {
  "executors": [
    { "cmd": ["echo", "{% DRUG_NAME %}"]},
  ],
  "inputs": [{
    "path": "/path/to/storage/{% DRUG_NAME %}",
  }],
}

tes.CreateTaskBatch(tpl, vars=[
  {"DRUG_NAME": "foo"},
  {"DRUG_NAME": "bar"},
  # ...thousands of rows here...
])
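
A sketch of idea 3 where the variable rows come from a table, one task per row. Reading the rows with `csv.DictReader` is just one possible source and is my assumption, as are the helper names `substitute` and `expand_rows`; the inline CSV text stands in for a real file.

```python
import csv
import io
import re

# Assumed placeholder grammar: "{% NAME %}" with optional inner whitespace.
_PLACEHOLDER = re.compile(r"\{%\s*(\w+)\s*%\}")


def substitute(node, row):
    """Recursively replace placeholders in strings using one row of variables."""
    if isinstance(node, str):
        return _PLACEHOLDER.sub(lambda m: str(row[m.group(1)]), node)
    if isinstance(node, list):
        return [substitute(item, row) for item in node]
    if isinstance(node, dict):
        return {key: substitute(value, row) for key, value in node.items()}
    return node


def expand_rows(tpl, rows):
    """Produce one task per row of template variables."""
    return [substitute(tpl, row) for row in rows]


tpl = {
    "executors": [{"cmd": ["echo", "{% DRUG_NAME %}"]}],
    "inputs": [{"path": "/path/to/storage/{% DRUG_NAME %}"}],
}

# Stand-in for a file on disk: a header line, then one row per task.
table = io.StringIO("DRUG_NAME\nfoo\nbar\n")
rows = list(csv.DictReader(table))

tasks = expand_rows(tpl, rows)
```

With several columns in the table, each column name becomes a placeholder, so the same mechanism covers more than one variable per task.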

4) Task template + merging

tpl = {
  "executors": [
    { "environ": { "shared": "foo" } },
  ],
  "resources": {
    "cpus": 10,
  },
  "inputs": [
    {
      "path": "/container/path",
    },
    {
      "path": "/container/path",
    },
  ],
}

tes.CreateTaskBatch(tpl, partials=[
  # These are partial task messages, each row defining a specific override.
  {
    "executors": [
      { "cmd": ["echo", "task1"] },
    ],
    "inputs": [
      { "url": "/path/to/task1/input1.data" },
      { "url": "/path/to/task1/input2.data" },
    ],
  },
  # ...thousands of rows here...
])
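
The merge rules for idea 4 aren't spelled out above, so here is one possible interpretation as a sketch: dicts merge key by key, lists merge element by element by position, and anything else is overridden by the partial. The function name `merge` and these rules are assumptions, not a settled design.

```python
import copy


def merge(base, override):
    """Deep-merge `override` onto `base` without mutating either.

    Assumed rules: dicts merge by key, lists merge by position,
    and scalars (or mismatched types) take the override's value.
    """
    if isinstance(base, dict) and isinstance(override, dict):
        merged = copy.deepcopy(base)
        for key, value in override.items():
            if key in base:
                merged[key] = merge(base[key], value)
            else:
                merged[key] = copy.deepcopy(value)
        return merged
    if isinstance(base, list) and isinstance(override, list):
        merged = []
        for i in range(max(len(base), len(override))):
            if i < len(base) and i < len(override):
                merged.append(merge(base[i], override[i]))
            elif i < len(base):
                merged.append(copy.deepcopy(base[i]))
            else:
                merged.append(copy.deepcopy(override[i]))
        return merged
    return copy.deepcopy(override)


tpl = {
    "executors": [{"environ": {"shared": "foo"}}],
    "resources": {"cpus": 10},
    "inputs": [{"path": "/container/path"}, {"path": "/container/path"}],
}

partials = [
    {
        "executors": [{"cmd": ["echo", "task1"]}],
        "inputs": [
            {"url": "/path/to/task1/input1.data"},
            {"url": "/path/to/task1/input2.data"},
        ],
    },
]

tasks = [merge(tpl, partial) for partial in partials]
```

Merging lists by position is the fragile part: it only works if every partial lists its executors and inputs in the same order as the template.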

buchanae commented 7 years ago

This is a duplicate of #55, right? If so, I recommend we close #55.