chanzuckerberg / miniwdl

Workflow Description Language developer tools & local runner
MIT License

post-task chown failed: {'Error': None, 'StatusCode': 123} #618

Open cademirch opened 1 year ago

cademirch commented 1 year ago

Seems like a duplicate of #404

I am trying to run the viral workflow from the docs.

However, the workflow fails with this error: post-task chown failed: {'Error': None, 'StatusCode': 123}

Here is the whole error.json from _LAST:

{
  "error": "RunFailed",
  "task": "align_reads",
  "run": "call-align_to_ref-0",
  "dir": "/public/home/cmirchan/viral-pipelines-2.1.0.2/20221205_143811_assemble_refbased/call-align_to_ref-0",
  "cause": {
    "error": "AssertionError",
    "message": "post-task chown failed: {'Error': None, 'StatusCode': 123}",
    "run": "call-align_to_ref-0",
    "dir": "/public/home/cmirchan/viral-pipelines-2.1.0.2/20221205_143811_assemble_refbased/call-align_to_ref-0",
    "pos": {
      "source": "/public/home/cmirchan/viral-pipelines-2.1.0.2/pipes/WDL/tasks/tasks_assembly.wdl",
      "line": 243,
      "column": 1
    }
  },
  "pos": {
    "source": "/public/home/cmirchan/viral-pipelines-2.1.0.2/pipes/WDL/tasks/tasks_assembly.wdl",
    "line": 243,
    "column": 1
  },
  "traceback": [
    "Traceback (most recent call last):",
    "  File \"/public/home/cmirchan/miniconda3/envs/wdl-dev/lib/python3.10/site-packages/WDL/runtime/task.py\", line 202, in run_local_task",
    "    _try_task(cfg, task, logger, container, command, terminating)",
    "  File \"/public/home/cmirchan/miniconda3/envs/wdl-dev/lib/python3.10/site-packages/WDL/runtime/task.py\", line 586, in _try_task",
    "    return container.run(logger, command)",
    "  File \"/public/home/cmirchan/miniconda3/envs/wdl-dev/lib/python3.10/site-packages/WDL/runtime/task_container.py\", line 318, in run",
    "    exit_code = self._run(logger, terminating, command)",
    "  File \"/public/home/cmirchan/miniconda3/envs/wdl-dev/lib/python3.10/site-packages/WDL/runtime/backend/docker_swarm.py\", line 311, in _run",
    "    self.chown(",
    "  File \"/public/home/cmirchan/miniconda3/envs/wdl-dev/lib/python3.10/site-packages/WDL/runtime/backend/docker_swarm.py\", line 583, in chown",
    "    and chowner_status.get(\"StatusCode\", -1) == 0",
    "AssertionError: post-task chown failed: {'Error': None, 'StatusCode': 123}"
  ]
}
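For readers triaging a similar failure, the nested "cause" objects in an error.json like the one above can be walked programmatically to reach the innermost error. This is a minimal sketch: the helper names `root_cause` and `summarize` are hypothetical, and it assumes only the field layout visible in this report ("error", "message", "cause", "pos").

```python
import json
from pathlib import Path


def root_cause(err: dict) -> dict:
    """Follow nested "cause" objects down to the innermost error record."""
    while isinstance(err.get("cause"), dict):
        err = err["cause"]
    return err


def summarize(error_json_path: str) -> str:
    """Return a one-line summary of the innermost error in an error.json file."""
    err = json.loads(Path(error_json_path).read_text())
    cause = root_cause(err)
    pos = cause.get("pos", {})  # source position, if the record carries one
    return (
        f'{cause.get("error")}: {cause.get("message")} '
        f'({pos.get("source")}:{pos.get("line")})'
    )
```

Calling `summarize` on the error.json in the failed run directory would print the AssertionError message together with the WDL source file and line it is attributed to.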

I am new to WDL and miniwdl, so I'm not quite sure how to debug this. To me this seems like a permissions issue, but I'm not sure where it's coming from, as miniwdl doesn't seem to have problems reading and writing files in this directory.

Edit: More info

2nd Edit: Ran on another server successfully: I was able to run the workflow on a fresh cloud instance (Ubuntu 22.04) without this error. This further suggests a permissions issue on the troublesome server, as that is a university-maintained machine. I would appreciate advice on how to solve or debug this!

mlin commented 1 year ago

@cademirch Thanks for the report. I don't know exactly what the problem is, but I can describe what the "post-task chown" is meant to do, which may give enough context to narrow it down:

Processes running inside Docker containers very often run as root (uid=0), and as a result, any output files they leave behind on the host filesystem will be owned by root. This is annoying in the common case that the user isn't otherwise routinely operating as root, because they're left with output files that they can't rename or delete unless they sudo. To avoid this, miniwdl makes each task container chown all its output files to be owned by the invoking user id, as a postprocessing step in task execution.
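As a rough illustration of the idea described above (not miniwdl's actual implementation, which runs its chown step inside the task container), a recursive chown can be sketched as follows. The function name `chown_tree` is hypothetical. It collects failures instead of stopping at the first one, so that every rejected path is visible.

```python
import os


def chown_tree(root: str, uid: int, gid: int) -> list:
    """Chown root and everything under it; return (path, reason) for each failure."""
    paths = [root]
    # os.walk does not descend into symlinked directories by default.
    for dirpath, dirnames, filenames in os.walk(root):
        paths.extend(os.path.join(dirpath, name) for name in dirnames + filenames)

    failures = []
    for path in paths:
        try:
            # follow_symlinks=False changes the link itself, not its target.
            os.chown(path, uid, gid, follow_symlinks=False)
        except OSError as exc:
            failures.append((path, exc.strerror))
    return failures
```

For example, `chown_tree(run_dir, os.geteuid(), os.getegid())` attempts to hand every file to the invoking user. On a filesystem that refuses ownership changes, the returned list names the paths that were rejected and why.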

The error indicates that the OS rejected this attempt to chown the container's output files to the invoking user id. So, probably there's something in the configuration of the OS or (shared?) filesystem that prevents chowning files between users (even when running as root?). Does that seem plausible?
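One way to probe this hypothesis is to look inside the failed run directory for files that are still owned by a different user id, which would suggest the ownership change was rejected for them. The sketch below uses only the Python standard library. The function name `foreign_owned` is hypothetical, and it assumes the run directory is readable by the invoking user.

```python
import os


def foreign_owned(root: str, uid=None) -> list:
    """List (path, owner_uid) for entries under root not owned by uid.

    uid defaults to the effective user id of the current process.
    """
    uid = os.geteuid() if uid is None else uid
    found = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            st = os.lstat(path)  # lstat: inspect symlinks themselves
            if st.st_uid != uid:
                found.append((path, st.st_uid))
    return found
```

Running `foreign_owned` on the "dir" path reported in error.json would show whether any outputs remain owned by root (uid 0) or by another account. An empty result would instead point toward the chown command failing for some other reason.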

Singularity and udocker often conform more naturally to the kinds of constraints commonly imposed in HPC environments.

cademirch commented 1 year ago

Hi @mlin, thanks for your detailed reply. I'll try contacting our sysadmin... I'm not sure what it could be. If the chown is happening in the container then I'm not sure what permission issues could be blocking that.