Parsl / parsl

Parsl - a Python parallel scripting library
http://parsl-project.org
Apache License 2.0

Newly frequent WorkQueueTaskFailure in CI #2914

Open benclifford opened 11 months ago

benclifford commented 11 months ago

Describe the bug

I'm seeing this WorkQueueExecutor heisenbug happen in CI a lot recently, and I'm not clear what has changed to make it happen more often. For example, in https://github.com/Parsl/parsl/actions/runs/6518865549/job/17704749713

ERROR    parsl.dataflow.dflow:dflow.py:350 Task 207 failed after 0 retry attempts
Traceback (most recent call last):
  File "/home/runner/work/parsl/parsl/parsl/dataflow/dflow.py", line 301, in handle_exec_update
    res = self._unwrap_remote_exception_wrapper(future)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/runner/work/parsl/parsl/parsl/dataflow/dflow.py", line 571, in _unwrap_remote_exception_wrapper
    result = future.result()
             ^^^^^^^^^^^^^^^
  File "/opt/hostedtoolcache/Python/3.11.5/x64/lib/python3.11/concurrent/futures/_base.py", line 449, in result
    return self.__get_result()
           ^^^^^^^^^^^^^^^^^^^
  File "/opt/hostedtoolcache/Python/3.11.5/x64/lib/python3.11/concurrent/futures/_base.py", line 401, in __get_result
    raise self._exception
parsl.executors.workqueue.errors.WorkQueueTaskFailure: ('work queue result: The result file was not transfered from the worker.\nThis usually means that there is a problem with the python setup,\nor the wrapper that executes the function.\nTrace:\n', FileNotFoundError(2, 'No such file or directory'))
INFO     parsl.dataflow.dflow:dflow.py:1390 Standard output for task 207 available at std.out

I don't have any immediate strong ideas about what is going on. I've had a little poke around but can't see anything that sticks out right away.

I've opened:

I haven't been successful in reproducing this on my laptop. However, I have seen a related error on Perlmutter under certain high-load / high-concurrency conditions which is a bit more reproducible, so maybe I can debug from there.

cc @dthain

benclifford commented 11 months ago

Maybe related, maybe not: I've also seen this in CI. It looks like it has something to do with staging files in, not out. See https://github.com/Parsl/parsl/actions/runs/6519478342/job/17706018626

E               parsl.executors.errors.BadStateException: Executor WorkQueueExecutor failed due to: Error 1:
E                   EXIT CODE: 139
E                   STDOUT: Found cores : 2
E               Launching worker: 1
E               work_queue_worker: creating workspace /tmp/worker-1001-5848
E               work_queue_worker: using 2 cores, 6932 MB memory, 18382 MB disk, 0 gpus
E               connected to manager fv-az201-276:9000 via local address 10.1.0.39:38854
E               
E                   STDERR: Network function: connection from ('127.0.0.1', 50818)
E               Network function: recieved event: {'fn_kwargs': {}, 'fn_args': ['map', 'function', 'result'], 'remote_task_exec_method': 'direct'}
E               Network function: connection from ('127.0.0.1', 50824)
E               Network function: recieved event: {'fn_kwargs': {}, 'fn_args': ['map', 'function', 'result'], 'remote_task_exec_method': 'direct'}
E               Network function: connection from ('127.0.0.1', 50828)
E               Network function: recieved event: {'fn_kwargs': {}, 'fn_args': ['map', 'function', 'result'], 'remote_task_exec_method': 'direct'}
E               Network function: connection from ('127.0.0.1', 50834)
E               Network function: recieved event: {'fn_kwargs': {}, 'fn_args': ['map', 'function', 'result'], 'remote_task_exec_method': 'direct'}
E               Network function: connection from ('127.0.0.1', 40740)
E               Network function: recieved event: {'fn_kwargs': {}, 'fn_args': ['map', 'function', 'result'], 'remote_task_exec_method': 'direct'}
E               Network function: connection from ('127.0.0.1', 40756)
E               Network function: recieved event: {'fn_
E               ...
E               ': 'direct'}
E               Network function: connection from ('127.0.0.1', 38228)
E               Network function: recieved event: {'fn_kwargs': {}, 'fn_args': ['map', 'function', 'result'], 'remote_task_exec_method': 'direct'}
E               Network function: connection from ('127.0.0.1', 38236)
E               Network function: recieved event: {'fn_kwargs': {}, 'fn_args': ['map', 'function', 'result'], 'remote_task_exec_method': 'direct'}
E               Network function encountered exception  [Errno 2] No such file or directory: 't.271'
E               Traceback (most recent call last):
E                 File "/opt/hostedtoolcache/Python/3.8.18/x64/bin/parsl_coprocess.py", line 141, in <module>
E                   main()
E                 File "/opt/hostedtoolcache/Python/3.8.18/x64/bin/parsl_coprocess.py", line 69, in main
E                   task_id = int(input_spec[1])
E               IndexError: list index out of range
E               /home/runner/work/parsl/parsl/runinfo/003/submit_scripts/parsl.WorkQueueExecutor.block-0.1697310662.3360648.sh: line 10:  5848 Segmentation fault      (core dumped) PARSL_WORKER_BLOCK_ID=0 work_queue_worker --coprocess parsl_coprocess.py fv-az201-276 9000
benclifford commented 11 months ago

I've tried my DESC development branch of parsl with ndcctools 7.7.0 and still experience sporadic FileNotFound errors as reported in the main body of this issue.

dthain commented 11 months ago

So that error is almost certainly coming from this line, where the coprocess attempts to chdir to the task directory (t.271) corresponding to the function-call task: https://github.com/cooperative-computing-lab/cctools/blob/master/poncho/src/poncho/wq_network_code.py#L75

Now, it's hard for me to imagine that the directory does not really exist, because the worker creates it before sending the function to the coprocess. But it would be wise for the coprocess to check this and send back an error message.

But I think the problem is really that the coprocess doesn't do the complementary chdir('..') under all exit paths. For example, if the coprocess catches an exception, it skips the chdir('..') on the way out. So I think we need a more idempotent approach that always returns to the same absolute directory each time through the loop.
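A minimal sketch of the pattern being proposed, under stated assumptions: this is illustrative Python, not the actual cctools/poncho coprocess code, and `handle_task` and its return shape are hypothetical names for one iteration of the coprocess loop. It combines the two ideas above: check that the task sandbox exists before entering it, and always return to a fixed absolute directory rather than relying on a relative chdir('..'):

```python
import os

def handle_task(base_dir: str, task_id: int):
    """One loop iteration of a coprocess-style task handler (illustrative;
    names and return shape are hypothetical, not the real poncho code)."""
    task_dir = os.path.join(base_dir, f"t.{task_id}")
    if not os.path.isdir(task_dir):
        # Surface a useful error instead of a bare FileNotFoundError
        # propagating out of os.chdir().
        return {"error": f"task sandbox missing: {task_dir}"}
    os.chdir(task_dir)
    try:
        result = {"result": "ok"}  # stand-in for executing the function call
    finally:
        # Always return to the same absolute directory no matter how the
        # task exited; a relative chdir('..') drifts after an exception.
        os.chdir(base_dir)
    return result
```

The `finally` block is what makes the loop idempotent with respect to the working directory: even if the task body raises, the next iteration starts from `base_dir`.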

@tphung3 what do you think?

tphung3 commented 11 months ago

@benclifford I just merged a fix to the chdir error (see https://github.com/cooperative-computing-lab/cctools/pull/3542), what's the quickest way to see if it works?

benclifford commented 11 months ago

@tphung3 if you have a URL for a cctools binary (from anywhere; it doesn't need to be an official release), it is hopefully easy to make a branch of parsl, edit the install path for ndcctools, work around the dependency problem mentioned elsewhere, and see what happens.

dthain commented 11 months ago

See draft release here with fix included: https://github.com/cooperative-computing-lab/cctools/releases/download/untagged-d48ba4a269ffbfaf30dc/cctools-7.7.0.parsl.wq.chdir.fix-x86_64-ubuntu20.04.tar.gz

dthain commented 11 months ago

Correction: https://github.com/cooperative-computing-lab/cctools/releases/tag/parsl-wq-chdir-fix

benclifford commented 10 months ago

On the desc parsl branch, I'm still seeing some segfaults and other work queue problems, for example here:

https://github.com/Parsl/parsl/actions/runs/6668296438/job/18123571251?pr=2012#step:6:9134

I don't have a feel for whether this is something breaking in the parsl branch-specific functionality, which then breaks things in WQ, or something else going on, so I'm just noting the error here for now.

dthain commented 10 months ago

It looks like this test is running cctools 7.7.1, but the fix for that segfault is in 7.7.2: https://github.com/cooperative-computing-lab/cctools/releases/tag/release%2F7.7.2

benclifford commented 10 months ago

OK, it's easy to bump that branch up by 0.0.1; I'll do that now.

benclifford commented 10 months ago

I'm still seeing this in the desc branch of parsl in CI sometimes:

Network function: connection from ('127.0.0.1', 60014)
Network function: recieved event: {'fn_kwargs': {}, 'fn_args': ['map', 'function', 'result', 'log'], 'remote_task_exec_method': 'direct'}
Network function: connection from ('127.0.0.1', 60020)
Network function: recieved event: {'fn_kwargs': {}, 'fn_args': ['map', 'function', 'result', 'log'], 'remote_task_exec_method': 'direct'}
Network function encountered exception  [Errno 2] No such file or directory: 't.107'
Traceback (most recent call last):
  File "/home/runner/work/parsl/parsl/.venv/bin/parsl_coprocess.py", line 135, in <module>
    main()
  File "/home/runner/work/parsl/parsl/.venv/bin/parsl_coprocess.py", line 68, in main
    task_id = int(input_spec[1])
IndexError: list index out of range
/home/runner/work/parsl/parsl/runinfo/003/submit_scripts/parsl.WorkQueueExecutor.block-0.1699912866.767312.sh: line 10:  6144 Segmentation fault      (core dumped) PARSL_WORKER_BLOCK_ID=0 work_queue_worker --coprocess parsl_coprocess.py fv-az340-503 9000

https://github.com/Parsl/parsl/actions/runs/6854767971/job/18642922623?pr=2012#step:7:1883

This is with CCTOOLS_VERSION=7.7.2

dthain commented 10 months ago

Hmm, that is surprising; @tphung3 will look into it. We are at Supercomputing in Denver this week, so this may be a bit delayed.

dthain commented 9 months ago

OK, I think we see where the problem is; let me bring in @colinthomas-z80, who is going to sort things out.

colinthomas-z80 commented 9 months ago

It appears this was fixed in the cctools library code, but the fix didn't get copied over here. See the PR linked above.

colinthomas-z80 commented 9 months ago

Would it be feasible to include the generation of parsl_coprocess.py somewhere in the build process?

benclifford commented 9 months ago

I would like that. I don't know enough about Python build/install tooling to know how to do it, but some packages manage to compile C code and so on during their builds, so I'd guess it's possible.

dthain commented 7 months ago

Let's do this generation at runtime by running poncho_package_serverize appropriately, which is what we do in native TaskVine applications.