Parsl / parsl

Parsl - a Python parallel scripting library
http://parsl-project.org
Apache License 2.0

Workqueue CI tests hang often in CI #1437

Closed benclifford closed 2 years ago

benclifford commented 4 years ago

Describe the bug CI fails relatively often on the workqueue testing step. It hangs in the middle of testing with no further output, and the CI eventually times out.

I'm concerned that this is a problem somewhere in the parsl/workqueue stack that will cause problems for other people.

crossref #1228 which is another issue that makes me think something isn't right in the stack.

To Reproduce Watch for sporadic CI hangs.

I haven't tried running this on a non-CI system to see if it happens there too.

Expected behavior CI should never hang.

Environment CI

benclifford commented 4 years ago

cc @dthain

tjdasso commented 4 years ago

Has this been hanging on @bash_apps specifically? From #1228 it seemed as though we were narrowing it down to issues with how wqex handles Bash apps, but I couldn't find the particular error within the executor. I'll try running the tests locally from #1228 and see if I can figure it out.

benclifford commented 4 years ago

Hopefully when I see it in CI again, I'll remember to see where it hangs, and note that here.

tjdasso commented 4 years ago

I'm looking deeper into how Parsl executors handle @bash_apps. I can run almost any @bash_app successfully, but the CI fails on certain bash apps, such as exit 15, which comes back with the wrong error code; the same happens with other tests like exit 3.141. When I run the command from a bash script I get the correct error code, but running exit 15 directly at the command line exits my terminal session. Perhaps the way that WQEx handles bash apps is causing the issue. Could you explain how the string returned by a bash app function is actually executed within Bash by a worker -- does it pass that string directly to bash and collect the results?

benclifford commented 4 years ago

In parsl/app/bash.py there is a function: remote_side_bash_executor.

This function is what runs on the Python side (roughly as if it were a @python_app) and does the bash_app-related work. This is what runs inside the workqueue worker.

It runs the bash_app function (on the worker) to generate the bash command line to run:

    executable = func(*args, **kwargs)

and then executes that executable:

    try:
        proc = subprocess.Popen(executable, stdout=std_out, stderr=std_err, shell=True, executable='/bin/bash')
        proc.wait(timeout=timeout)
        returncode = proc.returncode

Around that is some file handling and logging code.
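Putting those two fragments together, here is a minimal runnable sketch of the execution path (the function name `run_bash_app` is illustrative, not Parsl's actual API; the real code also wires up stdout/stderr files, timeouts, and logging):

```python
# Sketch (not Parsl's actual code) of how a @bash_app body is executed:
# the app function returns a command-line string, which is handed to
# /bin/bash via subprocess, and the shell's exit code is collected.
import subprocess

def run_bash_app(func, *args, **kwargs):
    executable = func(*args, **kwargs)   # the @bash_app body returns a string
    proc = subprocess.Popen(executable, shell=True, executable='/bin/bash')
    proc.wait()
    return proc.returncode

# "exit 15" run this way terminates only the subprocess shell, not your
# terminal, so the exit code can be observed:
rc = run_bash_app(lambda: "exit 15")
```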

"exit 15" at a command prompt is exactly what I'd expect - it exits bash which if it is the top level shell in your terminal window will make your window close. Below is a way to see the exit codes, by running a second bash:

top level shell$ bash
inner shell$ exit 15
exit
top level shell$ echo $?
15

benclifford commented 4 years ago

This just happened in a test for PR#1461 on python 3.7: (numbers at the start are the line numbers from travis build log)

$ pytest parsl -k "not cleannet" --config parsl/tests/configs/workqueue_ex.py --cov=parsl --cov-append --cov-report= --random-order --bodge-dfk-per-test
1998============================= test session starts ==============================
1999platform linux -- Python 3.7.1, pytest-4.0.2, py-1.7.0, pluggy-0.8.0
2000Using --random-order-bucket=module
2001Using --random-order-seed=410238
2002
2003rootdir: /home/travis/build/Parsl/parsl, inifile:
2004plugins: typeguard-2.6.1, xdist-1.26.1, random-order-1.0.4, forked-1.1.3, cov-2.8.1
2005collected 180 items / 3 deselected / 5 skipped                                 
2006
2007parsl/tests/test_bash_apps/test_basic.py ..                              [  1%]
2008parsl/tests/test_python_apps/test_pipeline.py ..                         [  2%]
2009parsl/tests/test_python_apps/test_at_scale.py sss                        [  3%]
2010parsl/tests/test_docker_multisite.py s                                   [  4%]
2011parsl/tests/test_bash_apps/test_memoize.py .                             [  5%]
2012parsl/tests/test_python_apps/test_fibonacci_recursive.py .               [  5%]
2013parsl/tests/test_docs/test_workflow3.py .                                [  6%]
2014parsl/tests/test_checkpointing/test_python_checkpoint_2.py s             [  6%]
2015parsl/tests/test_python_apps/test_rand_fail.py sss                       [  8%]
2016parsl/tests/test_regression/test_226.py sss                              [ 10%]
2017parsl/tests/test_staging/test_implicit_staging_https_in_task.py s        [ 10%]
2018parsl/tests/sites/test_worker_info.py s                                  [ 11%]
2019parsl/tests/test_docs/test_workflow2.py s                                [ 11%]
2020parsl/tests/test_docs/test_tutorial_1.py ss                              [ 12%]
2021parsl/tests/test_error_handling/test_htex_worker_failure.py s            [ 13%]
2022parsl/tests/test_data/test_file_ipp.py ...                               [ 15%]
2023parsl/tests/test_python_apps/test_memoize_1.py .                         [ 15%]
2024parsl/tests/test_checkpointing/test_regression_232.py ss                 [ 16%]
2025parsl/tests/test_flowcontrol/test_python_diamond.py s                    [ 17%]
2026parsl/tests/sites/test_local_monitoring.py ss                            [ 18%]
2027parsl/tests/test_threads/test_lazy_errors.py s                           [ 19%]
2028parsl/tests/test_aalst_patterns.py ssssssssssssssssssssssssss            [ 33%]
2029parsl/tests/test_python_apps/test_basic.py s....                         [ 36%]
2030parsl/tests/test_regression/test_97.py s                                 [ 37%]
2031parsl/tests/test_bash_apps/test_error_codes.py .s.sss                    [ 40%]
2032parsl/tests/test_thread_parallelism.py ss                                [ 41%]
2033parsl/tests/test_error_handling/test_retries.py sss                      [ 43%]
2034parsl/tests/test_bash_apps/test_multiline.py .                           [ 44%]
2035parsl/tests/test_error_handling/test_fail.py .                           [ 44%]
2036parsl/tests/test_python_apps/test_import_fail.py ..                      [ 45%]
2037parsl/tests/test_manual/test_regression_220.py s                         [ 46%]
2038parsl/tests/test_error_handling/test_python_walltime.py .                [ 46%]
2039parsl/tests/test_python_apps/test_worker_fail.py .                       [ 47%]
2040parsl/tests/test_docs/test_workflow1.py .                                [ 48%]
2041parsl/tests/sites/test_dynamic_executor.py s                             [ 48%]
2042parsl/tests/test_data/test_file_apps.py ..                               [ 49%]
2043parsl/tests/test_python_apps/test_type5.py ..                            [ 50%]
2044parsl/tests/test_python_apps/test_memoize_4.py .                         [ 51%]
2045parsl/tests/test_docs/test_from_slides.py .                              [ 51%]
2046parsl/tests/test_bash_apps/test_stdout.py ..........                     [ 57%]
2047parsl/tests/test_regression/test_69b.py ...s.                            [ 60%]
2048parsl/tests/test_bash_apps/test_file_bug_1.py s                          [ 61%]
2049parsl/tests/test_python_apps/test_fibonacci_iterative.py .               [ 61%]
2050parsl/tests/test_staging/test_docs_2.py s                                [ 62%]
2051parsl/tests/test_docs/test_workflow4.py .                                [ 62%]
2052parsl/tests/test_checkpointing/test_python_checkpoint_3.py s             [ 63%]
2053parsl/tests/test_checkpointing/test_periodic.py s                        [ 63%]
2054parsl/tests/test_bash_apps/test_apptimeout.py .
2055
2056No output has been received in the last 10m0s, this potentially indicates a stalled build or something wrong with the build itself.
2057Check the details on how to adjust your build configuration on: https://docs.travis-ci.com/user/common-build-problems/#build-times-out-because-no-output-was-received
2058
2059The build has been terminated

dthain commented 4 years ago

@tjdasso why don't you stop by and show me the exact code/script that is getting stuck. I might be able to give you some pointers on what to look out for.

On a related note, we have worked through some interesting boundary conditions in making the cctools WQ tests run reliably, and settled on this strategy: https://github.com/cooperative-computing-lab/cctools/blob/master/work_queue/test/work_queue_common.sh

benclifford commented 4 years ago

Note that there are two issues here: one is this one, #1437, which concerns hangs in CI; the other is #1228, which concerns running the parsl test suite against workqueue anywhere (including CI), and that test suite failing with some error that isn't clear to me.

They might be related, but I haven't seen evidence that they are.

tjdasso commented 4 years ago

I've been trying to reproduce this issue locally on my laptop, but my tests have successfully run to completion. However, one thing I noticed is that each individual test requires the worker to disconnect and reconnect with the master process when using the --bodge-dfk-per-test option, which relates back to the original problem in #1228 of needing this fix in the first place. @dthain mentioned that the continual process of disconnecting and reconnecting the worker with the master has caused issues in the past with WorkQueue testing. The strategy he mentioned with https://github.com/cooperative-computing-lab/cctools/blob/master/work_queue/test/work_queue_common.sh might work for CI testing purposes - it looks like the master process accepts any available port and then writes the port to a file, where the worker is then spun up with that port. If the worker doesn't continually disconnect and reconnect after each test, this might be a more foolproof testing strategy for CI.

benclifford commented 4 years ago

The non-bodge invocation of pytest should be the goal, I think: one DFK, one connection to workqueue, lots of interesting work done in that one session. So maybe it's more useful to fix #1228 and then hope that this hang goes away?

tjdasso commented 4 years ago

How does the wrapper script access the exit code of the @bash_app after the function is executed? With Python apps the workqueue_worker.py script catches the Exception of the function call, and successfully determines the error from there, but with Bash apps, the error seems to be caught within the Bash wrapper. Is there a way to access the returncode variable from the Bash app wrapper?

tjdasso commented 4 years ago

Ah, I see that the function naturally returns the exit code of the Bash app. Is there any way for the remote worker to know whether the function being run is a Bash app or a Python app? Because a Python app might successfully return a non-zero exit code, but for a Bash app this would indicate failure at the worker.

benclifford commented 4 years ago

The worker always runs a python function. The python function is usually some elaborated form wrapping the original app function in the user's source code. Bash apps wrap with remote_side_bash_executor, but other wrappings can happen too: for example, both the parsl monitoring and file staging code might add their own wrappers around the user function.

remote_side_bash_executor is a wrapper that runs in the worker and does this: runs the user function to get the command line to run, runs that command line, then looks at the exit code and either returns 0 or raises an exception.

So at that level there is no notion of a "return code" - just a python exception or a python return value which in general could be anything - the basic interface is return any python object or raise an exception.

I'm unclear how that maps into workqueue tasks though.

benclifford commented 2 years ago

Work Queue tests have not hung in CI in a long time. Closing.