RuRo opened this issue 4 years ago
@RuRo this is an unfortunate issue. The working hypothesis is that there is a bug in one of the numpy operators, causing some later test to hang. @szha is helping to switch CI to run all unittests via the naive engine instead of the threaded engine, which may help prevent a buggy operator A from causing a hang in some later test. There is also the idea of timing out the job if there is no new output for 30 minutes, but nobody owns implementing this yet AFAIK. Any help would be appreciated.
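For reference, a minimal sketch of what such a no-output watchdog could look like (hypothetical code, not the actual CI implementation; the helper name, the 10-second poll interval, and the use of a plain subprocess pipe are all assumptions):

```python
# Hypothetical sketch of the proposed watchdog (not the actual CI code):
# run a command and kill it if it prints nothing for 30 minutes.
import os
import select
import subprocess
import sys
import time

SILENCE_LIMIT = 30 * 60  # seconds without output before the job is killed

def run_with_watchdog(cmd):
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT)
    last_output = time.time()
    while proc.poll() is None:
        # Wait up to 10 seconds for new data on the child's stdout.
        ready, _, _ = select.select([proc.stdout], [], [], 10)
        if ready:
            chunk = os.read(proc.stdout.fileno(), 4096)
            if chunk:
                sys.stdout.buffer.write(chunk)
                sys.stdout.flush()
                last_output = time.time()
        elif time.time() - last_output > SILENCE_LIMIT:
            proc.kill()
            raise TimeoutError("no output for 30 minutes; job aborted")
    return proc.returncode
```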
@josephevans is looking into a 30m auto timeout.
The 3 hours you see is the global timeout we are setting for jobs, so that abort is intentional.
Okay! I think I was able to track down the source of this problem. I noticed that the last 3 times the pipeline froze for me, the last output was `test_operator_gpu.test_np_true_divide`. The next test that is supposed to run after that is `test_operator_gpu.test_np_unary_bool_funcs`.
After ducking around with `test_operator_gpu.test_np_unary_bool_funcs` a bit, I was able to reproduce the hanging behaviour locally. I just ran the test in a loop, printing something in between runs, and after a while the process got stuck and stopped printing.
It didn't happen every time, and setting PRNG seeds didn't make it fail deterministically, so the issue is probably thread-related. Also, it definitely doesn't happen with `MXNET_ENGINE_TYPE=NaiveEngine`.
After making a reproducible setup, I spent some time bisecting the `test_operator_gpu.test_np_unary_bool_funcs` test. In the end, I was able to narrow it down to the `np.empty_like` operator:
```python
import mxnet as mx

with mx.Context(mx.gpu()):
    while True:
        for _ in range(100):
            mx_data = mx.np.array(0)
            mx_x = mx.np.empty_like(mx_data)
        print(end='.', flush=True)
```
The above piece of code executes the `empty_like` operator and prints a dot once every 100 iterations; after ~20 seconds it gets stuck. I took a quick look at how `empty_like` is implemented but wasn't able to figure out what exactly is wrong with it. It looks like a really thin wrapper around `CustomOp`, which is used from C++, so I am giving up for now.
@RuRo Thank you for investigating the issue! Unfortunately I can't repro based on your script. Can you provide more details on your environment (mxnet version, cuda version, gpu architecture, etc)?
However, interrupting (`Ctrl+C`) the repro script that you shared would sometimes print `Segmentation fault: 11`, but sometimes also provide a meaningful backtrace:
```
Traceback (most recent call last):
  File "test.py", line 7, in <module>
    mx_x = mx.np.empty_like(mx_data)
  File "/home/ubuntu/src/mxnet-master/python/mxnet/numpy/multiarray.py", line 2579, in empty_like
    return _mx_nd_np.empty_like(prototype, dtype=dtype, order=order, subok=subok, shape=shape)
  File "/home/ubuntu/src/mxnet-master/python/mxnet/ndarray/numpy/_op.py", line 512, in empty_like
    return _npi.empty_like_fallback(prototype, dtype=dtype, order=order, subok=subok, shape=shape)
  File "<string>", line 40, in Custom
  File "/home/ubuntu/src/mxnet-master/python/mxnet/_ctypes/ndarray.py", line 82, in _imperative_invoke
    check_call(_LIB.MXImperativeInvokeEx(
  File "/home/ubuntu/src/mxnet-master/python/mxnet/base.py", line 246, in check_call
    raise get_last_ffi_error()
mxnet.base.MXNetError: Traceback (most recent call last):
  File "../include/mxnet/./tuple.h", line 141
MXNetError: Check failed: ndim() >= 0 (-1 vs. 0) :
```
You need some lucky timing to get this backtrace, so try a few times if you can't see it.
Here is my environment:
My mxnet install is built from source from the latest master as of a couple of days ago. CUDA version is 10.2. GPU is a GTX 1050.
The timing for the lockup to happen can vary quite a lot. I'd recommend starting a couple copies of this script and leaving them running for a minute or so to be absolutely sure.
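For example, a hypothetical helper (not from the repo) to launch several copies of the reproducer in parallel, assuming it is saved as `test.py`:

```python
# Launch several copies of the reproducer at once to hit the hang faster.
import subprocess

procs = [subprocess.Popen(["python", "test.py"]) for _ in range(4)]
for p in procs:
    p.wait()  # a hung copy never returns; watch for its dots to stop instead
```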
Interrupting the script before it hangs indeed sometimes results in a segmentation fault. After the script hangs, it seems to ignore `SIGINT` completely and has to be killed by `SIGTERM` or something like that.
Also, here is the feature list for my mxnet install
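(For anyone who wants to compare against their own build: assuming a recent MXNet version that ships the `mxnet.runtime` module, the feature list can be printed like this.)

```python
# Print the compile-time feature flags of the local MXNet build.
# Assumes a recent MXNet that provides mxnet.runtime.feature_list().
import mxnet.runtime

for feature in mxnet.runtime.feature_list():
    print(feature)
```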
Update: some further testing revealed that:
1) It's not a GPU-specific problem; I removed `with mx.Context(mx.gpu()):` and the problem persists.
2) Running the test with `MXNET_ENABLE_CYTHON=0 python test.py` fixes the issue. (Both the `Python 3: GPU` and `Python 3: GPU TVM_OP OFF` test steps run the `python3_gpu_ut_cython` pipeline, which uses cython.)
Also, interestingly enough, all the CPU pipelines set `export MXNET_ENABLE_CYTHON=0`.
@leezu can you try to reproduce with a recent mxnet version that is built with cython support and run the script with `MXNET_ENFORCE_CYTHON=1 python test.py`? Unfortunately, I wasn't able to find any prebuilt wheels/artifacts that were built with cython, so you might have to build from source yourself.
While we are waiting for somebody to reproduce, I've tried playing with the stuck process in gdb. Here is the backtrace of all the threads after they get stuck. As you can see, all the threads seem to be in either `pthread_cond_wait`, `pthread_cond_timedwait`, or `pthread_mutex_lock`, so this is indeed probably a deadlock situation.
https://github.com/apache/incubator-mxnet/issues/18014 appears to be due to wrong mutex handling and may be related.
@RuRo wrt reproducing, we should first fix the `Check failed: ndim() >= 0 (-1 vs. 0) :` error triggered by the `CustomOp`. It appears that `CustomOp` swallows the error, which may cause the subsequent hang.
Are you sure that the backtrace you got after interrupting is in any way related to this issue? It doesn't seem likely to me.
1) The backtrace you got is in `_ctypes/ndarray.py`. The issue I was able to reproduce is in the cython bindings. If I understand correctly, when cython is used, `_ctypes` isn't even imported; `_cy3.ndarray` is used instead. I mentioned that I also sometimes got `Segmentation fault: 11` after interrupting the script, but I don't get the `Check failed: ndim() >= 0 (-1 vs. 0)` error.
2) I don't see anything too incriminating in that backtrace? In my experience, mxnet (or any other package with C bindings) often spews random errors if interrupted. For example, you can also get this backtrace:
```
Traceback (most recent call last):
  File "_ctypes/callbacks.c", line 237, in 'calling callback function'
  File "/usr/lib/python3.8/site-packages/mxnet/operator.py", line 973, in declare_backward_dependency_entry
    rdeps = cast(c_array_buf(c_int, array('i', rdeps)), c_int_p)
  File "/usr/lib/python3.8/site-packages/mxnet/base.py", line 474, in c_array_buf
    return (ctype * len(buf)).from_buffer(buf)
KeyboardInterrupt
Traceback (most recent call last):
  File "test.py", line 6, in <module>
    mx_x = mx.np.empty_like(mx_data)
  File "/usr/lib/python3.8/site-packages/mxnet/numpy/multiarray.py", line 2582, in empty_like
    return _mx_nd_np.empty_like(prototype, dtype=dtype, order=order, subok=subok, shape=shape)
  File "/usr/lib/python3.8/site-packages/mxnet/ndarray/numpy/_op.py", line 512, in empty_like
    return _npi.empty_like_fallback(prototype, dtype=dtype, order=order, subok=subok, shape=shape)
  File "<string>", line 40, in Custom
  File "mxnet/cython/ndarray.pyx", line 219, in mxnet._cy3.ndarray._imperative_invoke
  File "mxnet/cython/./base.pyi", line 41, in mxnet._cy3.ndarray.CALL
mxnet.base.MXNetError: Traceback (most recent call last):
  File "/home/custompkgs/PKGBUILDS/mxnet-ruro-git/src/mxnet-ruro-git/src/operator/custom/custom.cc", line 121
MXNetError: Check failed: reinterpret_cast<CustomOpBwdDepFunc>( params.info->callbacks[kCustomOpPropDeclareBackwardDependency])( out_grad.data(), in_data.data(), out_data.data(), &num_dep, &rdeps, params.info->contexts[kCustomOpPropDeclareBackwardDependency]):
```
or this backtrace with a segfault:
```
Traceback (most recent call last):
  File "_ctypes/callbacks.c", line 237, in 'calling callback function'
  File "/usr/lib/python3.8/site-packages/mxnet/operator.py", line 935, in list_arguments_entry
    try:
KeyboardInterrupt
Segmentation fault: 11
terminate called without an active exception
[1] 1437249 abort (core dumped) python test.py
```
3) I grepped around a little bit, and it seems to me that every call to cffi/cython is properly wrapped with `check_call` or the equivalent, and I don't see any suspicious `try/except` clauses in the python part of the `CustomOp` implementation.
Also, I don't quite understand: even if there are unhandled errors somewhere in `CustomOp`, I still don't see how that would cause a deadlock.
@RuRo, @reminisce helped point out an error in your reproducer script. You use a zero-shape tensor but didn't enable numpy shape semantics. It should be:
```python
import mxnet as mx

mx.npx.set_np()
with mx.Context(mx.gpu()):
    while True:
        for _ in range(100):
            mx_data = mx.np.array(0)
            mx_x = mx.np.empty_like(mx_data)
            mx.nd.waitall()
        print(end='.', flush=True)
```
This explains the error in https://github.com/apache/incubator-mxnet/issues/18090#issuecomment-616025939
You're right, it shouldn't cause a deadlock.
> @leezu can you try to reproduce with a recent mxnet version that is built with cython support and run the script with `MXNET_ENFORCE_CYTHON=1 python test.py`? Unfortunately, I wasn't able to find any prebuilt wheels/artifacts that were built with cython, so you might have to build from source yourself.
I can reproduce the hang with the `cmake -DBUILD_CYTHON_MODULES=1` build.
> @RuRo, @reminisce helped point out an error in your reproducer script. You use a zero-shape tensor but didn't enable numpy shape semantics. It should be
Ah, I see. I had the `npx.set_np()` call when I started reproducing this issue, but I removed it while trying to simplify the bug, since it made no difference for the hanging behaviour. I was under the impression that scalar arrays are enabled by default for the `mx.np` kind of array. My bad.
Also, the hang happens with or without the `mx.nd.waitall()` call. Also, as previously mentioned, you don't need the GPU context. So the short version is:
```python
import mxnet as mx

mx.npx.set_np()
while True:
    for _ in range(100):
        mx_data = mx.np.array(0)
        mx_x = mx.np.empty_like(mx_data)
    print(end='.', flush=True)
```
Any action item to unblock the CI now?
With respect to the numpy op triggering the issue: we can replace the ones based on CustomOp with native operators. There are only 4 in this file https://github.com/apache/incubator-mxnet/blob/f619c52dfc3787e3097b37ba28f7065bca910b97/python/mxnet/numpy_op_fallback.py
For CustomOp itself, we may want to rewrite it based on PackedFunc or drop it.
Are we sure that this is 100% a `CustomOp`-specific problem and not a deeper issue with the current cython implementation that `CustomOp` just happened to bring to light? After all, the issue doesn't occur with `MXNET_ENABLE_CYTHON=0`.
Also, dropping `CustomOp` doesn't seem like a good idea, since it's the current public API for implementing custom operators in higher-level languages.
Also, @zhreshold the CI is technically not blocked, since the test is just unstable, not completely impossible to pass. This issue has been around for a while, and until now the solution was just to restart `unix-gpu` with the `mxnet-bot` until it passes. Not a pretty solution, but if you are in a hurry, it's an option.
Also, @leezu if fixing CI is a priority, we could temporarily disable all the CI tests which use the `CustomOp`-based operators while a proper fix is being worked on.
By the way, the official tutorial for `CustomOp` claims that you must wait for all `CustomOp`s to complete before forking, or else "the program will be blocked because of python GIL". I am pretty sure I saw new threads being created while the main loop was already running, while I was debugging with GDB. Could this be the problem?
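For context, the pattern the tutorial prescribes looks roughly like this (a minimal sketch; the fork itself is only illustrative):

```python
# Sketch of the rule from the CustomOp tutorial: drain all outstanding
# engine work (including CustomOp callbacks) before forking.
import os
import mxnet as mx

# ... run operators that may dispatch to CustomOp-based implementations ...
mx.nd.waitall()  # block until no async operation (or its callback) is pending
pid = os.fork()  # only fork once no CustomOp callback can still need the GIL
```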
A little further investigation in GDB with some debug symbols revealed the following information:
Thread 1 (the main thread) is in `CustomOperator::Find` (it shows up as `mxnet::op::custom::AttrParser` in the backtrace because of inlining). Thread 1 is waiting for `CustomOperator::mutex_`, which is currently held by Thread 13.
Thread 13 is in `_CallPythonObject` (python internals called from `libffi`), trying to call `register.do_register.creator.create_operator_entry.delete_entry`, which is a callback for `CustomOp::del`. To do that, it tries to acquire the Global Interpreter Lock, which is currently held by Thread 1.
I am really not sure how Thread 13 could arrive at the `CustomOp` deleter while holding `CustomOperator::mutex_`.
```
, globals=, locals=, args=, argcount=, kwnames=, kwargs=0x7fff142e60e8, kwcount=, kwstep=1, defs=0x0, defcount=0, kwdefs=0x0, closure=0x0, name='Custom', qualname='Custom') at Python/ceval.c:4298
#14 0x00007ffff7b4c969 in _PyFunction_Vectorcall (func=, stack=0x7fff142e60e0, nargsf=, kwnames=) at Objects/call.c:435
#15 0x00007ffff7b476da in _PyObject_FastCallDict (callable=, args=0x7ffef55f16f8, nargsf=, kwargs=) at Objects/call.c:104
#16 0x00007ffff7bf2fbe in partial_fastcall (pto=, pto=, kwargs=Python Exception Attempt to extract a component of a value that is not a struct/class/union.:
, nargs=1, args=) at ./Modules/_functoolsmodule.c:169
#17 partial_call (pto=0x7fff17ec30c0, args=, kwargs=) at ./Modules/_functoolsmodule.c:224
#18 0x00007ffff7b1325f in _PyObject_MakeTpCall (callable=, args=, nargs=, keywords=) at Objects/call.c:168
#19 0x00007ffff7b83429 in _PyObject_Vectorcall (kwnames=, nargsf=, args=0x555557e4ce90, callable=) at Python/ceval.c:1555
#20 call_function (kwnames=, oparg=, pp_stack=, tstate=0x55555557e5f0) at Python/ceval.c:4987
#21 _PyEval_EvalFrameDefault (f=, throwflag=) at Python/ceval.c:3515
#22 0x00007ffff7b4bc9a in _PyEval_EvalCodeWithName (_co=, globals=, locals=, args=, argcount=, kwnames=, kwargs=0x7fff17ed41f0, kwcount=, kwstep=1, defs=0x7fff78fb2488, defcount=4, kwdefs=0x0, closure=0x0, name='empty_like', qualname='empty_like') at Python/ceval.c:4298
#23 0x00007ffff7b4c969 in _PyFunction_Vectorcall (func=, stack=0x7fff17ed41e8, nargsf=, kwnames=) at Objects/call.c:435
#24 0x00007ffff7b7ecb9 in _PyObject_Vectorcall (kwnames=('dtype', 'order', 'subok', 'shape'), nargsf=, args=0x7fff17ed41e8, callable=) at ./Include/cpython/abstract.h:92
#25 call_function (kwnames=('dtype', 'order', 'subok', 'shape'), oparg=, pp_stack=, tstate=0x55555557e5f0) at Python/ceval.c:4987
#26 _PyEval_EvalFrameDefault (f=, throwflag=) at Python/ceval.c:3515
#27 0x00007ffff7b4bc9a in _PyEval_EvalCodeWithName (_co=, globals=, locals=, args=, argcount=, kwnames=, kwargs=0x5555555e49e8, kwcount=, kwstep=1, defs=0x7fff78be8a88, defcount=4, kwdefs=0x0, closure=0x0, name='empty_like', qualname='empty_like') at Python/ceval.c:4298
#28 0x00007ffff7b4c969 in _PyFunction_Vectorcall (func=, stack=0x5555555e49e0, nargsf=, kwnames=) at Objects/call.c:435
#29 0x00007ffff7b7e2ea in _PyObject_Vectorcall (kwnames=0x0, nargsf=9223372036854775809, args=0x5555555e49e0, callable=) at ./Include/cpython/abstract.h:92
#30 call_function (kwnames=0x0, oparg=, pp_stack=, tstate=0x55555557e5f0) at Python/ceval.c:4987
#31 _PyEval_EvalFrameDefault (f=, throwflag=) at Python/ceval.c:3469
#32 0x00007ffff7b4bc9a in _PyEval_EvalCodeWithName (_co=, globals=, locals=, args=, argcount=, kwnames=, kwargs=0x0, kwcount=, kwstep=2, defs=0x0, defcount=0, kwdefs=0x0, closure=0x0, name=0x0, qualname=0x0) at Python/ceval.c:4298
#33 0x00007ffff7be52da in PyEval_EvalCodeEx (_co=, globals=, locals=, args=, argcount=, kws=, kwcount=0, defs=0x0, defcount=0, kwdefs=0x0, closure=0x0) at Python/ceval.c:4327
#34 0x00007ffff7be52fc in PyEval_EvalCode (co=co@entry=, globals=globals@entry=Python Exception Attempt to extract a component of a value that is not a struct/class/union.:
, locals=locals@entry=Python Exception Attempt to extract a component of a value that is not a struct/class/union.:
) at Python/ceval.c:718
#35 0x00007ffff7be53aa in run_eval_code_obj (co=0x7ffff73952b0, globals=Python Exception Attempt to extract a component of a value that is not a struct/class/union.:
, locals=Python Exception Attempt to extract a component of a value that is not a struct/class/union.:
) at Python/pythonrun.c:1125
#36 0x00007ffff7c2a134 in run_mod (mod=, filename=, globals=Python Exception Attempt to extract a component of a value that is not a struct/class/union.:
, locals=Python Exception Attempt to extract a component of a value that is not a struct/class/union.:
, flags=, arena=) at Python/pythonrun.c:1147
#37 0x00007ffff7af1a95 in PyRun_FileExFlags (fp=fp@entry=0x555555621020, filename_str=filename_str@entry=0x7ffff742aee0 "test.py", start=start@entry=257, globals=globals@entry=Python Exception Attempt to extract a component of a value that is not a struct/class/union.:
, locals=locals@entry=Python Exception Attempt to extract a component of a value that is not a struct/class/union.:
, closeit=closeit@entry=1, flags=0x7ffffffe0578) at Python/pythonrun.c:1063
#38 0x00007ffff7af4600 in PyRun_SimpleFileExFlags (fp=fp@entry=0x555555621020, filename=, closeit=closeit@entry=1, flags=flags@entry=0x7ffffffe0578) at Python/pythonrun.c:428
#39 0x00007ffff7af4995 in PyRun_AnyFileExFlags (fp=fp@entry=0x555555621020, filename=0x0, closeit=closeit@entry=1, flags=flags@entry=0x7ffffffe0578) at Python/pythonrun.c:86
#40 0x00007ffff7c2c796 in pymain_run_file (cf=0x7ffffffe0578, config=0x55555557d8e0) at Modules/main.c:381
#41 pymain_run_python (exitcode=0x7ffffffe0570) at Modules/main.c:565
#42 Py_RunMain () at Modules/main.c:644
#43 0x00007ffff7c2c889 in Py_BytesMain (argc=, argv=) at Modules/main.c:698
#44 0x00007ffff7dbd023 in __libc_start_main () at /usr/lib/libc.so.6
#45 0x000055555555505e in _start ()
(gdb) print (**(pthread_mutex_t **)($sp+0x10)).__data.__owner
2252813
(gdb) info thread
Id Target Id Frame
...
13 Thread 0x7ffef77fe700 (LWP 2252813) "python" 0x00007ffff799e01a in pthread_cond_timedwait@@GLIBC_2.3.2 () from /usr/lib/libpthread.so.0
(gdb) thread 13
(gdb) backtrace
#0 0x00007ffff799e01a in pthread_cond_timedwait@@GLIBC_2.3.2 () at /usr/lib/libpthread.so.0
#1 0x00007ffff7b0a9fb in PyCOND_TIMEDWAIT (us=, mut=0x7ffff7d759b0 <_PyRuntime+1232>, cond=0x7ffff7d75980 <_PyRuntime+1184>) at Python/condvar.h:73
#2 take_gil (tstate=0x7ffee80022b0, ceval=0x7ffff7d75728 <_PyRuntime+584>) at Python/ceval_gil.h:206
#3 PyEval_RestoreThread (tstate=0x7ffee80022b0) at Python/ceval.c:399
#4 0x00007ffff7a30b8d in PyGILState_Ensure () at Python/pystate.c:1298
#5 0x00007ffff7217d4c in _CallPythonObject (pArgs=0x7ffef77fdc50, flags=, converters=(<_ctypes.PyCSimpleType at remote 0x5555556405f0>,), callable=, setfunc=, restype=0x7ffff725b638, mem=0x7ffef77fddd0) at /home/custompkgs/PKGBUILDS/python-dbg/src/Python-3.8.2/Modules/_ctypes/callbacks.c:145
#6 closure_fcn (cif=, resp=0x7ffef77fddd0, args=0x7ffef77fdc50, userdata=) at /home/custompkgs/PKGBUILDS/python-dbg/src/Python-3.8.2/Modules/_ctypes/callbacks.c:297
#7 0x00007ffff71938c2 in () at /usr/lib/libffi.so.7
#8 0x00007ffff7193c20 in () at /usr/lib/libffi.so.7
#9 0x00007fffc69ba782 in () at /usr/lib/python3.8/site-packages/mxnet/libmxnet.so
#10 0x00007fffc624b0b7 in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() () at /usr/lib/python3.8/site-packages/mxnet/libmxnet.so
#11 0x00007fffc69bb33e in () at /usr/lib/python3.8/site-packages/mxnet/libmxnet.so
#12 0x00007fffc69bfeca in () at /usr/lib/python3.8/site-packages/mxnet/libmxnet.so
#13 0x00007fffc62c1794 in std::thread::_State_impl > >::_M_run() () at /usr/lib/python3.8/site-packages/mxnet/libmxnet.so
#14 0x00007fff7df19b24 in std::execute_native_thread_routine(void*) (__p=0x555557d8b890) at /build/gcc/src/gcc/libstdc++-v3/src/c++11/thread.cc:80
#15 0x00007ffff799746f in start_thread () at /usr/lib/libpthread.so.0
#16 0x00007ffff7e953d3 in clone () at /usr/lib/libc.so.6
(gdb) print *(PyFunctionObject *)0x7ffef568e050
{
...
func_doc = 'C Callback for CustomOp::del',
func_name = 'delete_entry',
func_dict = 0x0,
func_weakreflist = 0x0,
func_module = 'mxnet.operator',
func_annotations = 0x0,
func_qualname = 'register..do_register..creator..create_operator_entry..delete_entry',
vectorcall = 0x7ffff7b4c750 <_PyFunction_Vectorcall>
}
(gdb) print/x ((PyThreadState*)_PyRuntime.ceval.gil.last_holder)->thread_id
0x7ffff7839740
(gdb) info thread
Id Target Id Frame
1 Thread 0x7ffff7839740 (LWP 2252671) "python" 0x00007ffff79a14cf in __lll_lock_wait () from /usr/lib/libpthread.so.0
...
```
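To make the cycle above concrete: this is a classic lock-order inversion, which can be reproduced in miniature with two plain locks standing in for the GIL and `CustomOperator::mutex_` (a toy illustration, not mxnet code):

```python
# Toy lock-order inversion mirroring the hang described above:
# "gil" stands in for the Python GIL, "mutex" for CustomOperator::mutex_.
# The sleeps force the fatal interleaving; both threads then block forever.
import threading
import time

gil = threading.Lock()
mutex = threading.Lock()

def thread_1():          # like the main thread stuck in CustomOperator::Find
    with gil:            # main thread holds the GIL...
        time.sleep(0.1)
        with mutex:      # ...then waits for mutex_ (held by thread 13)
            pass

def thread_13():         # like the callback thread in _CallPythonObject
    with mutex:          # worker holds mutex_...
        time.sleep(0.1)
        with gil:        # ...then waits for the GIL (held by thread 1)
            pass

a = threading.Thread(target=thread_1)
b = threading.Thread(target=thread_13)
a.start(); b.start()
a.join(); b.join()       # never returns: deadlock
```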
Just FYI, I am currently not working on a fix for this. Unfortunately, I am very busy at the moment and will probably be so for quite a while; also, I am not very familiar with cffi/cython. So I won't be able to help much, apart from maybe a little bit of testing/discussion.
Just wanted to point that out, in case I created a false expectation that I am working on this issue and that it therefore doesn't need somebody else to come in and fix it.
Description
There currently exists some weird behaviour with `unix-gpu` CI jobs, where the build sometimes gets aborted and other times completes fine. I've seen this multiple times on different PRs. Until today, I thought that this was caused by limited available GPU executors, with jobs getting manually aborted or aborted by some automatic priority setup in Jenkins (maybe priority goes to CI/CD for master or something).
However, I've noticed a few weird consistent things about these aborted jobs, so I wanted to make sure that the current behaviour is intentional.
1) Of the cases I've seen of `unix-gpu` getting aborted, it's almost always in a situation where all the build steps and all the other tests were completed, but there is just a single `Python 3: GPU` or `Python 3: GPU (TVM_OP OFF)` test step that was aborted.
2) Normally, these steps seem to take around 1 hour to complete. But in the cases where they were aborted, it was after 3 hours. Additionally, there is a weird jump in the logs between the time of the last log message from the test and the first message from shutting down due to the interrupt signal. These are consecutive log messages, but you can see a huge time skip between `test_operator_gpu.test_np_empty ... ok` at `14:58` and `Sending interrupt signal to process` at `17:33`.
Occurrences
Here are the 2 examples from the screenshots: http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-gpu/detail/PR-18055/4/pipeline http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-gpu/detail/PR-18054/6/pipeline and a random example, not from my PRs: http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-gpu/detail/PR-18081/1/pipeline
I am aware that we can just restart the job via mxnet-bot, but this is annoying, since the job takes a long time to complete even without this issue. Can somebody clarify: 1) whether `unix-gpu` CI jobs getting aborted is intentional (and what the current policy on aborting CI jobs etc. is); 2) if it is intentional, whether there is something we can do to at the very least abort the tests faster, or maybe not even fail these jobs but automatically reschedule them (or preferably reschedule just the aborted step, not the whole pipeline)?