xanderdunn closed this issue 1 year ago
Hello @xanderdunn, I think it could be helpful if we had some minimally reproducing code.
From inspection, there is nothing obviously incorrect about the code above, so this may require a little bit more digging from our side.
Ok thanks, I will get that minimally reproducing example set up ASAP
@jluntamazon Here is a minimal reproducing sample with setup and run instructions: https://github.com/JasnahOrg/nrt_segfault_repro
I tried this just now on a fresh trn1.2xlarge machine and it ran as described in the README. Please don't hesitate to ask me questions about it, and thank you very much for trying it!
Thank you! We will let you know what we find as soon as possible
@xanderdunn we missed a not-so-great bit of code on our side. This is what happens:
It appears that the default stack size of Rust programs is smaller, which exposes the problem. We will fix this bug in the runtime in the next release. In the meantime, the workaround of setting a larger Rust stack size should hopefully work for you.
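For anyone who prefers to apply the stack-size workaround in code rather than through an environment variable, here is a minimal sketch. It assumes the call into the runtime can be moved onto a spawned thread; `run_with_large_stack` and `stack_hungry_work` are hypothetical names for illustration, and the 4 MiB buffer only stands in for whatever actually consumes stack inside libnrt.

```rust
use std::thread;

/// 100 MiB, the same value passed via RUST_MIN_STACK elsewhere in this thread.
const STACK_BYTES: usize = 104_857_600;

/// Runs `work` on a freshly spawned thread with a large stack and returns its result.
fn run_with_large_stack<T, F>(work: F) -> T
where
    F: FnOnce() -> T + Send + 'static,
    T: Send + 'static,
{
    thread::Builder::new()
        .stack_size(STACK_BYTES)
        .spawn(work)
        .expect("failed to spawn worker thread")
        .join()
        .expect("worker thread panicked")
}

/// Hypothetical stand-in for the code path that ends up calling nrt_execute.
fn stack_hungry_work() -> usize {
    // A 4 MiB buffer on the stack, larger than Rust's documented 2 MiB default
    // for spawned threads.
    let buf = [1u8; 4 * 1024 * 1024];
    buf.iter().map(|&b| b as usize).sum()
}

fn main() {
    let total = run_with_large_stack(stack_hungry_work);
    println!("summed {total} bytes on a {STACK_BYTES}-byte stack");
}
```

Setting `RUST_MIN_STACK=104857600` in the environment has a similar effect for threads spawned by the standard library, which includes the threads `cargo test` uses to run tests, so no code change is needed there.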
running with the fix:
ubuntu@ip-10-0-10-142:~/nrt_segfault_repro$ cargo test transformer_xla_benchmark -- --show-output --nocapture
    Finished test [unoptimized + debuginfo] target(s) in 0.01s
     Running unittests src/lib.rs (target/debug/deps/xla-2d113f5803445073)

running 1 test
Done
test trn::tests::transformer_xla_benchmark ... ok

successes:

successes:
    trn::tests::transformer_xla_benchmark

test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured; 8 filtered out; finished in 28.94s
Ah that makes sense, thanks very much for figuring this out! Yes, increasing the stack size will work for us as a workaround.
For this particular test the presence of overflows and underflows is not much of a problem. Is the underflow / overflow error surfaced to the NRT C API in any way? For some tests it could be useful to know that there were no underflow / overflow errors.
A couple of things: we consider underflow/overflow to be benign and don't return errors. We do log them at INFO level, and each error type is logged only once. E.g.
$ RUST_MIN_STACK=104857600 NEURON_RT_LOG_LEVEL=INFO cargo test transformer_xla_benchmark -- --show-output --nocapture
....
2023-May-11 16:55:16.0314 30993:30994 WARN TDRV:notification_consume_error_block Received an unusually large number of error notifications (count:262126) on nd0 nc0. Performance may be negatively affected
2023-May-11 16:55:16.0318 30993:30994 WARN TDRV:notification_consume_error_block Error notifications found on nd0 nc0; action=INFER_ERROR_SUBTYPE_NONE; error_id=0; error string:TRAINIUM_NC_ERROR_TYPE_FP_UNDERFLOW
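Since underflow and overflow are only visible in the log, a test that wants to assert their absence could capture the runtime's log output and scan it. The following is a rough sketch under that assumption: it only relies on the `TRAINIUM_NC_ERROR_TYPE_` prefix seen in the line above, `numeric_error_types` is a hypothetical helper, and how the log text gets captured is left to the caller.

```rust
use std::collections::BTreeSet;

/// Prefix of the error-type names as they appear in the log line quoted above.
const PREFIX: &str = "TRAINIUM_NC_ERROR_TYPE_";

/// Collects the distinct error-type suffixes (e.g. "FP_UNDERFLOW") found in `log`.
fn numeric_error_types(log: &str) -> BTreeSet<String> {
    let mut found = BTreeSet::new();
    for line in log.lines() {
        if let Some(start) = line.find(PREFIX) {
            // Take the identifier characters that follow the prefix.
            let name: String = line[start + PREFIX.len()..]
                .chars()
                .take_while(|c| c.is_ascii_alphanumeric() || *c == '_')
                .collect();
            if !name.is_empty() {
                found.insert(name);
            }
        }
    }
    found
}

fn main() {
    let log = "2023-May-11 16:55:16.0318 30993:30994 WARN TDRV:notification_consume_error_block \
               Error notifications found on nd0 nc0; action=INFER_ERROR_SUBTYPE_NONE; error_id=0; \
               error string:TRAINIUM_NC_ERROR_TYPE_FP_UNDERFLOW";
    for name in numeric_error_types(log) {
        println!("numeric error type seen: {name}");
    }
}
```

Because each error type is logged only once, this can tell whether a type occurred at all, not how often.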
NaN is considered an error and nrt_execute() will return an error code.
NRT_EXEC_COMPLETED_WITH_NUM_ERR = 1003, // execution was completed with numerical errors (produced NaN)
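On the caller side, the NaN case can be told apart from other failures by checking the returned status. This is a minimal sketch, not the actual bindings: the value 1003 comes from the line above, while `NRT_SUCCESS = 0`, the `u32` status type, and the `ExecOutcome` and `classify` names are assumptions made for illustration.

```rust
/// Value quoted above: execution completed but produced NaN.
const NRT_EXEC_COMPLETED_WITH_NUM_ERR: u32 = 1003;
/// Assumption: success is reported as 0.
const NRT_SUCCESS: u32 = 0;

/// Hypothetical caller-side view of an nrt_execute() status code.
#[derive(Debug, PartialEq)]
enum ExecOutcome {
    Ok,
    /// Outputs were written, but they contain NaN.
    NumericalError,
    /// Any other non-zero status, passed through unchanged.
    Failed(u32),
}

/// Maps a raw status code to an outcome the test code can match on.
fn classify(status: u32) -> ExecOutcome {
    match status {
        NRT_SUCCESS => ExecOutcome::Ok,
        NRT_EXEC_COMPLETED_WITH_NUM_ERR => ExecOutcome::NumericalError,
        other => ExecOutcome::Failed(other),
    }
}

fn main() {
    for status in [0, 1003, 2] {
        println!("status {status} -> {:?}", classify(status));
    }
}
```

Keeping `NumericalError` separate from `Failed` lets a benchmark tolerate NaN outputs while a correctness test treats them as a failure.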
I upgraded to Neuron SDK 2.11 and I can confirm that the above graphs run without hitting the error, no need to increase the stack size with RUST_MIN_STACK. Thanks very much for the bug fix!
We seem to be getting an NRT_EXEC_COMPLETED_WITH_NUM_ERR on one of the graphs in this issue that we previously didn't see that error on with Neuron 2.10, so I might need to open a new issue for that. But, everything is running without RUST_MIN_STACK.
We seem to be getting an NRT_EXEC_COMPLETED_WITH_NUM_ERR on one of the graphs in this issue that we previously didn't see that error on with Neuron 2.10, so I might need to open a new issue for that. But, everything is running without RUST_MIN_STACK.
This was a mistake on my end. It's fixed, and we're not getting the NaN error. No need for a new issue! Looking good.
I am unable to reopen an issue once it has been closed (#670), so I have to open this new issue:
I'm having difficulty debugging this segfault. It's occurring within libnrt. Although it does not occur when running the same neff with neuron-bench, I have been unable to determine how our usage of the NRT C API differs. It is, as far as I can tell, in line with what's posted here.
Instrumenting our code with the address sanitizer provides no further insight because the segfault occurs within libnrt, which is not instrumented with the address sanitizer.
When I attach to the core dump with gdb this is the full backtrace I see:
This is the code that calls the NRT C API:
The segfault is inside libnrt's nrt_execute -> nrt_infer -> exec_infer_wait_one -> notification_consume_errors -> notification_consume_error_block. Do the consume_error calls indicate that an error occurred? Could I get some help in understanding this call stack and what might be happening here? Our code's usage of the NRT C API works to execute some .neffs, while others produce the above segfault.
If it helps, I could provide a minimal reproducing code repository with a one-line command to reproduce the segfault. Any additional debugging strategy ideas will be helpful.
Thanks