aws-neuron / aws-neuron-sdk


SIGSEGV Calling nrt_execute #673

Closed: xanderdunn closed this issue 1 year ago

xanderdunn commented 1 year ago

I am unable to reopen an issue once it has been closed (#670), so I have to open this new issue:

I'm having difficulty debugging this segfault. It's occurring within libnrt. Although it does not occur when running the same neff with neuron-bench, I have been unable to determine how our usage of the NRT C API differs. As far as I can tell, it is in line with what's posted here.

Instrumenting our code with the address sanitizer provides no further insight because the segfault occurs within libnrt, which is not instrumented with the address sanitizer.

When I attach to the core dump with gdb, this is the full backtrace I see:

(gdb) bt
#0  0x00007fc15fb2228a in al_mem_read_buf (src=<optimized out>, data=<optimized out>, size=<optimized out>) at /local/p4clients/pkgbuild-Zme53/workspace/src/KaenaRuntime/tdrv/hal_platform.c:100
#1  0x00007fc15fb79cfe in aws_hal_notific_nq_read_buf (size=4194016, dst=0x7fc15efa6c20, offset=<optimized out>, nq=0x7fc15e7f3538)
    at /local/p4clients/pkgbuild-amMRg/workspace/src/KaenaHal/src/sunda/notific/aws_hal_notific_nq.c:130
#2  aws_hal_notific_nq_copy_entry (written=0x7fc15f3a6b68, buf_size=<optimized out>, dst=0x7fc15efa6c20, nq=0x7fc15e7f3538)
    at /local/p4clients/pkgbuild-amMRg/workspace/src/KaenaHal/src/sunda/notific/aws_hal_notific_nq.c:361
#3  aws_hal_notific_nq_read (nq=0x7fc15e7f3538, wait=wait@entry=false, buffer=0x7fc15efa6c20, buf_size=<optimized out>, written=0x7fc15f3a6b68)
    at /local/p4clients/pkgbuild-amMRg/workspace/src/KaenaHal/src/sunda/notific/aws_hal_notific_nq.c:402
#4  0x00007fc15fb4be02 in notification_nq_read (ens_nq=<optimized out>, buf=<optimized out>, buf_size=<optimized out>, written_size=<optimized out>, wait=false)
    at /local/p4clients/pkgbuild-Zme53/workspace/src/KaenaRuntime/tdrv/notification.c:994
#5  0x00007fc15fb4db5d in notification_read_exec_queue (notif=<optimized out>, type=<optimized out>, nq=..., buffer=<optimized out>, buf_size=<optimized out>, written_size=<optimized out>)
    at /local/p4clients/pkgbuild-Zme53/workspace/src/KaenaRuntime/tdrv/notification.c:1009
#6  0x00007fc15fb4dd70 in notification_consume_error_block (mla=mla@entry=0x7fc15e7f3018, notif=notif@entry=0x7fc15e7f3080, tpb=tpb@entry=true, end_ts=end_ts@entry=22686174997668,
    consume_all=consume_all@entry=false, error_count_array=error_count_array@entry=0x7fc15f3a7060) at /local/p4clients/pkgbuild-Zme53/workspace/src/KaenaRuntime/tdrv/notification.c:1295
#7  0x00007fc15fb4e7e8 in notification_consume_errors (mla=mla@entry=0x7fc15e7f3018, tpb=tpb@entry=0x7fc15e7f3060, end_ts=22686174997668, consume_all=false,
    error_count_array=error_count_array@entry=0x7fc15f3a7060) at /local/p4clients/pkgbuild-Zme53/workspace/src/KaenaRuntime/tdrv/notification.c:1445
#8  0x00007fc15fb63aa6 in exec_infer_wait_one (mla=0x7fc15e7f3018, tpb_idx=0, mod=mod@entry=0x7fc15888f830, inference_id=inference_id@entry=0, out_info=out_info@entry=0x7fc15f3a7060)
    at /local/p4clients/pkgbuild-Zme53/workspace/src/KaenaRuntime/tdrv/exec.c:548
#9  0x00007fc15fb538ce in kbl_infer_exec_wait (mod=0x7fc15888f830, inference_id=inference_id@entry=0, start_vtpb_id=start_vtpb_id@entry=0, tpb_count=1, compute_req_idx=<optimized out>,
    out_info=out_info@entry=0x7fc15f3a7060) at /local/p4clients/pkgbuild-Zme53/workspace/src/KaenaRuntime/tdrv/tdrv.c:1397
#10 0x00007fc15fa85162 in dlr_infer (dlr_mod=dlr_mod@entry=0x7fc12392af90, inference_id=inference_id@entry=0, range=..., in_ifmap_set=in_ifmap_set@entry=0x7fc15900eaf0,
    out_ifmap_set=out_ifmap_set@entry=0x7fc1582a2d10, output_info=output_info@entry=0x7fc15f3a7060) at /local/p4clients/pkgbuild-Zme53/workspace/src/KaenaRuntime/kmgr/dlr.cpp:2242
#11 0x00007fc15fa85623 in kmgr_infer (h_nn=h_nn@entry=..., in_set=in_set@entry=0x7fc15900eaf0, out_set=out_set@entry=0x7fc1582a2d10, loop_end_value=1, loop_end_value@entry=0)
    at /local/p4clients/pkgbuild-Zme53/workspace/src/KaenaRuntime/kmgr/dlr.cpp:1759
#12 0x00007fc15f999cca in nrt_infer (repeat_count=0, out_set=0x7fc1582a2d10, in_set=0x7fc15900eaf0, model=0x7fc15900e790) at /local/p4clients/pkgbuild-Zme53/workspace/src/KaenaRuntime/nrt/nrt_exec.cpp:48
#13 nrt_execute_repeat (model=model@entry=0x7fc15900e790, input=input@entry=0x7fc15900eaf0, output=output@entry=0x7fc1582a2d10, repeat_count=repeat_count@entry=0)
    at /local/p4clients/pkgbuild-Zme53/workspace/src/KaenaRuntime/nrt/nrt_exec.cpp:69
#14 0x00007fc15f999ee8 in nrt_execute (model=0x7fc15900e790, input=0x7fc15900eaf0, output=0x7fc1582a2d10) at /local/p4clients/pkgbuild-Zme53/workspace/src/KaenaRuntime/nrt/nrt_exec.cpp:80
#15 0x000055d24a38ecf8 in xla::xla_runner::XLARunner::run_trn (self=0x55d24a715014 <<xla::tensor::ops::tests::RUNNER as core::ops::deref::Deref>::deref::__stability::LAZY+4>, xla_hlo_pb_path=..., run_name=...,
    input_names=..., inputs=..., input_shapes=..., benchmark=false) at xla/src/xla_runner.rs:282
#16 0x000055d24a38cfba in xla::xla_runner::XLARunner::run (self=0x55d24a715014 <<xla::tensor::ops::tests::RUNNER as core::ops::deref::Deref>::deref::__stability::LAZY+4>, xla_hlo_pb_path=..., run_name=...,
    input_names=..., inputs=..., input_shapes=..., benchmark=false) at xla/src/xla_runner.rs:108
#17 0x000055d24a4bf3ac in xla::nn::transformer::tests::transformer_xla_benchmark () at xla/src/nn/transformer.rs:227
#18 0x000055d24a55a537 in xla::nn::transformer::tests::transformer_xla_benchmark::{{closure}} () at xla/src/nn/transformer.rs:178
#19 0x000055d24a5447c5 in core::ops::function::FnOnce::call_once () at /rustc/c4190f2d3a46a59f435f7b42f58bc22b2f4d6917/library/core/src/ops/function.rs:250
#20 0x000055d24a5950af in core::ops::function::FnOnce::call_once () at library/core/src/ops/function.rs:250
#21 test::__rust_begin_short_backtrace () at library/test/src/lib.rs:655
#22 0x000055d24a5616dc in test::run_test::{{closure}} () at library/test/src/lib.rs:646
#23 core::ops::function::FnOnce::call_once{{vtable-shim}} () at library/core/src/ops/function.rs:250
#24 0x000055d24a593fe6 in <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once () at library/alloc/src/boxed.rs:1985
#25 <core::panic::unwind_safe::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once () at library/core/src/panic/unwind_safe.rs:271
#26 std::panicking::try::do_call () at library/std/src/panicking.rs:485
#27 std::panicking::try () at library/std/src/panicking.rs:449
#28 std::panic::catch_unwind () at library/std/src/panic.rs:140
#29 test::run_test_in_process () at library/test/src/lib.rs:678
#30 test::run_test::run_test_inner::{{closure}} () at library/test/src/lib.rs:572
#31 0x000055d24a55bd88 in test::run_test::run_test_inner::{{closure}} () at library/test/src/lib.rs:599
#32 std::sys_common::backtrace::__rust_begin_short_backtrace () at library/std/src/sys_common/backtrace.rs:134
#33 0x000055d24a56134b in std::thread::Builder::spawn_unchecked_::{{closure}}::{{closure}} () at library/std/src/thread/mod.rs:529
#34 <core::panic::unwind_safe::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once () at library/core/src/panic/unwind_safe.rs:271
#35 std::panicking::try::do_call () at library/std/src/panicking.rs:485
#36 std::panicking::try () at library/std/src/panicking.rs:449
#37 std::panic::catch_unwind () at library/std/src/panic.rs:140
#38 std::thread::Builder::spawn_unchecked_::{{closure}} () at library/std/src/thread/mod.rs:528
#39 core::ops::function::FnOnce::call_once{{vtable-shim}} () at library/core/src/ops/function.rs:250
#40 0x000055d24a6350a5 in <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once () at library/alloc/src/boxed.rs:1985
#41 <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once () at library/alloc/src/boxed.rs:1985
#42 std::sys::unix::thread::Thread::new::thread_start () at library/std/src/sys/unix/thread.rs:108
#43 0x00007fc15f5cb609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#44 0x00007fc15f854133 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

This is the code that calls the NRT C API:

            // Requires: use std::fs::File; use std::io::Read;
            // Read NEFF file into a byte vector
            let mut neff_file = File::open(neff_path.clone())
                .unwrap_or_else(|_| panic!("Unable to open NEFF file {}", neff_path));
            let mut neff_data: Vec<u8> = Vec::new();
            neff_file
                .read_to_end(&mut neff_data)
                .expect("Unable to read NEFF file");
            let neff_size = neff_data.len();

            // Load the model
            let mut model: *mut nrt::nrt_model_t = std::ptr::null_mut();
            assert_eq!(model, std::ptr::null_mut());
            assert!(model.is_null());
            let result = unsafe {
                nrt::nrt_load(
                    neff_data.as_ptr() as *const _,
                    neff_size,
                    0, // neuron core index to start from
                    1, // number of neuron cores to allocate the model to
                    &mut model as *mut *mut nrt::nrt_model_t,
                )
            };
            assert!(!model.is_null());
            assert_eq!(result, nrt::NRT_STATUS_NRT_SUCCESS);

            // Allocate input and output tensors
            let mut tensor_info_array: *mut nrt::nrt_tensor_info_array_t = std::ptr::null_mut();
            assert_eq!(tensor_info_array, std::ptr::null_mut());
            assert!(tensor_info_array.is_null());
            let result = unsafe {
                nrt::nrt_get_model_tensor_info(
                    model,
                    &mut tensor_info_array as *mut *mut nrt::nrt_tensor_info_array_t,
                )
            };
            assert!(!tensor_info_array.is_null());
            assert_eq!(result, nrt::NRT_STATUS_NRT_SUCCESS);

            let nrt_inputs = unsafe {
                allocate_tensors(
                    tensor_info_array,
                    nrt::nrt_tensor_usage_NRT_TENSOR_USAGE_INPUT,
                )
            };
            let mut nrt_inputs = nrt_inputs.expect("Error allocating input tensors");
            let outputs = unsafe {
                allocate_tensors(
                    tensor_info_array,
                    nrt::nrt_tensor_usage_NRT_TENSOR_USAGE_OUTPUT,
                )
            };
            let mut outputs = outputs.expect("Error allocating output tensors");

            // Note that even if input parameters are not initialized, it will
            // still run and it will still produce values.
            if !inputs.is_empty() {
                let result = unsafe {
                    load_tensor_values(
                        nrt_inputs,
                        tensor_info_array,
                        nrt::nrt_tensor_usage_NRT_TENSOR_USAGE_INPUT,
                        inputs,
                    )
                };
                result.expect("Error loading input tensor values");
            }

            // Run it
            assert!(!model.is_null());
            assert!(!nrt_inputs.is_null());
            assert!(!outputs.is_null());
            let result = unsafe { nrt::nrt_execute(model, nrt_inputs, outputs) }; // segfault here
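
The allocate_tensors and load_tensor_values helpers above are ours and not shown here. For context, here is a minimal sketch of what allocate_tensors might look like, assuming the documented NRT tensor-set calls (nrt_allocate_tensor_set, nrt_tensor_allocate, nrt_add_tensor_to_tensor_set) and extrapolating the bindgen-style constant and field names from the snippet above; this is illustrative, not the actual repro code:

    // Hypothetical sketch of allocate_tensors: walk the model's tensor info
    // array, allocate one device tensor per entry with the matching usage,
    // and collect the tensors into a tensor set.
    unsafe fn allocate_tensors(
        info_array: *mut nrt::nrt_tensor_info_array_t,
        usage: nrt::nrt_tensor_usage_t,
    ) -> Result<*mut nrt::nrt_tensor_set_t, nrt::NRT_STATUS> {
        let mut set: *mut nrt::nrt_tensor_set_t = std::ptr::null_mut();
        let status = nrt::nrt_allocate_tensor_set(&mut set);
        if status != nrt::NRT_STATUS_NRT_SUCCESS {
            return Err(status);
        }
        for i in 0..(*info_array).tensor_count as usize {
            // The accessor for the flexible array member depends on how nrt.h
            // was bound; bindgen typically exposes it via as_ptr() on the field.
            let info = &*(*info_array).tensor_array.as_ptr().add(i);
            if info.usage != usage {
                continue;
            }
            let mut tensor: *mut nrt::nrt_tensor_t = std::ptr::null_mut();
            // Place the tensor in device memory on logical NeuronCore 0; the
            // placement constant name is assumed from the enum naming pattern.
            let status = nrt::nrt_tensor_allocate(
                nrt::nrt_tensor_placement_NRT_TENSOR_PLACEMENT_DEVICE,
                0,
                info.size,
                info.name.as_ptr(),
                &mut tensor,
            );
            if status != nrt::NRT_STATUS_NRT_SUCCESS {
                return Err(status);
            }
            let status = nrt::nrt_add_tensor_to_tensor_set(set, info.name.as_ptr(), tensor);
            if status != nrt::NRT_STATUS_NRT_SUCCESS {
                return Err(status);
            }
        }
        Ok(set)
    }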

The segfault is inside libnrt's nrt_execute -> nrt_infer -> exec_infer_wait_one -> notification_consume_errors -> notification_consume_error_block. Do the consume_error calls indicate that an error occurred? Could I get some help understanding this call stack and what might be happening here? Our usage of the NRT C API successfully executes some .neffs, while others produce the above segfault.

If it helps, I could provide a minimal reproducing code repository with a one line command to reproduce the segfault. Any additional debugging strategy ideas will be helpful.

Thanks

jluntamazon commented 1 year ago

Hello @xanderdunn, I think it could be helpful if we had some minimally reproducing code.

From inspection, there is nothing obviously incorrect about the code above, so this may require a little bit more digging from our side.

xanderdunn commented 1 year ago

Ok thanks, I will get that minimally reproducing example set up ASAP

xanderdunn commented 1 year ago

@jluntamazon Here is a minimal reproducing sample with setup and run instructions: https://github.com/JasnahOrg/nrt_segfault_repro

I tried this just now on a fresh trn1.2xlarge machine and it ran as described in the README. Please don't hesitate to ask me questions about it, and thank you very much for trying it!

jluntamazon commented 1 year ago

Thank you! We will let you know what we find as soon as possible

awsilya commented 1 year ago

@xanderdunn we missed a not-so-great bit of code on our side ... this is what happens:

  1. Your network generates a large number of error notifications (overflow/underflow), which is pretty much expected given random inputs.
  2. The runtime reads all of them at once and allocates space for them on the stack.

It appears that the default stack size of Rust threads is smaller, which exposes the problem. We will fix this bug in the runtime in the next release; in the meantime, the workaround of setting a larger Rust stack size should hopefully work for you.
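
As a concrete illustration of that workaround, a minimal sketch: since RUST_MIN_STACK (shown later in this thread) only changes the default stack size of spawned threads, an equivalent in-code approach is to run the NRT-calling work on a thread built with an explicit stack size. The 100 MB figure mirrors the RUST_MIN_STACK value used below and is an assumption, not a documented requirement.

    use std::thread;

    fn main() {
        // Run the code that calls into libnrt on a thread with a larger stack,
        // so the runtime's on-stack notification buffer fits. 100 MB mirrors
        // the RUST_MIN_STACK=104857600 value used elsewhere in this thread.
        let handle = thread::Builder::new()
            .stack_size(100 * 1024 * 1024)
            .spawn(|| {
                // ... nrt_load / nrt_execute calls go here ...
            })
            .expect("failed to spawn worker thread");
        handle.join().expect("worker thread panicked");
    }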

awsilya commented 1 year ago

running with the fix:

ubuntu@ip-10-0-10-142:~/nrt_segfault_repro$ cargo test transformer_xla_benchmark -- --show-output --nocapture
    Finished test [unoptimized + debuginfo] target(s) in 0.01s
     Running unittests src/lib.rs (target/debug/deps/xla-2d113f5803445073)

running 1 test
Done
test trn::tests::transformer_xla_benchmark ... ok

successes:

successes:
    trn::tests::transformer_xla_benchmark

test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured; 8 filtered out; finished in 28.94s

xanderdunn commented 1 year ago

Ah that makes sense, thanks very much for figuring this out! Yes, increasing the stack size will work for us as a workaround.

For this particular test the presence of overflows and underflows is not much of a problem. Is the underflow / overflow error surfaced to the NRT C API in any way? For some tests it could be useful to know that there were no underflow / overflow errors.

awsilya commented 1 year ago

A couple of things: we consider underflow/overflow to be benign and don't return errors. We do log them at INFO level; each error type is only logged once. E.g.

$ RUST_MIN_STACK=104857600 NEURON_RT_LOG_LEVEL=INFO cargo test transformer_xla_benchmark -- --show-output --nocapture
....
2023-May-11 16:55:16.0314 30993:30994  WARN  TDRV:notification_consume_error_block        Received an unusually large number of error notifications (count:262126) on nd0 nc0. Performance may be negatively affected
2023-May-11 16:55:16.0318 30993:30994  WARN  TDRV:notification_consume_error_block        Error notifications found on nd0 nc0; action=INFER_ERROR_SUBTYPE_NONE; error_id=0; error string:TRAINIUM_NC_ERROR_TYPE_FP_UNDERFLOW 

NaN is considered an error and nrt_execute() will return an error code.

NRT_EXEC_COMPLETED_WITH_NUM_ERR = 1003, // execution was completed with numerical errors (produced NaN)
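
For example, a test that wants to assert numerical cleanliness could branch on the return value of nrt_execute. A minimal sketch, assuming the bindgen-generated constant follows the same NRT_STATUS_ naming pattern as the NRT_STATUS_NRT_SUCCESS constant used earlier in this issue:

    // Sketch: distinguish clean success from completed-with-NaN. The constant
    // name NRT_STATUS_NRT_EXEC_COMPLETED_WITH_NUM_ERR is an assumption based
    // on the bindgen pattern of NRT_STATUS_NRT_SUCCESS above.
    let result = unsafe { nrt::nrt_execute(model, nrt_inputs, outputs) };
    if result == nrt::NRT_STATUS_NRT_SUCCESS {
        // Outputs are valid and no NaN was produced.
    } else if result == nrt::NRT_STATUS_NRT_EXEC_COMPLETED_WITH_NUM_ERR {
        // Execution completed but produced NaN somewhere; fail tests that
        // require numerically clean outputs.
        panic!("nrt_execute completed with numerical errors (NaN)");
    } else {
        panic!("nrt_execute failed with status {}", result);
    }
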
xanderdunn commented 1 year ago

I upgraded to Neuron SDK 2.11 and I can confirm that the above graphs run without hitting the error; there is no need to increase the stack size with RUST_MIN_STACK. Thanks very much for the bug fix!

We seem to be getting an NRT_EXEC_COMPLETED_WITH_NUM_ERR on one of the graphs in this issue that didn't produce that error with Neuron 2.10, so I might need to open a new issue for that. But everything is running without RUST_MIN_STACK.

xanderdunn commented 1 year ago

> We seem to be getting an NRT_EXEC_COMPLETED_WITH_NUM_ERR on one of the graphs in this issue that didn't produce that error with Neuron 2.10, so I might need to open a new issue for that. But everything is running without RUST_MIN_STACK.

This was a mistake on my end; it's fixed and we're not getting the NaN error. No need for a new issue! Looking good.