aws-neuron / aws-neuron-sdk

Powering AWS purpose-built machine learning chips. Blazing fast and cost-effective, natively integrated into PyTorch and TensorFlow, and integrated with your favorite AWS services.
https://aws.amazon.com/machine-learning/neuron/

SIGSEGV Calling nrt_execute #670

Closed xanderdunn closed 1 year ago

xanderdunn commented 1 year ago

When I attempt to call nrt_execute on either of these XLA graphs, I get a SIGSEGV. These are Transformers with parameters (n_context, n_layers, d_model, n_heads):

The core dumps were too large to attach to a GitHub issue but I can provide them if it would be useful.

The same code works to execute very similar graphs, for example this is the same Transformer model but with a smaller context size, and it runs without a SIGSEGV:

I am making this call from a Rust program that calls nrt_execute in libnrt through a C Foreign Function Interface (FFI), based on the docs. The call is very simple:

extern "C" {
    #[doc = " Execute given model with given inputs and collect outputs.\n\n @param model[in] - Model to execute.\n @param input_set[in] - Set of input tensors.\n @param output_set[in] - Set of output tensors.\n\n @return NRT_STATUS_SUCCESS on success."]
    pub fn nrt_execute(
        model: *mut nrt_model_t,
        input_set: *const nrt_tensor_set_t,
        output_set: *mut nrt_tensor_set_t,
    ) -> NRT_STATUS;
}

let result = unsafe { nrt::nrt_execute(model, nrt_inputs, outputs) };
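
As a minimal sketch (not part of the original code), the call can be wrapped so the returned status is actually checked instead of discarded. NRT_STATUS_SUCCESS here follows the name used in the header comment above; the exact constant name in the generated bindings may differ:

// Hypothetical wrapper, not from the original program: it surfaces a failed
// status instead of ignoring it. NRT_STATUS_SUCCESS is assumed to match the
// success value named in the header comment; adjust to the generated bindings.
fn execute_checked(
    model: *mut nrt::nrt_model_t,
    inputs: *const nrt::nrt_tensor_set_t,
    outputs: *mut nrt::nrt_tensor_set_t,
) -> Result<(), nrt::NRT_STATUS> {
    // SAFETY: the pointers must come from the corresponding nrt_* load and
    // tensor-set allocation calls and remain valid for the duration of the call.
    let status = unsafe { nrt::nrt_execute(model, inputs, outputs) };
    if status == nrt::NRT_STATUS_SUCCESS {
        Ok(())
    } else {
        Err(status)
    }
}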

I ran these tests on a trn1.2xlarge instance.

$ neuronx-cc --version
NeuronX Compiler version 2.6.0.19+3d819e565

Python version 3.8.10
HWM version 2.6.0.0-826e77395
NEFF version Dynamic
TVM not available
NumPy version 1.21.6
MXNet not available
$ sudo apt list --installed | grep neuron
aws-neuronx-collectives/unknown,now 2.13.7.0-954e2e19b amd64 [installed]
aws-neuronx-dkms/unknown,now 2.9.4.0 amd64 [installed]
aws-neuronx-oci-hook/unknown,now 2.2.0.0 amd64 [installed]
aws-neuronx-runtime-lib/unknown,now 2.13.6.0-29de104d6 amd64 [installed]
aws-neuronx-tools/unknown,now 2.10.1.0 amd64 [installed]

Taking a look at one of the core dumps:

$ coredumpctl gdb -1
           PID: 5246 (xla-3f4077dacd5)
           UID: 1000 (ubuntu)
           GID: 1000 (ubuntu)
        Signal: 11 (SEGV)
     Timestamp: Thu 2023-05-04 13:14:57 EDT (4min 56s ago)
  Command Line: /home/ubuntu/dev/Kholinar/target/debug/deps/xla-3f4077dacd51f8e1 transformer_xla_benchmark --show-output --nocapture --include-ignored
    Executable: /home/ubuntu/dev/Kholinar/target/debug/deps/xla-3f4077dacd51f8e1
 Control Group: /user.slice/user-1000.slice/session-1.scope
          Unit: session-1.scope
         Slice: user-1000.slice
       Session: 1
     Owner UID: 1000 (ubuntu)
       Boot ID: a711b977c2984109943da7a5c6c17f2b
    Machine ID: ec2f1b16a906cd0bc8a5d7d08decee75
      Hostname: xander-trainium
       Storage: /var/lib/systemd/coredump/core.xla-3f4077dacd5.1000.a711b977c2984109943da7a5c6c17f2b.5246.1683220497000000000000.lz4
       Message: Process 5246 (xla-3f4077dacd5) of user 1000 dumped core.

                Stack trace of thread 5247:
                #0  0x00007f07882c5d6b notification_consume_error_block (libnrt.so.1 + 0x20fd6b)
                #1  0x00007f07882c67e8 notification_consume_errors (libnrt.so.1 + 0x2107e8)
                #2  0x00007f07882dbaa6 exec_infer_wait_one (libnrt.so.1 + 0x225aa6)
                #3  0x00007f07882cb8ce kbl_infer_exec_wait (libnrt.so.1 + 0x2158ce)
                #4  0x00007f07881fd162 _Z9dlr_inferP14dlr_kelf_modelm10vtpb_rangePK2htPS2_P15kbl_output_info (libnrt.so.1 + 0x147162)
                #5  0x00007f0788213058 _Z10exec_modelPK14kelf_node_infomP11top_node_ioPN3tvm7runtime10grt_tensorE (libnrt.so.1 + 0x15d058)
                #6  0x00007f07882063c0 _ZN3tvm7runtime12GraphRuntime3RunEmPdS2_ (libnrt.so.1 + 0x1503c0)
                #7  0x00007f07881fc299 dlr_run_graph (libnrt.so.1 + 0x146299)
                #8  0x00007f07881fe05c kmgr_infer (libnrt.so.1 + 0x14805c)
                #9  0x00007f0788111cca nrt_infer (libnrt.so.1 + 0x5bcca)
                #10 0x00005622de0ff672 n/a (/home/ubuntu/dev/Kholinar/target/debug/deps/xla-3f4077dacd51f8e1 + 0x1d5672)

I've found that if I increase the size of the stack, it avoids the SIGSEGV: RUST_MIN_STACK=104857600 cargo test. This raises the default stack size for spawned threads, which is where the test harness runs each test, to 100 MiB; a per-thread alternative is sketched below. Note that I'm running a single test, and nothing else is running on the Neuron devices at the same time.
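
If only the inference path needs the extra headroom, an equivalent workaround is to run that call on a thread with an explicit stack size instead of raising the default for every spawned thread with RUST_MIN_STACK. This is a minimal sketch, not code from the original report; the closure passed in would be whatever sets up the tensor sets and calls nrt_execute:

use std::thread;

// Sketch: spawn a dedicated thread with a 100 MiB stack (matching
// RUST_MIN_STACK=104857600) and run the inference closure on it.
fn run_on_large_stack<F>(inference: F)
where
    F: FnOnce() + Send + 'static,
{
    let handle = thread::Builder::new()
        .name("nrt-execute".to_string())
        .stack_size(100 * 1024 * 1024)
        .spawn(inference)
        .expect("failed to spawn inference thread");
    handle.join().expect("inference thread panicked");
}

The test body could then be wrapped in run_on_large_stack(|| { ... }) so only that path pays for the larger stack.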

Is it expected that nrt_execute might attempt to allocate large objects on the stack? When you try to execute the graphs attached to this issue, do you also hit the SIGSEGV? Thanks!

awsilya commented 1 year ago

@xanderdunn no, we should not require a crazy large stack. Let me try running your neff with one of our tools.

awsilya commented 1 year ago

@xanderdunn I was able to execute your neffs using our test tool. Incidentally, it shipped with our most recent release in the aws-neuronx-tools package. While its main purpose is performance measurement, it is also handy for running quick tests.

$ neuron-bench infer --fixed-instance-count 1 --enable-only-latency -n 2 --verbose 4 ./transformer_xla_benchmark_9598777143103386534.neff
....
INFO[0037] Writing results file=/tmp/nb-results-260606486/transformer_xla_benchmark_9598777143103386534_dynamic_nc1_b1_i1_LIBMODE/info.json
INFO[0037] Writing latencies file1=/tmp/nb-results-260606486/transformer_xla_benchmark_9598777143103386534_dynamic_nc1_b1_i1_LIBMODE/latency_data.json file2=/tmp/nb-results-260606486/transformer_xla_benchmark_9598777143103386534_dynamic_nc1_b1_i1_LIBMODE/nc_latency_data.json

transformer_xla_benchmark_9598777143103386534
B  NC  NC USED  WEIGHTS  MODE     INF/S  IRES/S  L(1)    L(50)   L(99)   NCL(1)  NCL(50)  NCL(99)  %USER
1  1   1        dynamic  LIBMODE  5.08   5.08    196844  196844  196844  175808  175808   175808   N/A

xanderdunn commented 1 year ago

Thank you! I'm taking off traveling but will try neuron-bench in a couple of days; this tool looks very useful.

This seems to indicate that the issue is either in my usage of the NRT SDK or in some strangeness in the Rust -> C FFI. I will investigate further.

awsilya commented 1 year ago

@xanderdunn I'm going to close this one, but feel free to reopen if you find anything interesting. I glanced through our code and did not see any obvious issues. We have had cases before that generated a large number of error notifications, and the code handled them correctly.

xanderdunn commented 1 year ago

Confirmed: I do not see a SIGSEGV when running the same .neff with neuron-bench. I'm still investigating the cause in my code. The core dump shows that the segfault is happening inside libnrt's nrt_infer:

                #0  0x00007f07882c5d6b notification_consume_error_block (libnrt.so.1 + 0x20fd6b)
                #1  0x00007f07882c67e8 notification_consume_errors (libnrt.so.1 + 0x2107e8)
                #2  0x00007f07882dbaa6 exec_infer_wait_one (libnrt.so.1 + 0x225aa6)
                #3  0x00007f07882cb8ce kbl_infer_exec_wait (libnrt.so.1 + 0x2158ce)
                #4  0x00007f07881fd162 _Z9dlr_inferP14dlr_kelf_modelm10vtpb_rangePK2htPS2_P15kbl_output_info (libnrt.so.1 + 0x147162)
                #5  0x00007f0788213058 _Z10exec_modelPK14kelf_node_infomP11top_node_ioPN3tvm7runtime10grt_tensorE (libnrt.so.1 + 0x15d058)
                #6  0x00007f07882063c0 _ZN3tvm7runtime12GraphRuntime3RunEmPdS2_ (libnrt.so.1 + 0x1503c0)
                #7  0x00007f07881fc299 dlr_run_graph (libnrt.so.1 + 0x146299)
                #8  0x00007f07881fe05c kmgr_infer (libnrt.so.1 + 0x14805c)
                #9  0x00007f0788111cca nrt_infer (libnrt.so.1 + 0x5bcca)

So I must be setting it up or calling it differently than neuron-bench does.

gdb backtrace:

(gdb) bt
#0  0x00007fb9c34abd6b in notification_consume_error_block (mla=mla@entry=0x7fb9c0cf8018, notif=notif@entry=0x7fb9c0cf8080, tpb=tpb@entry=true, end_ts=end_ts@entry=1909920831251,
    consume_all=consume_all@entry=false, error_count_array=error_count_array@entry=0x7fb9c2d05520) at /local/p4clients/pkgbuild-Zme53/workspace/src/KaenaRuntime/tdrv/notification.c:1295
#1  0x00007fb9c34ac7e8 in notification_consume_errors (mla=mla@entry=0x7fb9c0cf8018, tpb=tpb@entry=0x7fb9c0cf8060, end_ts=1909920831251, consume_all=false, error_count_array=error_count_array@entry=0x7fb9c2d05520)
    at /local/p4clients/pkgbuild-Zme53/workspace/src/KaenaRuntime/tdrv/notification.c:1445
#2  0x00007fb9c34c1aa6 in exec_infer_wait_one (mla=0x7fb9c0cf8018, tpb_idx=0, mod=mod@entry=0x7fb9bd1f62a0, inference_id=inference_id@entry=0, out_info=out_info@entry=0x7fb9c2d05520)
    at /local/p4clients/pkgbuild-Zme53/workspace/src/KaenaRuntime/tdrv/exec.c:548
#3  0x00007fb9c34b18ce in kbl_infer_exec_wait (mod=0x7fb9bd1f62a0, inference_id=inference_id@entry=0, start_vtpb_id=start_vtpb_id@entry=0, tpb_count=1, compute_req_idx=<optimized out>,
    out_info=out_info@entry=0x7fb9c2d05520) at /local/p4clients/pkgbuild-Zme53/workspace/src/KaenaRuntime/tdrv/tdrv.c:1397
#4  0x00007fb9c33e3162 in dlr_infer (dlr_mod=dlr_mod@entry=0x7fb9bcbb6250, inference_id=inference_id@entry=0, range=..., in_ifmap_set=in_ifmap_set@entry=0x7fb9bcba1970,
    out_ifmap_set=out_ifmap_set@entry=0x7fb9bc281df0, output_info=output_info@entry=0x7fb9c2d05520) at /local/p4clients/pkgbuild-Zme53/workspace/src/KaenaRuntime/kmgr/dlr.cpp:2242
#5  0x00007fb9c33e3623 in kmgr_infer (h_nn=h_nn@entry=..., in_set=in_set@entry=0x7fb9bcba1970, out_set=out_set@entry=0x7fb9bc281df0, loop_end_value=1, loop_end_value@entry=0)
    at /local/p4clients/pkgbuild-Zme53/workspace/src/KaenaRuntime/kmgr/dlr.cpp:1759
#6  0x00007fb9c32f7cca in nrt_infer (repeat_count=0, out_set=0x7fb9bc281df0, in_set=0x7fb9bcba1970, model=0x7fb9bcba12b0) at /local/p4clients/pkgbuild-Zme53/workspace/src/KaenaRuntime/nrt/nrt_exec.cpp:48
#7  nrt_execute_repeat (model=model@entry=0x7fb9bcba12b0, input=input@entry=0x7fb9bcba1970, output=output@entry=0x7fb9bc281df0, repeat_count=repeat_count@entry=0)
    at /local/p4clients/pkgbuild-Zme53/workspace/src/KaenaRuntime/nrt/nrt_exec.cpp:69
#8  0x00007fb9c32f7ee8 in nrt_execute (model=0x7fb9bcba12b0, input=0x7fb9bcba1970, output=0x7fb9bc281df0) at /local/p4clients/pkgbuild-Zme53/workspace/src/KaenaRuntime/nrt/nrt_exec.cpp:80

Does the fact that notification_consume_errors and notification_consume_error_block are being called indicate that an error occurred within libnrt?