[Closed] xanderdunn closed this issue 1 year ago.
@xanderdunn no, we should not require a crazy large stack. Let me try running your neff with one of our tools.
@xanderdunn I was able to execute your neffs using our test tool. Incidentally, it shipped with our most recent release in the aws-neuronx-tools package. While its main purpose is performance measurement, it is also handy for running quick tests.
$ neuron-bench infer --fixed-instance-count 1 --enable-only-latency -n 2 --verbose 4 ./transformer_xla_benchmark_9598777143103386534.neff
....
INFO[0037] Writing results file=/tmp/nb-results-260606486/transformer_xla_benchmark_9598777143103386534_dynamic_nc1_b1_i1_LIBMODE/info.json
INFO[0037] Writing latencies file1=/tmp/nb-results-260606486/transformer_xla_benchmark_9598777143103386534_dynamic_nc1_b1_i1_LIBMODE/latency_data.json file2=/tmp/nb-results-260606486/transformer_xla_benchmark_9598777143103386534_dynamic_nc1_b1_i1_LIBMODE/nc_latency_data.json

transformer_xla_benchmark_9598777143103386534
+---+----+---------+---------+---------+-------+--------+--------+--------+--------+--------+---------+---------+-------+
  B   NC  NC USED   WEIGHTS   MODE      INF/S   IRES/S   L(1)     L(50)    L(99)    NCL(1)   NCL(50)   NCL(99)    %USER
  1   1   1         dynamic   LIBMODE   5.08    5.08     196844   196844   196844   175808   175808    175808     N/A
+---+----+---------+---------+---------+-------+--------+--------+--------+--------+--------+---------+---------+-------+
Thank you! I'm taking off traveling but will try neuron-bench in a couple of days; this tool looks very useful.
This seems to indicate that the issue is either in my usage of the NRT SDK, or some strangeness in the Rust-to-C FFI. I will investigate further.
@xanderdunn I'm going to close this one, but feel free to reopen if you find anything interesting. I glanced through our code and did not see any obvious issues. We have had cases before that generated a large number of error notifications, and the code handled them correctly.
Confirmed: I do not see a SIGSEGV when running the same .neff with neuron-bench. Still investigating the cause in my code. The core dump shows that the segfault is happening inside libnrt's nrt_infer:
#0 0x00007f07882c5d6b notification_consume_error_block (libnrt.so.1 + 0x20fd6b)
#1 0x00007f07882c67e8 notification_consume_errors (libnrt.so.1 + 0x2107e8)
#2 0x00007f07882dbaa6 exec_infer_wait_one (libnrt.so.1 + 0x225aa6)
#3 0x00007f07882cb8ce kbl_infer_exec_wait (libnrt.so.1 + 0x2158ce)
#4 0x00007f07881fd162 _Z9dlr_inferP14dlr_kelf_modelm10vtpb_rangePK2htPS2_P15kbl_output_info (libnrt.so.1 + 0x147162)
#5 0x00007f0788213058 _Z10exec_modelPK14kelf_node_infomP11top_node_ioPN3tvm7runtime10grt_tensorE (libnrt.so.1 + 0x15d058)
#6 0x00007f07882063c0 _ZN3tvm7runtime12GraphRuntime3RunEmPdS2_ (libnrt.so.1 + 0x1503c0)
#7 0x00007f07881fc299 dlr_run_graph (libnrt.so.1 + 0x146299)
#8 0x00007f07881fe05c kmgr_infer (libnrt.so.1 + 0x14805c)
#9 0x00007f0788111cca nrt_infer (libnrt.so.1 + 0x5bcca)
So I must be setting it up / calling it differently than neuron-bench.
gdb backtrace:
(gdb) bt
#0 0x00007fb9c34abd6b in notification_consume_error_block (mla=mla@entry=0x7fb9c0cf8018, notif=notif@entry=0x7fb9c0cf8080, tpb=tpb@entry=true, end_ts=end_ts@entry=1909920831251,
consume_all=consume_all@entry=false, error_count_array=error_count_array@entry=0x7fb9c2d05520) at /local/p4clients/pkgbuild-Zme53/workspace/src/KaenaRuntime/tdrv/notification.c:1295
#1 0x00007fb9c34ac7e8 in notification_consume_errors (mla=mla@entry=0x7fb9c0cf8018, tpb=tpb@entry=0x7fb9c0cf8060, end_ts=1909920831251, consume_all=false, error_count_array=error_count_array@entry=0x7fb9c2d05520)
at /local/p4clients/pkgbuild-Zme53/workspace/src/KaenaRuntime/tdrv/notification.c:1445
#2 0x00007fb9c34c1aa6 in exec_infer_wait_one (mla=0x7fb9c0cf8018, tpb_idx=0, mod=mod@entry=0x7fb9bd1f62a0, inference_id=inference_id@entry=0, out_info=out_info@entry=0x7fb9c2d05520)
at /local/p4clients/pkgbuild-Zme53/workspace/src/KaenaRuntime/tdrv/exec.c:548
#3 0x00007fb9c34b18ce in kbl_infer_exec_wait (mod=0x7fb9bd1f62a0, inference_id=inference_id@entry=0, start_vtpb_id=start_vtpb_id@entry=0, tpb_count=1, compute_req_idx=<optimized out>,
out_info=out_info@entry=0x7fb9c2d05520) at /local/p4clients/pkgbuild-Zme53/workspace/src/KaenaRuntime/tdrv/tdrv.c:1397
#4 0x00007fb9c33e3162 in dlr_infer (dlr_mod=dlr_mod@entry=0x7fb9bcbb6250, inference_id=inference_id@entry=0, range=..., in_ifmap_set=in_ifmap_set@entry=0x7fb9bcba1970,
out_ifmap_set=out_ifmap_set@entry=0x7fb9bc281df0, output_info=output_info@entry=0x7fb9c2d05520) at /local/p4clients/pkgbuild-Zme53/workspace/src/KaenaRuntime/kmgr/dlr.cpp:2242
#5 0x00007fb9c33e3623 in kmgr_infer (h_nn=h_nn@entry=..., in_set=in_set@entry=0x7fb9bcba1970, out_set=out_set@entry=0x7fb9bc281df0, loop_end_value=1, loop_end_value@entry=0)
at /local/p4clients/pkgbuild-Zme53/workspace/src/KaenaRuntime/kmgr/dlr.cpp:1759
#6 0x00007fb9c32f7cca in nrt_infer (repeat_count=0, out_set=0x7fb9bc281df0, in_set=0x7fb9bcba1970, model=0x7fb9bcba12b0) at /local/p4clients/pkgbuild-Zme53/workspace/src/KaenaRuntime/nrt/nrt_exec.cpp:48
#7 nrt_execute_repeat (model=model@entry=0x7fb9bcba12b0, input=input@entry=0x7fb9bcba1970, output=output@entry=0x7fb9bc281df0, repeat_count=repeat_count@entry=0)
at /local/p4clients/pkgbuild-Zme53/workspace/src/KaenaRuntime/nrt/nrt_exec.cpp:69
#8 0x00007fb9c32f7ee8 in nrt_execute (model=0x7fb9bcba12b0, input=0x7fb9bcba1970, output=0x7fb9bc281df0) at /local/p4clients/pkgbuild-Zme53/workspace/src/KaenaRuntime/nrt/nrt_exec.cpp:80
Does the calling of notification_consume_errors and notification_consume_error_block indicate that an error occurred within libnrt?
When I attempt to call nrt_execute on either of these XLA graphs, I get a SIGSEGV. These are Transformers with parameters (n_context, n_layers, d_model, n_heads):

The core dumps were too large to attach to a GitHub issue, but I can provide them if it would be useful.
The same code works to execute very similar graphs, for example this is the same Transformer model but with a smaller context size, and it runs without a SIGSEGV:
I am making this call from a Rust program that calls nrt_execute in libnrt through a Foreign Function Interface, based on the docs. The call is very simple:

I ran these tests on a trn1.2xlarge instance.
Taking a look at one of the core dumps:
I've found that if I increase the size of the stack, it avoids the SIGSEGV: RUST_MIN_STACK=104857600 cargo test. This increases the stack size to 100 MiB (104,857,600 bytes). Note that I'm running a single test; nothing else is running on the Neuron devices at the same time.

Is it expected that nrt_execute might attempt to allocate large objects on the stack? When you try to execute the graphs attached to this issue, do you also hit the SIGSEGV? Thanks!
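The RUST_MIN_STACK workaround enlarges every test thread. A more scoped alternative is to run the FFI call on a thread spawned with an explicit stack size via `std::thread::Builder`. A minimal, self-contained sketch (a plain large local array stands in for whatever nrt_execute might place on the stack; the 4 MiB / 16 MiB sizes are arbitrary illustration values):

```rust
use std::thread;

// A function with a deliberately large stack frame, standing in for a
// callee that allocates big objects on the stack.
fn big_frame() -> u8 {
    let buf = [0u8; 4 * 1024 * 1024]; // ~4 MiB local array
    buf[buf.len() - 1]
}

fn main() {
    // Spawn a worker with a 16 MiB stack so big_frame() has headroom,
    // regardless of the default stack size of the calling (test) thread.
    let handle = thread::Builder::new()
        .stack_size(16 * 1024 * 1024)
        .spawn(big_frame)
        .expect("failed to spawn thread");
    let v = handle.join().expect("thread panicked");
    assert_eq!(v, 0);
    println!("big_frame ran without overflowing the enlarged stack");
}
```

This confines the large-stack requirement to the one thread that actually calls into libnrt, instead of raising the minimum for the whole test harness.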