Segmentation fault(core dump) while running inference

I met some errors when I deploy my application to a docker container and run inference(load model successfully)

enviroment:

instance_type: inf1.6xlarge
serving framework: DJL 1.20.0
jdk version: java 11

main steps in dockerfile (mofied with reference: https://awsdocs-neuron.readthedocs-hosted.com/en/latest/containers/docker-example/inference/Dockerfile-libmode.html#libmode-dockerfile):

FROM registry2.nb.com/ubuntu:20.04
...
# Configure Linux for Neuron repository updates
RUN apt-get install -y gnupg2 pciutils
RUN echo "deb https://apt.repos.neuron.amazonaws.com bionic main" > /etc/apt/sources.list.d/neuron.list
RUN wget -qO - https://apt.repos.neuron.amazonaws.com/GPG-PUB-KEY-AMAZON-AWS-NEURON.PUB | apt-key add -
# Update OS packages
RUN apt-get update -y && apt-get install -y aws-neuronx-dkms aws-neuronx-tools python3.8 python3-pip
ENV PATH="/opt/bin/:/opt/aws/neuron/bin:${PATH}"
RUN pip3 install torch-neuron==1.12.1.* --extra-index-url=https://pip.repos.neuron.amazonaws.com/cxx11

run with docker command: --device /dev/neuron0 --device /dev/neuron1 --device /dev/neuron2 --device /dev/neuron3 and I can see neuron devices in my container:

I have configured enviroment variables so that DJL(serving framework) can load libtorchneuron.so successfully, and it can be seen the neuron-traced model was loaded successfully, but failed when warmup( make some test cases running inference)

I use gdb to debug the core dump file and get:

Core was generated by `java -Xmn30G -Xms44G -Xmx44G -verbose:gc -XX:+PrintGCDetails -XX:+UseConcMarkSw'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  vring_build_vring_pages_from_template (page_template=page_template@entry=0x7fa3f123b420, mr_to_name_map=mr_to_name_map@entry=0x7f9715a01630, shared_io_set=shared_io_set@entry=0x0, input_set=input_set@entry=0x7f96ff441380,
    output_set=output_set@entry=0x7f96fe9d2b50, tdram_channel=tdram_channel@entry=0, axi_port=0, is_tx=true) at /local/p4clients/pkgbuild-MtVpO/workspace/src/KaenaRuntime/tdrv/vring.c:847
847 /local/p4clients/pkgbuild-MtVpO/workspace/src/KaenaRuntime/tdrv/vring.c: No such file or directory.
[Current thread is 1 (Thread 0x7fa3f123f700 (LWP 19))]
(gdb) bt full
#0  vring_build_vring_pages_from_template (page_template=page_template@entry=0x7fa3f123b420, mr_to_name_map=mr_to_name_map@entry=0x7f9715a01630, shared_io_set=shared_io_set@entry=0x0, input_set=input_set@entry=0x7f96ff441380,
    output_set=output_set@entry=0x7f96fe9d2b50, tdram_channel=tdram_channel@entry=0, axi_port=0, is_tx=true) at /local/p4clients/pkgbuild-MtVpO/workspace/src/KaenaRuntime/tdrv/vring.c:847
        ret = NRT_SUCCESS
        __FUNCTION__ = "vring_build_vring_pages_from_template"
        found_tensors = <optimized out>
        dummy_page = <error reading variable dummy_page (value of type `vring_desc_page_t' requires 1048592 bytes, which is more than max-value-size)>
        curr_page = <optimized out>
        mod_mask = <optimized out>
        __PRETTY_FUNCTION__ = "vring_build_vring_pages_from_template"
#1  0x00007f984063ebc2 in ddrs_build_ring_from_template (mem_location=TONGA_DRAM, ring=<synthetic pointer>, shared_io_set=0x0, output_set=0x7f96fe9d2b50, input_set=0x7f96ff441380, mr_to_name_map=0x7f9715a01630, name=0x7f97156f69b5 "qPoolIn0_0",
    type=DMA_RING_TYPE_TX, axi_port=0, tdram_region=0, tdram_channel=0, tpb_idx=0, mla=0x7f9710000038, template=0x7f9718839840) at /local/p4clients/pkgbuild-MtVpO/workspace/src/KaenaRuntime/tdrv/dma_dynamic_ring_set.c:198
        ret = <optimized out>
        vring_descriptors = {used_desc_count = 808, head = 0x7f96ffafacc0, tail = 0x7f96ffafacc0, is_template = true, is_shared_only = false, max_var_id = 23}
        ret = <optimized out>
        vring_descriptors = <optimized out>
#2  ddrs_build_ring_and_add_to_ring_set (template=0x7f9718839840, queue=queue@entry=0x7f97156f6990, mla=mla@entry=0x7f9710000038, tpb_idx=tpb_idx@entry=0, tdram_channel=0, axi_port=0, type=DMA_RING_TYPE_TX, mr_to_name_map=0x7f9715a01630,
    input_set=0x7f96ff441380, output_set=0x7f96fe9d2b50, shared_io_set=0x0, dynamic_ring_set=0x7f96fe8a27d0) at /local/p4clients/pkgbuild-MtVpO/workspace/src/KaenaRuntime/tdrv/dma_dynamic_ring_set.c:295
        ret = NRT_SUCCESS
        tdram_region = <optimized out>
        ring = <optimized out>
        __FUNCTION__ = "ddrs_build_ring_and_add_to_ring_set"
#3  0x00007f984063f75f in ddrs_build_dma_rings_for_queue (dynamic_ring_set=0x7f96fe8a27d0, shared_io_set=0x0, output_set=0x7f96fe9d2b50, input_set=0x7f96ff441380, mr_to_name_map=0x7f9715a01630, tpb_idx=0, queue_bundle=0x7f97156f6838,
    mla=0x7f9710000038) at /local/p4clients/pkgbuild-MtVpO/workspace/src/KaenaRuntime/tdrv/dma_dynamic_ring_set.c:371
        type = <optimized out>
        vring_descriptors = <optimized out>
        i = <optimized out>
        queue = 0x7f97156f6990
        template = 0x7f9718839740
--Type <RET> for more, q to quit, c to continue without paging--
        ring_types = {DMA_RING_TYPE_TX, DMA_RING_TYPE_RX}
        ring_type_count = 2
        qidx = 0
        ret = NRT_SUCCESS
        ret = <optimized out>
        qidx = <optimized out>
        queue = <optimized out>
        template = <optimized out>
        ring_types = <optimized out>
        ring_type_count = <optimized out>
        i = <optimized out>
        type = <optimized out>
        vring_descriptors = <optimized out>
#4  ddrs_build_dma_rings (mla=0x7f9710000038, qset=qset@entry=0x7f97156f06d0, tpb_idx=0, mr_to_name_map=0x7f9715a01630, input_set=input_set@entry=0x7f96ff441380, output_set=output_set@entry=0x7f96fe9d2b50, shared_io_set=0x0, shared_ring_set=0x0,
    compute_resource=0x7f96fe67b540) at /local/p4clients/pkgbuild-MtVpO/workspace/src/KaenaRuntime/tdrv/dma_dynamic_ring_set.c:512
        rc_node = <optimized out>
        queue_bundle = 0x7f97156f6838
        lookup_key = 0x0
        lookup_key_nbytes = <optimized out>
        queue_ring_set = 0x7f96fe8a27d0
        cache_hit = <optimized out>
        i = 4
        rset_idx = <optimized out>
        ret = NRT_SUCCESS
        __PRETTY_FUNCTION__ = "ddrs_build_dma_rings"
        __FUNCTION__ = "ddrs_build_dma_rings"
#5  0x00007f984063840d in kbl_compute_build_compute_resources (vtpb_start=vtpb_start@entry=0, tpb_count=tpb_count@entry=1, h_model=..., input_set=input_set@entry=0x7f96ff441380, output_set=output_set@entry=0x7f96fe9d2b50, shared_io_set=0x0,
    resource_tracker=0x7fa3f123b6a0) at /local/p4clients/pkgbuild-MtVpO/workspace/src/KaenaRuntime/tdrv/tdrv.c:999
--Type <RET> for more, q to quit, c to continue without paging--
        shared_ring_set = <optimized out>
        num_dynamic_ring_sets = 2
        ioqs_resource = 0x7f96fe67b540
        vtpb_id = 0
        sring_sets = <optimized out>
        i = 0
        ret = <optimized out>
        mla = 0x7f9710000038
        tpb = 0x7f9710000080
        mod = 0x7f97156f06d0
        __FUNCTION__ = "kbl_compute_build_compute_resources"
        __PRETTY_FUNCTION__ = "kbl_compute_build_compute_resources"
#6  0x00007f98405800a1 in dlr_infer (dlr_mod=0x7f9718a651a0, inference_id=inference_id@entry=0, range=..., in_ifmap_set=<optimized out>, out_ifmap_set=0x7f971754dd90, output_info=output_info@entry=0x7fa3ef112228)
    at /local/p4clients/pkgbuild-MtVpO/workspace/src/KaenaRuntime/kmgr/dlr.cpp:1872
        ret = <optimized out>
        mod = 0x7f97156f06d0
        compute_idx = 0
        io_idx = 0
        nd_idx = 0
        nc_idx = 0
        shared_io_set = <optimized out>
        in_ifmap_set_dev = 0x7f96ff441380
        out_ifmap_set_dev = 0x7f96fe9d2b50
        host_io_tensor_cache = <optimized out>
        in_fmaps = 0x7f96ff441380
        out_fmaps = 0x7f96fe9d2b50
        vtpb_start = <optimized out>
        vtpb_end = <optimized out>
--Type <RET> for more, q to quit, c to continue without paging--
        __FUNCTION__ = "dlr_infer"
        valid_io = <optimized out>
        __PRETTY_FUNCTION__ = "NRT_STATUS dlr_infer(dlr_kelf_model_t*, uint64_t, vtpb_range_t, const kbl_feature_map_set_t*, kbl_feature_map_set_t*, kbl_output_info_t*)"
        resource_tracker = {mem_tracker = {count = 0, head = 0x0, tail = 0x0}, ioqs_resources = 0x7f96feeb8160, tpb_count = 1}
#7  0x00007f9840594be8 in exec_model (kelf_node_info=kelf_node_info@entry=0x55f313988350, inference_id=inference_id@entry=0, io=0x7fa3ef112218, data_entry=0x7f9715aa4260)
    at /local/p4clients/pkgbuild-MtVpO/workspace/src/KaenaRuntime/kmgr/tvm/dmlctvm/src/runtime/tonga_op.cc:65
        ret = <optimized out>
#8  0x00007f9840587f50 in tvm::runtime::GraphRuntime::Run (this=0x7fa3ee099220, inference_id=inference_id@entry=0, node_latencies=0x0, device_latency=device_latency@entry=0x7fa3f123b988)
    at /local/p4clients/pkgbuild-MtVpO/workspace/src/KaenaRuntime/kmgr/tvm/dmlctvm/src/runtime/graph_runtime.cc:252
        o = 0x55f313988350
        fp = 0x7f9840594b90 <exec_model(kelf_node_info const*, unsigned long, top_node_io*, tvm::runtime::grt_tensor*)>
        r = <optimized out>
        node_latency = {ts_start = {tv_sec = 1671006531, tv_nsec = 42548434}, ts_end = {tv_sec = 0, tv_nsec = 140286494522368}}
        op = <optimized out>
        collect_latency = <optimized out>
        i = 18
        ret = 0
        __FUNCTION__ = "Run"
#9  0x00007f984057f2c9 in dlr_run_graph (resp=0x7fa3f123b960, graph_index=<optimized out>, inference_id=0, mod=0x7fa3ee0b5450) at /local/p4clients/pkgbuild-MtVpO/workspace/src/KaenaRuntime/kmgr/dlr.cpp:1659
        r = <optimized out>
        ret = NRT_FAILURE
        gr = <optimized out>
        ret = <optimized out>
        gr = <optimized out>
        r = <optimized out>
        err = <optimized out>
#10 run_inference (mod=mod@entry=0x7fa3ee0b5450, inference_id=inference_id@entry=0, graph_index=graph_index@entry=0, in_set=in_set@entry=0x7f96fe88f2d0, out_set=out_set@entry=0x7f96ff4684c0, resp=resp@entry=0x7fa3f123b960)
    at /local/p4clients/pkgbuild-MtVpO/workspace/src/KaenaRuntime/kmgr/dlr.cpp:1141
--Type <RET> for more, q to quit, c to continue without paging--
        ret = <optimized out>
#11 0x00007f984058115c in kmgr_infer (h_nn=h_nn@entry=..., in_set=in_set@entry=0x7f96fe88f2d0, out_set=out_set@entry=0x7f96ff4684c0, loop_end_value=1, loop_end_value@entry=0)
    at /local/p4clients/pkgbuild-MtVpO/workspace/src/KaenaRuntime/kmgr/dlr.cpp:1576
        ret = NRT_INVALID
        infer_wait = {tv_sec = 1671006591, tv_nsec = 34194492}
        resp = {combined_output_info = {err = {num_errors_per_class = {0, 0, 0, 0, 0, 0}}, tpb_duration = 0, tpb_cc_duration = 0}, mla_time = 0, ts_start = {tv_sec = 1671006531, tv_nsec = 34194492}, ts_done = {tv_sec = 0, tv_nsec = 0}, start_nd = 0,
          start_nc = 0}
        inference_id = 0
        mod = 0x7fa3ee0b5450
        __FUNCTION__ = "kmgr_infer"
        plat = <optimized out>
        nd_idx = 0
        nc_idx = 0
        graph_index = 0
        infer_count = <optimized out>
        n = <optimized out>
#12 0x00007f98404967aa in nrt_infer (repeat_count=0, out_set=0x7f96ff4684c0, in_set=0x7f96fe88f2d0, model=0x7f97106ff680) at /local/p4clients/pkgbuild-MtVpO/workspace/src/KaenaRuntime/nrt/nrt_exec.cpp:49
        input_set = 0x7f96fe88f2d0
        output_set = 0x7f96ff4684c0
        ifmap_count = 8
        ret = NRT_SUCCESS
        instance_index = <optimized out>
        model_handle = {id = 10001}
        input_set = <optimized out>
        output_set = <optimized out>
        ifmap_count = <optimized out>
        ret = <optimized out>
        instance_index = <optimized out>
--Type <RET> for more, q to quit, c to continue without paging--
        model_handle = <optimized out>
#13 nrt_execute_repeat (model=model@entry=0x7f97106ff680, input=input@entry=0x7f96fe88f2d0, output=output@entry=0x7f96ff4684c0, repeat_count=repeat_count@entry=0) at /local/p4clients/pkgbuild-MtVpO/workspace/src/KaenaRuntime/nrt/nrt_exec.cpp:70
        lock = {took_lock = false}
        __FUNCTION__ = "nrt_execute_repeat"
        result = <optimized out>
#14 0x00007f98404969c8 in nrt_execute (model=0x7f97106ff680, input=0x7f96fe88f2d0, output=0x7f96ff4684c0) at /local/p4clients/pkgbuild-MtVpO/workspace/src/KaenaRuntime/nrt/nrt_exec.cpp:81
        lock = {took_lock = true}
        __FUNCTION__ = "nrt_execute"
#15 0x00007f9840ab2345 in neuron::infer(std::vector<at::Tensor, std::allocator<at::Tensor> > const&, std::vector<at::Tensor, std::allocator<at::Tensor> > const&, std::shared_ptr<neuron::NeuronModel>, int) ()
   from /usr/local/lib/python3.8/dist-packages/torch_neuron/lib/libtorchneuron.so
No symbol table info available.
#16 0x00007f9840ab4657 in neuron::forward_model(std::vector<at::Tensor, std::allocator<at::Tensor> > const&, std::shared_ptr<neuron::NeuronModel>) () from /usr/local/lib/python3.8/dist-packages/torch_neuron/lib/libtorchneuron.so
No symbol table info available.
#17 0x00007f9840bb982e in neuron::forward_v2(std::vector<at::Tensor, std::allocator<at::Tensor> >, c10::intrusive_ptr<neuron::Model, c10::detail::intrusive_target_default_null_type<neuron::Model> >) ()
   from /usr/local/lib/python3.8/dist-packages/torch_neuron/lib/libtorchneuron.so
No symbol table info available.
#18 0x00007f9840bc0282 in auto neuron::forward_v2_tuple<1ul>(std::vector<at::Tensor, std::allocator<at::Tensor> >, c10::intrusive_ptr<neuron::Model, c10::detail::intrusive_target_default_null_type<neuron::Model> >) ()
   from /usr/local/lib/python3.8/dist-packages/torch_neuron/lib/libtorchneuron.so
No symbol table info available.
#19 0x00007f9840be6463 in c10::impl::make_boxed_from_unboxed_functor<c10::impl::detail::WrapFunctionIntoRuntimeFunctor_<std::tuple<at::Tensor> (*)(std::vector<at::Tensor, std::allocator<at::Tensor> >, c10::intrusive_ptr<neuron::Model, c10::detail::intrusive_target_default_null_type<neuron::Model> >), std::tuple<at::Tensor>, c10::guts::typelist::typelist<std::vector<at::Tensor, std::allocator<at::Tensor> >, c10::intrusive_ptr<neuron::Model, c10::detail::intrusive_target_default_null_type<neuron::Model> > > >, true>::call(c10::OperatorKernel*, c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c10::IValue, std::allocator<c10::IValue> >*) () from /usr/local/lib/python3.8/dist-packages/torch_neuron/lib/libtorchneuron.so
No symbol table info available.
#20 0x00007f97e8cdcc69 in c10::Dispatcher::callBoxed(c10::OperatorHandle const&, std::vector<c10::IValue, std::allocator<c10::IValue> >*) const () from /root/.djl.ai/pytorch/1.12.1-20220818-cpu-linux-x86_64/libtorch_cpu.so
No symbol table info available.
#21 0x00007f97eb7f8e23 in torch::jit::InterpreterStateImpl::runImpl(std::vector<c10::IValue, std::allocator<c10::IValue> >&) () from /root/.djl.ai/pytorch/1.12.1-20220818-cpu-linux-x86_64/libtorch_cpu.so
No symbol table info available.
#22 0x00007f97eb7e71b2 in torch::jit::InterpreterState::run(std::vector<c10::IValue, std::allocator<c10::IValue> >&) () from /root/.djl.ai/pytorch/1.12.1-20220818-cpu-linux-x86_64/libtorch_cpu.so

Key points

I have loaded my model and running test cases successfully with python code, and I can guarantee that my dummy test requests are the same (both shape and dtypes) between java and python code.

So I'd like to know if you have ideas about this error? Thanks.

aws-neuron / aws-neuron-sdk