Open nnurlan008 opened 1 year ago
Hi, can you use GDB to check where the segmentation fault actually happens?
I debugged with gdb and noticed that the segmentation fault comes from line 276. As far as I understand, the RDMA connection fails and causes the segmentation fault, because when I comment out that line, the output is as follows:
(gdb) continue
Continuing.
[main.cc:272] use page mem: 5368709120
[allocator_master.hpp:50] allocator master register memory: 2145385472
[main.cc:278] Memory layout of the server: | meta 0x0:0x0+2MB | page 0x200000:0x200000+5GB | Heap 0x140200400:0x140200400+1.99805GB |
[New Thread 0x7ffe351ff640 (LWP 39895)]
[main.cc:306] server wait for threads to join ...
[main.cc:308] Start populating DB: ycsb with num: 10000
[main.cc:167] B+tree load done, leaf sz: 384 rdma base: 140729789710336
[main.cc:183] start training!
Thread 1 "fserver" received signal SIGSEGV, Segmentation fault.
fstore::ModelConfig::load_internal (handle=std::shared_ptr
Thanks for the information. This function registers a callback for the client's RDMA connection requests, but it is still unclear to me exactly where the segmentation fault happens.
Could you please use a core dump to trace the exact location of the segmentation fault?
For example, on our servers we first run ulimit -c unlimited
and then gdb fserver core
where core is the file generated when the segmentation fault happens, to find which line of code causes the segmentation fault.
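If changing the shell limit is inconvenient, the same effect can be approximated from inside the process. The following is a minimal sketch, not part of fserver: `enable_core_dumps` is a hypothetical helper that uses the POSIX `getrlimit`/`setrlimit` calls to raise the soft core-file limit to the hard limit. It only reaches "unlimited" if the hard limit already allows it.

```cpp
#include <sys/resource.h>
#include <cstdio>

// Hypothetical helper (not from the fserver code base): raise the soft
// core-file size limit to the hard limit for the current process.
// Returns true on success, false if either system call fails.
bool enable_core_dumps() {
    struct rlimit lim;
    if (getrlimit(RLIMIT_CORE, &lim) != 0) {
        std::perror("getrlimit");
        return false;
    }
    // An unprivileged process may raise the soft limit up to the hard limit.
    lim.rlim_cur = lim.rlim_max;
    if (setrlimit(RLIMIT_CORE, &lim) != 0) {
        std::perror("setrlimit");
        return false;
    }
    return true;
}
```

Calling this at the very start of main would make a later crash leave a core file, which can then be opened with gdb fserver core as described above. Where the core file is written depends on the system configuration.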
I ran ulimit -c unlimited and gdb fserver core, then took a backtrace to find the line that causes the fault. The gdb output is as follows:
Thanks and regards
Hi, thanks for your feedback! It appears that the region memory has been corrupted. One thing in the memory configuration looks like a possible cause:
[main.cc:271] use page mem: 5368709120
[allocator_master.hpp:50] allocator master register memory: 2145385472
It seems that the allocated RDMA heap is smaller than the allocated page memory. Can you configure a larger RDMA heap and see whether that fixes the problem?
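To make the two sizes easier to compare, here is a small sketch that does the arithmetic on the values printed in the log. It assumes the log's "GB" means 2^30 bytes, which is consistent with the 1.99805GB shown in the memory layout line. The constant and function names are made up for illustration, and the snippet only checks the numbers; it does not explain the crash.

```cpp
#include <cstdint>

constexpr std::uint64_t kKiB = 1024;
constexpr std::uint64_t kMiB = 1024 * kKiB;
constexpr std::uint64_t kGiB = 1024 * kMiB;

// Values copied from the server log and the memory layout line.
constexpr std::uint64_t kPageMemBytes  = 5368709120ULL;  // "use page mem"
constexpr std::uint64_t kRdmaHeapBytes = 2145385472ULL;  // "allocator master register memory"
constexpr std::uint64_t kHeapStart     = 0x140200400ULL; // "Heap 0x140200400"

// Convert a byte count to GiB (2^30 bytes).
constexpr double to_gib(std::uint64_t bytes) {
    return static_cast<double>(bytes) / static_cast<double>(kGiB);
}

// First address past a region that starts at `start` and spans `size` bytes.
constexpr std::uint64_t region_end(std::uint64_t start, std::uint64_t size) {
    return start + size;
}

// Page memory is exactly 5 GiB.
static_assert(kPageMemBytes == 5 * kGiB, "page mem");
// The heap starts right after 5 GiB of pages, 2 MiB of meta and 1 KiB more.
static_assert(kHeapStart == 5 * kGiB + 2 * kMiB + kKiB, "heap start");
// The registered heap is 2 GiB minus that same 2 MiB + 1 KiB.
static_assert(kRdmaHeapBytes == 2 * kGiB - 2 * kMiB - kKiB, "heap size");
// Together the regions end exactly at 7 GiB.
static_assert(region_end(kHeapStart, kRdmaHeapBytes) == 7 * kGiB, "total");
```

So with the default configuration the heap is roughly 2 GiB against 5 GiB of page memory, and the whole region spans 7 GiB.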
I have increased the RDMA heap size to 8GB, but the error still occurs:
That is quite strange. From the gdb information, it seems that the memory of the global region manager in server/main.cc is corrupted, which I have never encountered before.
static RegionManager rm((char*)alloc_huge_page(global_mem_sz, 2 * MB),
Since the segfault happens in the early initialization phase of the server, could you check what goes wrong by inspecting the memory state of rm? Alternatively, could you give me more details about your build environment (e.g., the g++ version) so I can try to reproduce the problem?
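One way to inspect the memory state is to validate the buffer before it is handed to the region manager. The sketch below is an assumption-laden illustration, not the project's alloc_huge_page: `Buffer`, `alloc_huge_or_fallback`, `is_usable` and `release` are hypothetical names. It tries a Linux huge-page mapping first and falls back to an aligned heap allocation, which mirrors the "huge page alloc failed" fallback reported in the server log, and then checks that the pointer is non-null and aligned.

```cpp
#include <sys/mman.h>
#include <cstddef>
#include <cstdint>
#include <cstdlib>

// Hypothetical result type: the pointer, the usable size, and how it was obtained.
struct Buffer {
    void*       ptr;
    std::size_t size;
    bool        huge;  // true if backed by a MAP_HUGETLB mapping
};

// Try a huge-page mapping; on failure fall back to posix_memalign.
// `align` must be a power of two (e.g. 2 MiB).
Buffer alloc_huge_or_fallback(std::size_t size, std::size_t align) {
    // Round the size up to a multiple of the alignment.
    std::size_t rounded = (size + align - 1) / align * align;
#ifdef MAP_HUGETLB
    void* p = mmap(nullptr, rounded, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (p != MAP_FAILED) return {p, rounded, true};
#endif
    void* q = nullptr;
    if (posix_memalign(&q, align, rounded) != 0) return {nullptr, 0, false};
    return {q, rounded, false};
}

// A buffer is usable only if the allocation succeeded and is aligned.
bool is_usable(const Buffer& b, std::size_t align) {
    return b.ptr != nullptr &&
           reinterpret_cast<std::uintptr_t>(b.ptr) % align == 0;
}

// Release the buffer with the call that matches how it was obtained.
void release(const Buffer& b) {
    if (b.ptr == nullptr) return;
    if (b.huge) munmap(b.ptr, b.size);
    else        std::free(b.ptr);
}
```

Checking is_usable on the buffer (and printing its address and size) right before the RegionManager is constructed would at least show whether the fallback path returned a valid, aligned region.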
When I run the following command: "./fserver -db_type ycsb -model-config=ycsb-model.toml", I get the following error:
[main.cc:257] use configuration: Server config: using memory for leaf nodes: 5.0000 GB; allocated RDMA heap size: 2.0000 GB; server communication type: ud. of config file: server/config.toml
[memory_util.hpp:37] huge page alloc failed!
[memory_util.hpp:45] use default malloc for allocating page
[main.cc:271] use page mem: 5368709120
[allocator_master.hpp:50] allocator master register memory: 2145385472
Segmentation fault (core dumped)
Does anybody know why this error happens?
Thanks in advance