Closed linbaiwpi closed 2 weeks ago
Thank you for your comment. I fixed the direct memory leak of the mem_fetch object in DramRamulator2. When you build ONNXim in Debug mode, it runs memory sanitizer checks. These slow down the simulation, so if you need faster simulation, please build ONNXim in Release mode.
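For context, assuming the report comes from a sanitizer such as LeakSanitizer, a "direct leak" is a heap block that nothing points to any more when the program exits. The sketch below only illustrates the general ownership pattern that avoids this kind of leak; it is not the actual change made in DramRamulator2. `MemRequest` and `DramQueue` are hypothetical names that stand in for the real classes, which are not shown in this thread.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <queue>
#include <utility>

// Hypothetical request object; stands in for a class like mem_fetch.
struct MemRequest {
    static int live;  // number of MemRequest objects currently alive
    MemRequest() { ++live; }
    ~MemRequest() { --live; }
    MemRequest(const MemRequest&) = delete;
    MemRequest& operator=(const MemRequest&) = delete;
};
int MemRequest::live = 0;

// Hypothetical DRAM front end that owns every queued request.
class DramQueue {
public:
    // Taking a unique_ptr makes the ownership transfer explicit.
    void push(std::unique_ptr<MemRequest> req) { _pending.push(std::move(req)); }

    // Completes the oldest request. The local unique_ptr frees the
    // object when it goes out of scope, so no manual delete is needed.
    bool complete_one() {
        if (_pending.empty()) return false;
        std::unique_ptr<MemRequest> done = std::move(_pending.front());
        _pending.pop();
        return true;
    }

    std::size_t size() const { return _pending.size(); }

private:
    std::queue<std::unique_ptr<MemRequest>> _pending;
};
```

With this pattern, requests that are still queued when the `DramQueue` is destroyed are released as well, so the sanitizer has nothing to report for them.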
Thanks for your quick reply.
1) If I understand correctly, this memory sanitization won't affect the correctness of the ONNXim example run?
2) Also, I tried to compile it in Release mode and, interestingly, I got the following new fault.
root@0c4c6e0e9971:/workspace/ONNXim/build# ./bin/Simulator --config /workspace/ONNXim/configs/systolic_ws_128x128_c4_simple_noc_tpuv4.json --model /workspace/ONNXim/example/models_list.json
[2024-09-04 20:21:53.259] [info] CPU 0: Partition 0
[2024-09-04 20:21:53.259] [info] CPU 1: Partition 0
[2024-09-04 20:21:53.259] [info] CPU 2: Partition 0
[2024-09-04 20:21:53.259] [info] CPU 3: Partition 0
[2024-09-04 20:21:53.259] [info] Running in default mode
[2024-09-04 20:21:53.260] [info] Simulator Configuration:
[2024-09-04 20:21:53.260] [info] [Core 0] Systolic Array Throughput: 131072 GFLOPS, Spad size: 32768 KB, Accumulator size: 4096 KB
[2024-09-04 20:21:53.260] [info] [Core 1] Systolic Array Throughput: 131072 GFLOPS, Spad size: 32768 KB, Accumulator size: 4096 KB
[2024-09-04 20:21:53.260] [info] [Core 2] Systolic Array Throughput: 131072 GFLOPS, Spad size: 32768 KB, Accumulator size: 4096 KB
[2024-09-04 20:21:53.260] [info] [Core 3] Systolic Array Throughput: 131072 GFLOPS, Spad size: 32768 KB, Accumulator size: 4096 KB
[2024-09-04 20:21:53.260] [info] DRAM Bandwidth 614 GB/s
[2024-09-04 20:21:53.260] [info] Ramulator2 config: /workspace/ONNXim/configs/../configs/ramulator2_configs/HBM2.yaml
[2024-09-04 20:21:53.267] [info] Initialize SimpleInterconnect
[2024-09-04 20:21:53.267] [info] No mapping file path : /workspace/ONNXim/models/resnet18/resnet18.mapping
[2024-09-04 20:21:53.267] [info] Register model: resnet18
[2024-09-04 20:21:53.267] [info] Model Name
[2024-09-04 20:21:53.298] [info] ======Start Simulation=====
Floating point exception (core dumped)
Hi @linbaiwpi, thanks for your issue report.
if I understand correctly, this memory sanitization won't affect the correctness of example run of ONNXim?
No, it will not affect the correctness.
also, I tried to compile it in Release mode and interestingly, I got the following new fault
Can you make sure your repository is currently up to date with the latest master version?
If you're still getting the error, I'd appreciate it if you could attach the systolic_ws_128x128_c4_simple_noc_tpuv4.json and models_list.json you're using so I can reproduce it.
I pulled the newest commit [ac10777] and rebuilt using the following commands:
$ mkdir build && cd build
$ conan install ..
$ cmake ..
$ make -j8
Then I ran the Simulator:
$ ./build/bin/Simulator --config ./configs/systolic_ws_128x128_c4_simple_noc_tpuv4.json --model ./example/models_list.json
I still got the same fault (a floating point exception, as before):
[2024-09-04 23:29:59.269] [info] CPU 0: Partition 0
[2024-09-04 23:29:59.269] [info] CPU 1: Partition 0
[2024-09-04 23:29:59.269] [info] CPU 2: Partition 0
[2024-09-04 23:29:59.269] [info] CPU 3: Partition 0
[2024-09-04 23:29:59.269] [info] Running in default mode
[2024-09-04 23:29:59.269] [info] Simulator Configuration:
[2024-09-04 23:29:59.269] [info] [Core 0] Systolic Array Throughput: 131072 GFLOPS, Spad size: 32768 KB, Accumulator size: 4096 KB
[2024-09-04 23:29:59.269] [info] [Core 1] Systolic Array Throughput: 131072 GFLOPS, Spad size: 32768 KB, Accumulator size: 4096 KB
[2024-09-04 23:29:59.269] [info] [Core 2] Systolic Array Throughput: 131072 GFLOPS, Spad size: 32768 KB, Accumulator size: 4096 KB
[2024-09-04 23:29:59.269] [info] [Core 3] Systolic Array Throughput: 131072 GFLOPS, Spad size: 32768 KB, Accumulator size: 4096 KB
[2024-09-04 23:29:59.269] [info] DRAM Bandwidth 614 GB/s
[2024-09-04 23:29:59.269] [info] Ramulator2 config: /workspace/ONNXim/configs/../configs/ramulator2_configs/HBM2.yaml
[2024-09-04 23:29:59.274] [info] Initialize SimpleInterconnect
[2024-09-04 23:29:59.274] [info] No mapping file path : /workspace/ONNXim/models/resnet18/resnet18.mapping
[2024-09-04 23:29:59.274] [info] Register model: resnet18
[2024-09-04 23:29:59.274] [info] Model Name
[2024-09-04 23:29:59.295] [info] ======Start Simulation=====
Floating point exception (core dumped)
Could you attach your systolic_ws_128x128_c4_simple_noc_tpuv4.json and models_list.json so I can reproduce it?
Actually, systolic_ws_128x128_c4_simple_noc_tpuv4.json and models_list.json are both from the latest commit on the main branch. Nothing has been changed on my side. I have also attached them to this post. Please check.
systolic_ws_128x128_c4_simple_noc_tpuv4.json models_list.json
I tested the latest version in a docker environment and failed to reproduce your issue.
Can you tell us where in the source code the exception is occurring? If a Core dump file has been generated, you can check it out.
Hi, here is the backtrace of the floating point exception.
(gdb) backtrace
#0 0x00005555555de1e1 in MappingTable::_calc_conv_mapping(bool, int, int, int, bool, bool, bool, int, int, int, int, int, int, int, int, int) [clone .constprop.1] ()
#1 0x00005555555e392d in MappingTable::calc_conv_mapping(Mapping::LoopCounts&) ()
#2 0x00005555555e45dc in MappingTable::conv_mapping(Mapping::LoopCounts&) ()
#3 0x00005555555e555b in MappingTable::fallback_mapping(Mapping::LoopCounts&) ()
#4 0x000055555565f1e8 in ConvWS::initialize_tiles(MappingTable&) ()
#5 0x00005555555ecd38 in Model::initialize_model(std::vector<std::unique_ptr<Tensor, std::default_delete<Tensor> >, std::allocator<std::unique_ptr<Tensor, std::default_delete<Tensor> > > >&) ()
#6 0x00005555555f4d8e in Simulator::handle_model() ()
#7 0x00005555555f5f97 in Simulator::cycle() ()
#8 0x00005555555b5db3 in main ()
I found that the member variable _dim in class MappingTable is never initialized. What confuses me is that this private member _dim is used in MappingTable::_calc_conv_mapping without any initialization. Meanwhile, a local variable with the same name, _dim, is defined in MappingTable::calc_conv_mapping and MappingTable::gemm_mapping. In these two functions, _dim holds core_height from the hardware config, which is the dimension of the hardware PE array.
Please correct me if my statement is wrong.
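To illustrate the concern, here is a minimal sketch, not the actual ONNXim code. On x86 Linux an integer division by zero raises SIGFPE, which the shell reports as "Floating point exception", so an uninitialized member that happens to hold zero and is used as a divisor would produce this symptom. Whether _dim is really used as a divisor is an assumption on my part. `CoreConfig`, `TileMapper` and `tiles_for` are hypothetical names; only core_height and _dim come from the discussion above.

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>

// Hypothetical stand-in for the hardware config.
struct CoreConfig {
    uint32_t core_height;  // dimension of the PE array
};

// Hypothetical mapping helper, not the real MappingTable.
class TileMapper {
public:
    // Initialize the member once, from the config, in the constructor,
    // so that every method sees the same well-defined value.
    explicit TileMapper(const CoreConfig& config) : _dim(config.core_height) {
        if (_dim == 0) {
            throw std::invalid_argument("core_height must be non-zero");
        }
    }

    // Number of tiles of size _dim needed to cover `extent`
    // (ceiling division). Safe because _dim is checked to be non-zero.
    uint32_t tiles_for(uint32_t extent) const {
        return (extent + _dim - 1) / _dim;
    }

private:
    uint32_t _dim = 0;  // default member initializer: never indeterminate
};
```

Initializing the member in one place also removes the need for local variables that shadow it, which is what made the original code hard to follow.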
Best
Thank you for taking the time to help us debug.
As you said, _calc_conv_mapping was using an uninitialized member variable.
I've pushed a commit that fixes that issue.
I really appreciate you finding the bug and letting me know!
Thank you for your prompt reply. This solved my problem, so I will close this issue.
Hi, I built the source from scratch and then ran the example command, but got more than 1000 lines of memory leak information. Do you have a sense of what leads to this?
This is what I ran:
docker build . -t onnxim
./build/bin/Simulator --config ./configs/systolic_ws_128x128_c4_simple_noc_tpuv4.json --model ./example/models_list.json
The model actually ran to completion (the first 2 lines of the printout are listed below), but it seems that a memory leak check is performed at the end.