Closed BinZlP closed 3 years ago
Hi!
AIFM uses SIGUSR2 and SIGUSR1 to force task preemption and GC triggering, so it's expected to see that in GDB. You can suppress them using `handle SIGUSR2 nostop noprint` and `handle SIGUSR1 nostop noprint`.
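If typing those two commands in every session gets tedious, they can be kept in a GDB command file. This is only a sketch: the file name `aifm.gdb` is made up for illustration, and `./main` is the binary name mentioned in this thread, with its arguments left out.

```shell
# Write a GDB command file that tells GDB not to stop on or print
# the two signals AIFM uses internally (preemption and GC triggering).
# The file name aifm.gdb is a placeholder chosen for this example.
cat > aifm.gdb <<'EOF'
handle SIGUSR1 nostop noprint
handle SIGUSR2 nostop noprint
EOF

# Then start the program under GDB with the file loaded, for example:
#   gdb -x aifm.gdb --args ./main
# (left commented out so the snippet itself does not require gdb)
```

SIGSEGV is not touched by these settings, so GDB should still stop at the segfault.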
Could you please rerun the experiment under GDB and show me the call stack when the segfault is triggered?
Thanks for your explanation!
This is the call stack when the segfault occurred:
Thread 2 "main" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fff954fe700 (LWP 7531)]
far_memory::GCParallelMarker::slave_fn (this=, tid= ) at ../../..//src/manager.cpp:434
434 if (!ptr->meta().is_shared()) {
(gdb) bt
0 far_memory::GCParallelMarker::slave_fn (this=, tid= ) at ../../..//src/manager.cpp:434
1 0x00005555555a1786 in std::function<void ()>::operator()() const (this=0x7ffc26cf6fe0)
at /usr/include/c++/9/bits/std_function.h:683
2 rt::thread_internal::ThreadTrampolineWithJoin (arg=0x7ffc26cf6fd0) at thread.cc:15
3 0x00005555555a3dd0 in ?? () at runtime/sched.c:128
4 0x0000000000000000 in ?? ()
Another problem also came up: the program sometimes hangs while reading the file, like below:
Have read 72351744 bytes.
Have read 73400320 bytes.
Have read 74448896 bytes.
Have read 75497472 bytes.
Have read 76546048 bytes.
Have read 77594624 bytes.
Have read 78643200 bytes.
Have read 79691776 bytes.
Have read 80740352 bytes.
Have read 81788928 bytes.
Have read 82837504 bytes.
( ... stop and not proceed )
It looked like a deadlock, so I interrupted the program with Ctrl+C. This is the call stack at the SIGINT, after the program had stopped making progress:
Thread 1 "main" received signal SIGINT, Interrupt.
0x00007ffff6e58317 in ioctl () at ../sysdeps/unix/syscall-template.S:78
78 ../sysdeps/unix/syscall-template.S: No such file or directory.
(gdb) bt
0 0x00007ffff6e58317 in ioctl () at ../sysdeps/unix/syscall-template.S:78
1 0x00005555555a396e in kthread_yield_to_iokernel () at runtime/kthread.c:118
2 kthread_park (voluntary=) at runtime/kthread.c:244
3 0x00005555555a4dff in schedule () at ./inc/runtime/preempt.h:53
4 0x00005555555a3e80 in ?? () at runtime/sched.c:175
5 0x0000000000000000 in ?? ()
Thanks!
Also, I tried running the test code while reducing the local cache size from 16GB to 1GB, and the problems started at 14GB. Starting at 14GB, the program sometimes pauses briefly while reading, then hangs at the end of the reading stage (or earlier) without moving on to the compression stage. In this case, the call stack is the same as the last one in my previous comment.
Lastly, the segfault is sometimes triggered after reading has completed. Here's the log from that case:
Have read 998244352 bytes.
Have read 999292928 bytes.
[ 7.498414] CPU 07| <3> runtime/net/directpath/mlx5/mlx5_rxtx.c:105 mlx5_transmit_one() suppressed 1919 times
[ 7.498425] CPU 07| <3> txq full
[ 9.267466] CPU 02| <3> runtime/net/directpath/mlx5/mlx5_rxtx.c:105 mlx5_transmit_one() suppressed 4930 times
[ 9.267480] CPU 02| <3> txq full
[ 11.240640] CPU 02| <3> runtime/net/directpath/mlx5/mlx5_rxtx.c:105 mlx5_transmit_one() suppressed 9693 times
[ 11.240655] CPU 02| <3> txq full
[ 13.067836] CPU 02| <3> runtime/net/directpath/mlx5/mlx5_rxtx.c:105 mlx5_transmit_one() suppressed 8229 times
[ 13.067850] CPU 02| <3> txq full
Thread 2 "main" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fff954fe700 (LWP 10236)]
far_memory::GCParallelMarker::slave_fn (this=, tid= ) at ../../..//src/manager.cpp:434
And this is the call stack for that case:
(gdb) bt
0 far_memory::GCParallelMarker::slave_fn (this=, tid= ) at ../../..//src/manager.cpp:434
1 0x00005555555a1786 in std::function<void ()>::operator()() const (this=0x7ffbec176fe0)
at /usr/include/c++/9/bits/std_function.h:683
2 rt::thread_internal::ThreadTrampolineWithJoin (arg=0x7ffbec176fd0) at thread.cc:15
3 0x00005555555a3dd0 in ?? () at runtime/sched.c:128
4 0x0000000100000000 in ?? ()
5 0x0000100000082f00 in ?? ()
6 0x00007ffbf19e6f01 in ?? ()
7 0x00007fff88011b18 in ?? ()
8 0x000055555556ffc0 in ?? () at ../../..//inc/internal/parallel.ipp:77
9 0x000055555556ffa0 in ?? () at /usr/include/c++/9/bits/stl_deque.h:273
10 0x0000000000000000 in ?? ()
Thanks for your help :)
Hi, thanks for your information! This looks like a bug, which is intolerable. I will try to reproduce and fix it once I get a chance. Should be soon.
Hi, I just ran the fig.11a code with local_ram=14G 100 times on my CloudLab instance (the one mentioned in the README), and didn't observe a segfault or deadlock. Maybe what you are facing is caused by some misconfiguration, or by actual bugs that are hard to trigger on my instance. In either case, I'd be happy to help you if I'm able to ssh into your instance. You can send me an email (zainruan@mit.edu).
Thanks for your kindness! Then I'll send you an email :D
Hi Han, the commit above should fix everything. Feel free to reopen this issue if you find anything wrong.
Hi Zain,
Okay, thanks for your quick fix! Have a good weekend :D
Best regards, Han
On Sat, Jun 19, 2021 at 1:46 PM, Zain Ruan @.***> wrote:
Hi Han, the commit above should fix everything. Feel free to reopen this issue if you find anything wrong.
Hello, I'm trying to reproduce your experimental results on my own servers, but I encountered a segmentation fault while reproducing Figure 11a (compressing an array with Snappy).
Here's my servers' specification:
I modified `setup/run.sh` to match my compute node's SSD and executed it. It seems there was no uncompressed file in `/mnt`, so I manually decompressed `enwik9.zip` in `/mnt`. After that, I ran `aifm/run.sh`, and it appeared to run successfully with 256MB of local_ram (there are an elapsed time and a "Force exiting..." message in `log.256`). But the script didn't proceed to the next step (512MB of local_ram), so I terminated it and ran each local_ram setting manually, one by one. Then I encountered a segmentation fault while reading the uncompressed file.
Here's the log of the execution that was terminated by the segmentation fault:
and dmesg printed this:
For debugging, I modified the `Makefile` to add the `-g` compile option, but `main` didn't create any core files. So I tried to use gdb to investigate, but after the threads were created, SIGUSR2 signals showed up like below:
I have no idea what I missed. Could you give me some hints to fix this? I'd appreciate your reply!