AIFM-sys / AIFM

AIFM: High-Performance, Application-Integrated Far Memory
MIT License

Segfault while reproducing fig11a #7

Closed BinZlP closed 3 years ago

BinZlP commented 3 years ago

Hello, I'm trying to reproduce your experimental results on my own servers, but I hit a segmentation fault while reproducing figure 11a (compressing an array with snappy).

Here's my servers' specification:

I modified setup/run.sh to match my compute node's SSD and executed it. There was no uncompressed file in /mnt, so I manually decompressed enwik9.zip into /mnt. After that I ran aifm/run.sh, and it seemed to run successfully with 256MB of local_ram (the elapsed time and a "Force exiting..." message appear in log.256). But the script didn't proceed to the next step (512MB of local_ram), so I terminated it and ran each local_ram setting manually, one by one. Then I hit a segmentation fault while reading the uncompressed file.
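The manual decompression workaround I used was roughly the following sketch (the zip's location under /mnt is an assumption from my setup; adjust it to wherever setup/run.sh actually put the file):

```shell
# Sketch: restore the uncompressed input if run.sh left only the zip.
# The exact location of enwik9.zip is an assumption from my setup.
ZIP=/mnt/enwik9.zip
if [ -f "$ZIP" ]; then
    unzip -o "$ZIP" -d /mnt    # yields /mnt/enwik9
else
    echo "no $ZIP found; nothing to decompress" >&2
fi
```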

Here's the log of the execution which was terminated by segmentation fault:

CPU 07| <5> cpu: detected 16 cores, 1 nodes
CPU 07| <5> time: detected 3912 ticks / us
[ 0.000533] CPU 07| <5> loading configuration from '/home/dcslab/aifm_org/AIFM/aifm/configs/client_noht.config'
[ 0.000603] CPU 04| <3> < 1 guaranteed kthreads is not recommended for networked apps
[ 0.012926] CPU 04| <5> net: started network stack
[ 0.012949] CPU 04| <5> net: using the following configuration:
[ 0.012952] CPU 04| <5>   addr: 10.0.0.163
[ 0.012955] CPU 04| <5>   netmask: 255.255.255.0
[ 0.012957] CPU 04| <5>   gateway: 10.0.0.1
[ 0.012960] CPU 04| <5>   mac: B2:DF:D3:AC:F3:46
[ 0.225713] CPU 04| <5> thread: created thread 0
[ 0.225837] CPU 04| <5> spawning 7 kthreads
[ 0.225997] CPU 06| <5> thread: created thread 1
[ 0.226033] CPU 15| <5> thread: created thread 2
[ 0.226169] CPU 02| <5> thread: created thread 3
[ 0.226347] CPU 13| <5> thread: created thread 4
[ 0.226512] CPU 09| <5> thread: created thread 5
[ 0.226584] CPU 12| <5> thread: created thread 6
[ 1.280757] CPU 02| <3> txq full
[ 2.280827] CPU 06| <3> runtime/net/directpath/mlx5/mlx5_rxtx.c:105 mlx5_transmit_one() suppressed 28131 times
[ 2.280855] CPU 06| <3> txq full
[ 3.286459] CPU 06| <3> runtime/net/directpath/mlx5/mlx5_rxtx.c:105 mlx5_transmit_one() suppressed 20166 times
[ 3.286488] CPU 06| <3> txq full
[ 4.287735] CPU 07| <3> runtime/net/directpath/mlx5/mlx5_rxtx.c:105 mlx5_transmit_one() suppressed 30077 times
[ 4.287760] CPU 07| <3> txq full
Have read 1048576 bytes.
Have read 2097152 bytes.
Have read 3145728 bytes.
Have read 4194304 bytes.
[ 5.476078] CPU 04| <3> runtime/net/directpath/mlx5/mlx5_rxtx.c:105 mlx5_transmit_one() suppressed 17831 times
[ 5.476099] CPU 04| <3> txq full
Have read 5242880 bytes.
Have read 6291456 bytes.
Have read 7340032 bytes.
Have read 8388608 bytes.
( ... read logs and some txq full messages ... )
Have read 49283072 bytes.
Have read 50331648 bytes.
Segmentation fault (core dumped)

and dmesg printed this:

[152719.113987] main[25957]: segfault at 697475746974 ip 0000556d9e173d57 sp 00007f7a7c2def20 error 4 in main[556d9e15d000+2ce000]
[152719.113996] Code: 84 c0 74 ef 41 80 3e ff 74 30 41 8b 45 00 48 c7 44 24 50 00 00 00 00 89 44 24 50 41 0f b7 45 04 66 89 44 24 54 48 8b 4c 24 50 <48> 8b 01 f6 c4 02 0f 85 dd 02 00 00 80 49 02 01 80 fb 07 0f 86 42

For debugging, I modified the Makefile to add the -g compile option, but main didn't produce any core files. So I tried using gdb to tackle the problem instead, but after the threads were created, SIGUSR2 signals kept occurring, like below:
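As a side note on the missing core files: -g only adds debug symbols, while whether a core file actually gets written is controlled separately by the shell's core-size limit and the kernel's core_pattern. A quick sanity check I could have run (plain Linux, nothing AIFM-specific) is:

```shell
# Core files are gated by ulimit and the kernel, not by -g.
ulimit -c unlimited                  # lift the core-size limit for this shell
ulimit -c                            # confirm: should print "unlimited"
cat /proc/sys/kernel/core_pattern    # where cores go; if this pipes to apport or
                                     # systemd-coredump, look there instead of ./
```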

Thread 1 "main" received signal SIGUSR2, User defined signal 2. 0x000055555556ae26 in spin_lock (l=0x7fff7c01e044) at ../../..//../shenango/inc/base/lock.h:49 49 while (__sync_lock_test_and_set(&l->locked, 1)) {

I have no idea what I missed. Could you give me some hints on how to fix this? I'd appreciate your reply!

zainryan commented 3 years ago

Hi! AIFM uses SIGUSR2 and SIGUSR1 to force task preemption and GC triggering, so it's expected to see that in GDB. You can suppress them using handle SIGUSR2 nostop noprint and handle SIGUSR1 nostop noprint.
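For example, at the (gdb) prompt (or in a .gdbinit) before running:

```gdb
# Don't stop or log on AIFM's preemption/GC signals, but keep
# delivering them to the program (pass is the default behavior).
handle SIGUSR1 nostop noprint
handle SIGUSR2 nostop noprint
```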

Could you please rerun the experiment using GDB and show me the call stack when segfault is triggered?

BinZlP commented 3 years ago

Thanks for your explanation!

This is the call stack when the segfault occurred:

Thread 2 "main" received signal SIGSEGV, Segmentation fault. [Switching to Thread 0x7fff954fe700 (LWP 7531)] far_memory::GCParallelMarker::slave_fn (this=, tid=) at ../../..//src/manager.cpp:434 434 if (!ptr->meta().is_shared()) { (gdb) bt

0 far_memory::GCParallelMarker::slave_fn (this=, tid=)

at ../../..//src/manager.cpp:434

1 0x00005555555a1786 in std::function<void ()>::operator()() const (this=0x7ffc26cf6fe0)

at /usr/include/c++/9/bits/std_function.h:683

2 rt::thread_internal::ThreadTrampolineWithJoin (arg=0x7ffc26cf6fd0) at thread.cc:15

3 0x00005555555a3dd0 in ?? () at runtime/sched.c:128

4 0x0000000000000000 in ?? ()

... and also, another problem appeared: the program sometimes stops while reading the file, like below:

Have read 72351744 bytes.
Have read 73400320 bytes.
Have read 74448896 bytes.
Have read 75497472 bytes.
Have read 76546048 bytes.
Have read 77594624 bytes.
Have read 78643200 bytes.
Have read 79691776 bytes.
Have read 80740352 bytes.
Have read 81788928 bytes.
Have read 82837504 bytes.
( ... stops and does not proceed ... )

It looked like a deadlock or something, so I stopped the program with Ctrl+C. This is the call stack when it was killed by SIGINT after it had stalled:

Thread 1 "main" received signal SIGINT, Interrupt. 0x00007ffff6e58317 in ioctl () at ../sysdeps/unix/syscall-template.S:78 78 ../sysdeps/unix/syscall-template.S: No such file or directory. (gdb) bt

0 0x00007ffff6e58317 in ioctl () at ../sysdeps/unix/syscall-template.S:78

1 0x00005555555a396e in kthread_yield_to_iokernel () at runtime/kthread.c:118

2 kthread_park (voluntary=) at runtime/kthread.c:244

3 0x00005555555a4dff in schedule () at ./inc/runtime/preempt.h:53

4 0x00005555555a3e80 in ?? () at runtime/sched.c:175

5 0x0000000000000000 in ?? ()

Thanks!

BinZlP commented 3 years ago

Also, I tried running the test code while reducing the local cache size from 16GB down to 1GB, and the problems started at 14GB. From 14GB on, the program sometimes stalls briefly while reading, and it stops at the end of the reading stage (or earlier) without moving on to the compression stage. In that case, the call stack is the same as the last one in my previous comment.

Lastly, sometimes the segfault is triggered after the reading has completed. Here's the log from that case:

Have read 998244352 bytes.
Have read 999292928 bytes.
[ 7.498414] CPU 07| <3> runtime/net/directpath/mlx5/mlx5_rxtx.c:105 mlx5_transmit_one() suppressed 1919 times
[ 7.498425] CPU 07| <3> txq full
[ 9.267466] CPU 02| <3> runtime/net/directpath/mlx5/mlx5_rxtx.c:105 mlx5_transmit_one() suppressed 4930 times
[ 9.267480] CPU 02| <3> txq full
[ 11.240640] CPU 02| <3> runtime/net/directpath/mlx5/mlx5_rxtx.c:105 mlx5_transmit_one() suppressed 9693 times
[ 11.240655] CPU 02| <3> txq full
[ 13.067836] CPU 02| <3> runtime/net/directpath/mlx5/mlx5_rxtx.c:105 mlx5_transmit_one() suppressed 8229 times
[ 13.067850] CPU 02| <3> txq full
Thread 2 "main" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fff954fe700 (LWP 10236)]
far_memory::GCParallelMarker::slave_fn (this=<optimized out>, tid=<optimized out>) at ../../..//src/manager.cpp:434

And this is the call stack for that case:

(gdb) bt

#0  far_memory::GCParallelMarker::slave_fn (this=<optimized out>, tid=<optimized out>)
    at ../../..//src/manager.cpp:434
#1  0x00005555555a1786 in std::function<void ()>::operator()() const (this=0x7ffbec176fe0)
    at /usr/include/c++/9/bits/std_function.h:683
#2  rt::thread_internal::ThreadTrampolineWithJoin (arg=0x7ffbec176fd0) at thread.cc:15
#3  0x00005555555a3dd0 in ?? () at runtime/sched.c:128
#4  0x0000000100000000 in ?? ()
#5  0x0000100000082f00 in ?? ()
#6  0x00007ffbf19e6f01 in ?? ()
#7  0x00007fff88011b18 in ?? ()
#8  0x000055555556ffc0 in ?? () at ../../..//inc/internal/parallel.ipp:77
#9  0x000055555556ffa0 in ?? () at /usr/include/c++/9/bits/stl_deque.h:273
#10 0x0000000000000000 in ?? ()

Thanks for your help :)

zainryan commented 3 years ago

Hi, thanks for the information! This looks like a bug, which is intolerable. I will try to reproduce and fix it once I get a chance. Should be soon.

zainryan commented 3 years ago

Hi, I just ran the fig.11a code with local_ram=14G 100 times on my cloudlab instance (the one mentioned in the README), and didn't observe a segfault or deadlock. Maybe what you are facing is caused by some misconfiguration, or by actual bugs that are hard to trigger on my instance. In either case, I'd be happy to help if I'm able to ssh into your instance. You can send me an email (zainruan@mit.edu).

BinZlP commented 3 years ago

Thanks for your kindness! Then I'll send you an email :D

zainryan commented 3 years ago

Hi Han, the commit above should fix everything. Feel free to reopen this issue if you find anything wrong.

BinZlP commented 3 years ago

Hi Zain,

Okay, thanks for your quick fix! Have a good weekend :D

Best regards, Han
