accel-sim / accel-sim-framework

This is the top-level repository for the Accel-Sim framework.
https://accel-sim.github.io

Segmentation Fault during simulation in PTX mode #193

Closed beneslami closed 1 year ago

beneslami commented 1 year ago

Hi. I'm trying to run the backprop benchmark in PTX mode. After a successful compilation, I run ./backprop with a proper input file. After several seconds, a segmentation fault occurs. I ran the simulation under GDB and took a backtrace to find the cause of the segfault. This is the output:

GPGPU-Sim PTX: CUDA API function "unsigned int __cudaPushCallConfiguration(dim3, dim3, size_t, CUstream_st*)" has been called.

Thread 1 "backprop-rodini" received signal SIGSEGV, Segmentation fault.
__GI___libc_write (fd=1, buf=0x55555557ac30, nbytes=2) at ../sysdeps/unix/sysv/linux/write.c:24
24      ../sysdeps/unix/sysv/linux/write.c: No such file or directory.
(gdb) bt
#0  __GI___libc_write (fd=1, buf=0x55555557ac30, nbytes=2) at ../sysdeps/unix/sysv/linux/write.c:24
#1  0x00007ffff78f9e8d in _IO_new_file_write (f=0x7ffff7a586a0 <_IO_2_1_stdout_>, data=0x55555557ac30, n=2) at fileops.c:1176
#2  0x00007ffff78fb951 in new_do_write (to_do=2, 
    data=0x55555557ac30 "\n\nGPU-Sim PTX: CUDA API function \"unsigned int __cudaPushCallConfiguration(dim3, dim3, size_t, CUstream_st*)\" has been called.\n0.ptx:33 skipping new declaration\npute capability\nss map>}\nLR:tWR:nbkgrp:"..., fp=0x7ffff7a586a0 <_IO_2_1_stdout_>) at libioP.h:948
#3  _IO_new_do_write (to_do=2, 
    data=0x55555557ac30 "\n\nGPU-Sim PTX: CUDA API function \"unsigned int __cudaPushCallConfiguration(dim3, dim3, size_t, CUstream_st*)\" has been called.\n0.ptx:33 skipping new declaration\npute capability\nss map>}\nLR:tWR:nbkgrp:"..., fp=0x7ffff7a586a0 <_IO_2_1_stdout_>) at fileops.c:426
#4  _IO_new_do_write (fp=0x7ffff7a586a0 <_IO_2_1_stdout_>, 
    data=0x55555557ac30 "\n\nGPU-Sim PTX: CUDA API function \"unsigned int __cudaPushCallConfiguration(dim3, dim3, size_t, CUstream_st*)\" has been called.\n0.ptx:33 skipping new declaration\npute capability\nss map>}\nLR:tWR:nbkgrp:"..., to_do=2) at fileops.c:423
#5  0x00007ffff78fa6b5 in _IO_new_file_xsputn (n=36, data=<optimized out>, f=<optimized out>) at libioP.h:948
#6  _IO_new_file_xsputn (f=0x7ffff7a586a0 <_IO_2_1_stdout_>, data=<optimized out>, n=36) at fileops.c:1197
#7  0x00007ffff78e1972 in __vfprintf_internal (s=0x7ffff7a586a0 <_IO_2_1_stdout_>, format=0x7ffff7f19820 "\n\nGPGPU-Sim PTX: CUDA API function \"%s\" has been called.\n", 
    ap=ap@entry=0x7fffff7ff670, mode_flags=mode_flags@entry=2) at ../libio/libioP.h:948
#8  0x00007ffff799905b in ___printf_chk (flag=flag@entry=1, format=format@entry=0x7ffff7f19820 "\n\nGPGPU-Sim PTX: CUDA API function \"%s\" has been called.\n") at printf_chk.c:33
#9  0x00007ffff7c7489e in printf (__fmt=0x7ffff7f19820 "\n\nGPGPU-Sim PTX: CUDA API function \"%s\" has been called.\n") at /usr/include/x86_64-linux-gnu/bits/stdio2.h:107
#10 announce_call (func=<optimized out>) at cuda_runtime_api.cc:297
#11 0x00007ffff7c7a08d in __cudaPushCallConfiguration (gridDim=..., blockDim=..., sharedMem=93847901674944, stream=0x0) at cuda_runtime_api.cc:3596
#12 0x0000000000000000 in ?? ()
(gdb) 

BTW, I'm using gcc 9.4 and CUDA 11.0

Thank you in advance for your help

hmhmhey commented 1 year ago

Try using gcc (g++) 7.5.0. I had the same issue and downgrading gcc worked.
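
When switching compilers it is easy to end up with a binary that was still built by the old one. A minimal sketch of a check that reports the compiler a translation unit was built with is shown below; `compiler_version` is a hypothetical helper name, not part of Accel-Sim or GPGPU-Sim, and it relies only on the predefined `__GNUC__`, `__GNUC_MINOR__` and `__GNUC_PATCHLEVEL__` macros. Note that Clang also defines these macros for compatibility, so the result is only meaningful for a g++ build.

```cpp
#include <string>

// Hypothetical helper (not part of Accel-Sim): reports the version of the
// GCC-compatible compiler that built this translation unit.
std::string compiler_version() {
#if defined(__GNUC__)
  // Predefined by gcc/g++ at compile time, e.g. 7, 5 and 0 for gcc 7.5.0.
  return std::to_string(__GNUC__) + "." + std::to_string(__GNUC_MINOR__) +
         "." + std::to_string(__GNUC_PATCHLEVEL__);
#else
  return "unknown";
#endif
}
```

Printing the returned string from a small test program built with the same environment as the simulator is one way to confirm that the downgrade actually took effect.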

beneslami commented 1 year ago

Hi @hmhmhey, thank you for your reply. I downgraded gcc/g++ to 7.5 but still have the same problem. Is this problem caused by the gcc/g++ version? Do I have to try other gcc/g++ versions?

Actually, I also recompiled Accel-Sim with gcc 7.5 and tried to run backprop again. This time I get a different segfault, shown below:

GPGPU-Sim PTX: pushing kernel '_Z22bpnn_layerforward_CUDAPfS_S_S_ii' to stream 0, gridDim= (1,4096,1) blockDim = (16,16,1) 
GPGPU-Sim uArch: Shader 0 bind to kernel 1 '_Z22bpnn_layerforward_CUDAPfS_S_S_ii'
GPGPU-Sim uArch: CTA/core = 6, limited by: threads
GPGPU-Sim: Reconfigure L1 cache to 120KB

Thread 2 "backprop-rodini" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffff4553700 (LWP 396535)]
0x00007ffff7a4c2db in kernel_info_t::no_more_ctas_to_run (chiplet_id=<optimized out>, this=<optimized out>) at ../abstract_hardware_model.h:310
310         int ctaid = (next_cta_per_chiplet[chiplet_id].x + m_grid_dim.x * next_cta_per_chiplet[chiplet_id].y
(gdb) 

hmhmhey commented 1 year ago

Not sure why it worked, but changing the gcc version solved a libc segfault like your first one for me. I don't think the second one is related to the gcc version.

JRPan commented 1 year ago

I'm pretty sure we support gcc 9+. I don't know what you are using, but we certainly don't have next_cta_per_chiplet in Accel-Sim. Sorry, there's not much I can help with here.
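
For anyone debugging a similar fork-specific crash, the faulting line indexes a per-chiplet array with `chiplet_id` and flattens a CTA coordinate into a linear id. The sketch below illustrates that pattern with an explicit bounds check, so that a bad `chiplet_id` raises an exception instead of reading out of bounds. Everything here is an assumption made for illustration: `Dim3` and `KernelProgress` are hypothetical types, the x + X*y + X*Y*z layout is the usual row-major flattening rather than the fork's confirmed formula, and none of this is Accel-Sim or GPGPU-Sim code.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical stand-in for CUDA's dim3; not the simulator's own type.
struct Dim3 {
  unsigned x, y, z;
};

// Hypothetical bookkeeping, loosely modelled on the next_cta_per_chiplet
// name that appears in the backtrace. Not Accel-Sim code.
struct KernelProgress {
  Dim3 grid_dim;
  std::vector<Dim3> next_cta_per_chiplet;  // one CTA cursor per chiplet

  // Flatten a 3D CTA coordinate into a linear id: x + X*y + X*Y*z.
  std::size_t linear_cta_id(const Dim3& c) const {
    const std::size_t gx = grid_dim.x;
    const std::size_t gy = grid_dim.y;
    return c.x + gx * c.y + gx * gy * c.z;
  }

  // Total number of CTAs in the grid.
  std::size_t total_ctas() const {
    return static_cast<std::size_t>(grid_dim.x) * grid_dim.y * grid_dim.z;
  }

  // vector::at() throws std::out_of_range for an invalid chiplet_id,
  // turning a silent out-of-bounds read into a reportable error.
  bool no_more_ctas_to_run(std::size_t chiplet_id) const {
    const Dim3& next = next_cta_per_chiplet.at(chiplet_id);
    return linear_cta_id(next) >= total_ctas();
  }
};
```

With the grid from the log, gridDim = (1,4096,1), a cursor at (0,4096,0) flattens to 4096, which equals the total CTA count, so the check reports that no CTAs remain. If the per-chiplet vector was never sized for the configured number of chiplets, the `.at()` call would surface that immediately.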