Syllo / nvtop

GPU & Accelerator process monitoring for AMD, Apple, Huawei, Intel, NVIDIA and Qualcomm

"Floating point exception (core dumped)" with certain terminal sizes / number of GPUs? #147

Closed EricCousineau-TRI closed 2 years ago

EricCousineau-TRI commented 2 years ago

I am using current master (09bead48) and get "Floating point exception (core dumped)" in my terminal at certain terminal sizes.

Environment

I am using Ubuntu 20.04 on an AWS EC2 instance (p3.16xlarge), which has 8 GPUs. I build + install with the following commands:

cd nvtop
mkdir -p build && cd build
cmake ..
make -j

I run this over an SSH session, using tmux (3.0a-2ubuntu0.3).

Reproduction

For certain terminal sizes, seemingly only on the EC2 instance, I get a segfault. For example, using (cols x lines):

On my local machine (2 GPUs), I cannot reproduce this error w/ the same screen size.

Extra

Example of measuring terminal size: https://stackoverflow.com/questions/263890/how-do-i-find-the-width-height-of-a-terminal-window

echo "$(tput cols) x $(tput lines)"
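
The same measurement can be taken from inside a C program, which may help when writing a reproducer. This is a minimal sketch assuming a POSIX system with the TIOCGWINSZ ioctl. The helper name terminal_size is hypothetical and is not part of nvtop.

```c
#include <sys/ioctl.h>
#include <unistd.h>

/* Hypothetical helper (not from nvtop): query the size of the terminal
 * attached to fd. Returns 0 and fills cols/lines on success, or -1 when
 * the ioctl fails, for example because fd is not a terminal. */
static int terminal_size(int fd, unsigned *cols, unsigned *lines) {
  struct winsize ws;
  if (ioctl(fd, TIOCGWINSZ, &ws) == -1)
    return -1;
  *cols = ws.ws_col;
  *lines = ws.ws_row;
  return 0;
}
```

Calling terminal_size(STDOUT_FILENO, &cols, &lines) in an interactive session should report the same values as the tput command, while a pipe or redirected output makes it return -1.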

EricCousineau-TRI commented 2 years ago

Using gdbserver to debug (it is hard to see a stack trace when curses and crashes are involved ;), I get the following stack trace:

Program received signal SIGFPE, Arithmetic exception.
0x000055555560c810 in compute_sizes_from_layout (devices_count=8, device_header_rows=3, device_header_cols=78, rows=26, cols=189, to_draw=0x603000000280, process_displayed=1951, 
    device_positions=0x7fffffffdc00, num_plots=0x611000000200, plot_positions=0x7fffffffddd0, map_device_to_plot=0x7fffffffdb60, process_position=0x7fffffffdd90, 
    setup_position=0x7fffffffddb0) at /home/ubuntu/nvtop/src/interface_layout_selection.c:380
380                   (max_plot_cols - cols_needed_box_drawing) % num_info_per_plot[j];
(gdb) bt
#0  0x000055555560c810 in compute_sizes_from_layout (devices_count=8, device_header_rows=3, device_header_cols=78, rows=26, cols=189, to_draw=0x603000000280, process_displayed=1951, 
    device_positions=0x7fffffffdc00, num_plots=0x611000000200, plot_positions=0x7fffffffddd0, map_device_to_plot=0x7fffffffdb60, process_position=0x7fffffffdd90, 
    setup_position=0x7fffffffddb0) at /home/ubuntu/nvtop/src/interface_layout_selection.c:380
#1  0x00005555555ef505 in initialize_all_windows (dwin=0x611000000180) at /home/ubuntu/nvtop/src/interface.c:295
#2  0x0000555555606e85 in update_window_size_to_terminal_size (inter=0x611000000180) at /home/ubuntu/nvtop/src/interface.c:1805
#3  0x00005555555ead25 in main (argc=1, argv=0x7fffffffe468) at /home/ubuntu/nvtop/src/nvtop.c:327
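
The faulting line in the trace takes a remainder with num_info_per_plot[j] as the divisor. In C, integer division or remainder by zero is undefined behavior, and on x86 Linux it is reported as SIGFPE, so a zero entry in that array is one plausible cause (nothing in this thread confirms it). Below is a minimal sketch of a guard. It is not the change made on the fix_147 branch, and the helper name leftover_plot_cols and its parameter list are hypothetical.

```c
/* Hypothetical helper (not from nvtop): columns left over after the
 * usable plot width is split evenly among the metrics drawn in one plot.
 * Returns 0 instead of evaluating `x % 0` when the plot has no metrics,
 * and also when the box drawing alone already fills the plot width. */
static unsigned leftover_plot_cols(unsigned max_plot_cols,
                                   unsigned cols_needed_box_drawing,
                                   unsigned num_info) {
  if (num_info == 0 || max_plot_cols <= cols_needed_box_drawing)
    return 0;
  return (max_plot_cols - cols_needed_box_drawing) % num_info;
}
```

Whether returning 0 is the right fallback depends on how the caller uses the result. An alternative is to skip plots with no metrics before any size computation.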

EricCousineau-TRI commented 2 years ago

Tried some naive fixes, but got some other errors. Help would be appreciated :sweat_smile:

Syllo commented 2 years ago

Hello @EricCousineau-TRI,

Thanks for the bug report.

I will create proper unit tests. This will make our lives easier while debugging this issue. ncurses is wonderful when you don't have to debug :wink:

Syllo commented 2 years ago

Could you test the branch fix_147 please?

EricCousineau-TRI commented 2 years ago

Works wonderfully! Tested with view size per above, using a1bdc96 and -DBUILD_TESTING=OFF. Confirmed that old master build still segfaults.

Syllo commented 2 years ago

Great, thanks for your help. I'll do some more testing and merge it shortly.

Syllo commented 2 years ago

Merged into master. I will do a minor release with all that has been fixed. I am just waiting for some feedback/bugs that usually manifest within a week or two.