Syllo / nvtop

GPU & Accelerator process monitoring for AMD, Apple, Huawei, Intel, NVIDIA and Qualcomm

nvtop freezes when run on high-count gpu instances on aws #189

Open menkaur opened 1 year ago

menkaur commented 1 year ago

Recently, I tried running nvtop on p3.16xlarge and p4d.24xlarge instances on AWS. In both cases the program never displayed everything correctly and did not respond to my attempts to quit. nvtop works correctly on smaller instances such as p3.8xlarge, so I suspect that the number of GPUs is the problem. I'm using the following AMI ID: ami-0b7ff1a8d69f1bb35, without much additional configuration.

Syllo commented 1 year ago

How many GPUs are we talking about here? From a quick search online I see that the 16xlarge has 8x NVIDIA Tesla V100 GPUs and the 8xlarge has only 4, right?

I have seen nvtop used on machines with more GPUs than that. Could you try again with different terminal sizes? If that works, could you give me the size that has the issue so that I can look into it?

Thanks

menkaur commented 1 year ago

I have tried it on both p4d.24xlarge and p3.16xlarge. They have 8 GPUs each. I've moved away from these instances since they are not what I was looking for, but nvtop failed to start and display anything on both, so it's probably related to the GPU count. I remember checking the nvtop process in htop: it looked like it was consuming 100% of one core's time, and its memory usage kept increasing until I killed it. When an instance has only 4 GPUs, nvtop works without issues.
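
For anyone trying to reproduce this, the memory growth described above could be confirmed by sampling the process's resident set size over time. The following is only a minimal sketch, assuming a Linux host where `/proc/<pid>/status` is readable; the function names are hypothetical helpers and not part of nvtop.

```python
import re
import time
from pathlib import Path


def parse_vmrss_kb(status_text):
    """Return the VmRSS value in kB from /proc/<pid>/status text, or None if absent."""
    match = re.search(r"^VmRSS:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    return int(match.group(1)) if match else None


def is_steadily_growing(samples):
    """True if there are at least two samples and each one exceeds the previous."""
    return len(samples) >= 2 and all(b > a for a, b in zip(samples, samples[1:]))


def sample_rss(pid, count=10, interval=1.0):
    """Collect `count` VmRSS readings (kB) for `pid`, `interval` seconds apart."""
    samples = []
    for _ in range(count):
        # Assumption: the process is still alive; a vanished PID raises FileNotFoundError.
        text = Path(f"/proc/{pid}/status").read_text()
        rss = parse_vmrss_kb(text)
        if rss is not None:
            samples.append(rss)
        time.sleep(interval)
    return samples
```

With the PID of the hung process (for example from `pgrep nvtop`), `is_steadily_growing(sample_rss(pid))` returning `True` would be consistent with the leak-like behavior reported here, though it does not by itself identify the cause.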

