Open Latrolage opened 2 years ago
Hello, How many GPUs do you have on your system? Are they AMD ones?
Yes, 1 amd and 1 nvidia
Could you please try something? Compile once with
cmake .. -DNVIDIA_SUPPORT=ON -DAMDGPU_SUPPORT=OFF
and once with
cmake .. -DNVIDIA_SUPPORT=OFF -DAMDGPU_SUPPORT=ON
In both cases, run nvtop and see whether you can reproduce the slowdown with only one vendor active.
It doesn't happen when only NVIDIA support is compiled in. It happens when AMDGPU support is compiled in.
Possibly scanning all the fds in /proc caused the lag? htop does this too so I wasn't concerned.
Could you run $ time strace -c path/to/nvtop, wait a few seconds, and exit nvtop? It should show something like:
$ time strace -c nvtop/build/src/nvtop
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
64.39 0.380565 6 58393 259 newfstatat
14.45 0.085425 6 13474 4269 openat
10.44 0.061728 19 3205 getdents64
6.28 0.037105 4 9201 close
2.07 0.012231 3 3328 2 fcntl
1.16 0.006847 12 566 read
0.53 0.003108 37 82 1 ioctl
0.29 0.001725 4 400 kcmp
0.15 0.000871 4 214 write
0.07 0.000396 8 49 poll
0.05 0.000268 3 76 60 readlink
0.04 0.000246 246 1 execve
0.03 0.000201 4 47 mmap
0.02 0.000097 3 31 rt_sigaction
0.01 0.000082 4 18 lseek
0.01 0.000049 4 12 mprotect
0.01 0.000036 4 8 munmap
0.00 0.000026 2 11 pread64
0.00 0.000017 8 2 1 access
0.00 0.000015 0 19 brk
0.00 0.000003 3 1 getrandom
0.00 0.000002 2 1 arch_prctl
0.00 0.000002 2 1 set_tid_address
0.00 0.000002 2 1 set_robust_list
0.00 0.000002 2 1 prlimit64
0.00 0.000002 2 1 rseq
------ ----------- ----------- --------- --------- ----------------
100.00 0.591051 6 89143 4592 total
real 0m10.809s
user 0m0.353s
sys 0m1.749s
Also, approximately how many processes are running and how many fds are open? I.e., what's the output of $ ls -d /proc/{1..9}*/fd/* | wc -l
and $ ls /proc/{1..9}*/fd/ | wc -l
(if you run nvtop as root, the second command should also be run as root)?
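For reference, the same count can be sketched in C by walking the PID directories and counting the entries of each fd directory. This is only an illustration of the kind of traversal being discussed, not nvtop's actual code; the function names `count_dir_entries` and `count_open_fds` are made up for this example. Unlike the second `ls` command, it does not count the per-directory header and blank lines that `ls` prints when given several directories, so the totals will differ slightly.

```c
#include <assert.h>
#include <ctype.h>
#include <dirent.h>
#include <stdio.h>
#include <string.h>

/* Count the entries of one directory, skipping "." and "..".
 * Returns -1 if the directory cannot be opened (e.g. permission denied). */
long count_dir_entries(const char *path) {
  assert(path != NULL);
  DIR *dir = opendir(path);
  if (!dir)
    return -1;
  long count = 0;
  struct dirent *entry;
  while ((entry = readdir(dir)) != NULL) {
    if (strcmp(entry->d_name, ".") != 0 && strcmp(entry->d_name, "..") != 0)
      count++;
  }
  closedir(dir);
  return count;
}

/* Sum the open fds over every readable <proc_root>/<pid>/fd directory.
 * Stores the number of PID directories seen in *num_processes.
 * Returns -1 if proc_root itself cannot be opened. */
long count_open_fds(const char *proc_root, long *num_processes) {
  assert(proc_root != NULL && num_processes != NULL);
  DIR *proc = opendir(proc_root);
  if (!proc)
    return -1;
  long total = 0;
  *num_processes = 0;
  struct dirent *entry;
  while ((entry = readdir(proc)) != NULL) {
    if (!isdigit((unsigned char)entry->d_name[0]))
      continue; /* only PID directories start with a digit */
    char fd_path[512];
    snprintf(fd_path, sizeof fd_path, "%s/%s/fd", proc_root, entry->d_name);
    long n = count_dir_entries(fd_path);
    (*num_processes)++;
    if (n > 0)
      total += n; /* unreadable fd directories are skipped */
  }
  closedir(proc);
  return total;
}
```

Calling `count_open_fds("/proc", &procs)` as the same user that runs nvtop gives the number of fds that user is allowed to see.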
$ time strace -c /usr/bin/nvtop
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
60.03 0.090864 3 24608 1 newfstatat
14.18 0.021468 4 5342 1621 openat
10.68 0.016164 13 1205 getdents64
7.12 0.010783 2 3716 close
3.04 0.004606 6 659 read
2.35 0.003555 2 1342 fcntl
0.86 0.001304 2 490 kcmp
0.65 0.000989 3 276 write
0.42 0.000640 3 164 132 readlink
0.31 0.000464 7 61 1 ioctl
0.10 0.000144 1 78 poll
0.07 0.000111 2 40 lseek
0.05 0.000074 2 28 mmap
0.05 0.000072 1 47 rt_sigaction
0.03 0.000047 2 22 brk
0.02 0.000027 4 6 munmap
0.02 0.000025 2 10 mprotect
0.01 0.000014 7 2 2 connect
0.01 0.000011 5 2 socket
0.01 0.000008 4 2 1 access
0.00 0.000000 0 4 pread64
0.00 0.000000 0 1 execve
0.00 0.000000 0 2 1 arch_prctl
0.00 0.000000 0 1 set_tid_address
0.00 0.000000 0 1 set_robust_list
0.00 0.000000 0 1 prlimit64
0.00 0.000000 0 1 getrandom
0.00 0.000000 0 1 rseq
------ ----------- ----------- --------- --------- ----------------
100.00 0.151370 3 38112 1759 total
real 0m9.998s
user 0m0.047s
sys 0m0.519s
$ ls -d /proc/{1..9}*/fd/* | wc -l
[...redacted cannot open directory. permission denied stuff from ls]
4416
$ ls /proc/{1..9}*/fd/ | wc -l
[...redacted cannot open directory. permission denied stuff from ls]
4658
Video just so we are sure we are talking about the same issue:
The lag/stutter happens where the highlight jumps. I continue pressing the up/down arrow keys, and it catches up after it updates the screen.
real 0m9.998s user 0m0.047s sys 0m0.519s
To make sure: was the lag happening during this run? Because (0.519 + 0.047) / 9.998 = 5.7% busy, and I don't think that is high enough to cause major lag just from being busy.
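The busy estimate above is just CPU time divided by wall-clock time. A minimal sketch, with `cpu_busy_fraction` being a name invented for this example:

```c
#include <assert.h>

/* Fraction of wall-clock time a process spent on the CPU,
 * computed from the user/sys/real values printed by `time`. */
double cpu_busy_fraction(double user_s, double sys_s, double real_s) {
  assert(real_s > 0.0);
  return (user_s + sys_s) / real_s;
}
```

With the values from the run above, `cpu_busy_fraction(0.047, 0.519, 9.998)` is about 0.057. Note that this only measures CPU use: a process that spends its time blocked inside a slow system call is not busy in this sense, and `strace -c` reports system CPU time by default, so such a stall would not stand out in the table either (`strace -c -w` summarises wall-clock time per syscall instead).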
Yes, it had the same stutter as in the video
When it lags, is the entire screen laggy, or just nvtop? I'm wondering if it's nvtop itself being laggy, or nvtop doing something to the gpu causing the gpu to become laggy.
Just nvtop
I have no idea what's wrong then. I have 5443 fds opened by my user, 8248 fds total (sudo ls /proc/{1..9}*/fd/ | wc -l), and I'm experiencing no lag at all.
Let's see if @Syllo has a better idea. (I haven't read much of the UI code of nvtop)
From what I see in the video, it freezes when gathering the information, every second or so (which is the default update rate). The interface freezes because everything runs in the same thread.
I do not see that behavior on my system either, even when I increase the load with more processes/fds than what @Latrolage reported. I can observe a very slight slowdown when strace is running.
This might be exacerbated on systems with many AMD GPUs, in which case we will go through /proc many times.
I will think about refactoring the /proc traversal at some point and maybe put the info gathering in its own thread, but I don't see how to avoid the fstat calls.
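To illustrate what putting the info gathering in its own thread could look like, here is a rough pthread sketch. It is not nvtop's design: `struct snapshot`, `struct gatherer` and the `gatherer_*` functions are hypothetical, and the slow /proc or sysfs work is replaced by a sleep. The idea is that the worker publishes its latest result under a mutex, so the UI thread only ever waits for a short copy and never for the gathering itself.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <time.h>

/* Hypothetical result of one gathering pass. */
struct snapshot {
  unsigned long generation; /* incremented after every completed pass */
};

struct gatherer {
  pthread_mutex_t lock;
  struct snapshot latest; /* protected by lock */
  bool stop;              /* protected by lock */
  long slow_step_ms;      /* stand-in duration of one gathering pass */
  pthread_t thread;
};

static void *gather_loop(void *arg) {
  struct gatherer *g = arg;
  for (;;) {
    /* Stand-in for the slow work (walking /proc, reading sysfs files). */
    struct timespec ts = {.tv_sec = g->slow_step_ms / 1000,
                          .tv_nsec = (g->slow_step_ms % 1000) * 1000000L};
    nanosleep(&ts, NULL);
    /* Publish the result; the lock is held only for the update. */
    pthread_mutex_lock(&g->lock);
    bool stop = g->stop;
    if (!stop)
      g->latest.generation++;
    pthread_mutex_unlock(&g->lock);
    if (stop)
      return NULL;
  }
}

int gatherer_start(struct gatherer *g, long slow_step_ms) {
  assert(g != NULL && slow_step_ms >= 0);
  g->latest.generation = 0;
  g->stop = false;
  g->slow_step_ms = slow_step_ms;
  if (pthread_mutex_init(&g->lock, NULL) != 0)
    return -1;
  if (pthread_create(&g->thread, NULL, gather_loop, g) != 0) {
    pthread_mutex_destroy(&g->lock);
    return -1;
  }
  return 0;
}

/* Called from the UI thread: returns a copy of the latest snapshot. */
struct snapshot gatherer_peek(struct gatherer *g) {
  pthread_mutex_lock(&g->lock);
  struct snapshot copy = g->latest;
  pthread_mutex_unlock(&g->lock);
  return copy;
}

void gatherer_stop(struct gatherer *g) {
  pthread_mutex_lock(&g->lock);
  g->stop = true;
  pthread_mutex_unlock(&g->lock);
  pthread_join(g->thread, NULL);
  pthread_mutex_destroy(&g->lock);
}
```

In this sketch the UI loop would call `gatherer_peek` on every redraw and keep handling key presses in between, so a gathering pass that takes a second delays the numbers on screen but not the cursor. Stopping still waits for the pass in progress to finish.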
This issue is also seen in my case with an AMD GPU. Here is my GPU and its driver. The problem is also seen with btop.
26:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Ellesmere [Radeon RX 470/480/570/570X/580/580X/590] (rev ef)
Subsystem: Micro-Star International Co., Ltd. [MSI] Radeon RX 580 ARMOR 8G OC
Kernel driver in use: amdgpu
Kernel modules: amdgpu
This issue is due to pcie_bw reads not being threaded. pcie_bw causes a 1s sleep on each read, during which the nvtop thread stops.
@Syllo I suggest disabling pcie_bw read for amdgpu. pcie_bw is not supported from Vega20.
src/extract_gpuinfo_amdgpu.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/src/extract_gpuinfo_amdgpu.c b/src/extract_gpuinfo_amdgpu.c
index 39b20b9..3de1093 100644
--- a/src/extract_gpuinfo_amdgpu.c
+++ b/src/extract_gpuinfo_amdgpu.c
@@ -366,10 +366,12 @@ static void initDeviceSysfsPaths(struct gpu_info_amdgpu *gpu_info) {
// Open the PCIe bandwidth file for dynamic info gathering
gpu_info->PCIeBW = NULL;
+ /*
int pcieBWFD = openat(sysfsFD, "pcie_bw", O_RDONLY);
if (pcieBWFD) {
gpu_info->PCIeBW = fdopen(pcieBWFD, "r");
}
+ */
// Open the power cap file for dynamic info gathering
gpu_info->powerCap = NULL;
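One way to check the 1s stall described above on a given machine is to time a single read of the sysfs file. The helper below is a sketch written for this purpose, not part of nvtop, and `timed_read_ms` is a made-up name.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <time.h>
#include <unistd.h>

/* Return how long one read() of `path` takes, in milliseconds of
 * wall-clock time, or -1.0 if the file cannot be opened or read. */
double timed_read_ms(const char *path) {
  int fd = open(path, O_RDONLY);
  if (fd < 0) /* open() reports failure with -1, not 0 */
    return -1.0;
  char buf[256];
  struct timespec start, end;
  clock_gettime(CLOCK_MONOTONIC, &start);
  ssize_t n = read(fd, buf, sizeof buf);
  clock_gettime(CLOCK_MONOTONIC, &end);
  close(fd);
  if (n < 0)
    return -1.0;
  return (end.tv_sec - start.tv_sec) * 1e3 +
         (end.tv_nsec - start.tv_nsec) / 1e6;
}
```

Calling it on the card's pcie_bw file, for example `/sys/class/drm/card0/device/pcie_bw` (the card index is an assumption and differs between systems), should return roughly 1000 on an affected GPU, while the other sysfs files return almost immediately.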
E.g. when scrolling through the list of apps utilising the GPU with the arrow keys, it stutters after 3 or so entries: it stops scrolling even if you press the down/up arrow keys, and it jumps to where it's supposed to be after around a second.
It happens in the setup menu too. Edit: it happens with mouse scrolling too.
Also, is there a way to separate/distinguish which application is running on which GPU?