Open nournadar opened 3 years ago
Your description is quite short, so here are some guesses:
After sudo modprobe msr, do you see /dev/cpu/*/msr files (ls -la /dev/cpu/*/msr)?
Otherwise, please supply your installation procedure and your changes to the config.mk file.
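The two checks asked for above can be bundled into one small script. This is only a sketch: the function names count_msr_files and msr_module_loaded are made up for illustration, and the script assumes a Linux system where loaded modules are listed in /proc/modules.

```shell
#!/bin/sh
# Count per-CPU msr device nodes below a directory (default: /dev/cpu).
count_msr_files() {
    dir="${1:-/dev/cpu}"
    n=0
    for f in "$dir"/*/msr; do
        # With no match the glob stays literal, so test for existence.
        if [ -e "$f" ]; then
            n=$((n + 1))
        fi
    done
    echo "$n"
}

# Succeed if the msr module appears in a module listing
# (default: /proc/modules). A kernel with msr built in will not
# list it there even though the device files exist.
msr_module_loaded() {
    grep -q '^msr ' "${1:-/proc/modules}" 2>/dev/null
}

if msr_module_loaded; then
    echo "msr module: loaded"
else
    echo "msr module: not listed"
fi
echo "msr device files: $(count_msr_files)"
echo "online CPUs:      $(getconf _NPROCESSORS_ONLN)"
```

If the number of msr device files is lower than the number of online CPUs, that would be worth mentioning in the report.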
Secure boot is not enabled
kernel version is: Linux 4.15.0-135-generic
I can see the msr files (this is a sample):
x_abouelna@almaha:~$ ls -la /dev/cpu/*/msr
crw------- 1 root root 202, 0 Feb 4 16:47 /dev/cpu/0/msr
crw------- 1 root root 202, 10 Feb 4 16:47 /dev/cpu/10/msr
crw------- 1 root root 202, 11 Feb 4 16:47 /dev/cpu/11/msr
crw------- 1 root root 202, 12 Feb 4 16:47 /dev/cpu/12/msr
crw------- 1 root root 202, 13 Feb 4 16:47 /dev/cpu/13/msr
crw------- 1 root root 202, 14 Feb 4 16:47 /dev/cpu/14/msr
crw------- 1 root root 202, 15 Feb 4 16:47 /dev/cpu/15/msr
crw------- 1 root root 202, 16 Feb 4 16:47 /dev/cpu/16/msr
crw------- 1 root root 202, 17 Feb 4 16:47 /dev/cpu/17/msr
crw------- 1 root root 202, 18 Feb 4 16:47 /dev/cpu/18/msr
crw------- 1 root root 202, 19 Feb 4 16:47 /dev/cpu/19/msr
crw------- 1 root root 202, 1 Feb 4 16:47 /dev/cpu/1/msr
crw------- 1 root root 202, 20 Feb 4 16:47 /dev/cpu/20/msr
crw------- 1 root root 202, 21 Feb 4 16:47 /dev/cpu/21/msr
crw------- 1 root root 202, 22 Feb 4 16:47 /dev/cpu/22/msr
It is installed on an NFS share and works without sudo on some servers but not on others.
mount --show-labels | grep /opt/xxx
/etc/auto.master.d/opt-xxx.mount on /opt/xxx type autofs (rw,relatime,fd=6,pgrp=1915,timeout=300,minproto=5,maxproto=5,direct,pipe_ino=31142)
XXXXXXX:/opt/xxx on /opt/xxx type nfs4 (rw,relatime,vers=4.2,rsize=524288,wsize=524288,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=XXXXXX,local_lock=none,addr=XXXXXX)
If I remember correctly, NFSv4 (your /opt mountpoint) does not allow suid-root binaries.
You could try this:
$ mkdir /tmp/likwid
$ cp /opt/xxx/sbin/likwid-accessD /tmp/likwid
$ chown root:root /tmp/likwid/likwid-accessD
$ chmod u+s /tmp/likwid/likwid-accessD
$ export PATH=/tmp/likwid:$PATH
$ likwid-perfctr -C 0 -g L3 hostname
LIKWID searches for the access daemon in PATH, and with this setup it should find the local one before the one on NFS. You can verify which daemon it uses by running with -V 3:
$ likwid-perfctr -C 0 -g L3 -V 3 hostname
DEBUG - [access_client_startDaemon:137] Starting daemon XXX/sbin/likwid-accessD
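Whether a copied daemon can actually run as suid root depends on two things that are easy to check: the set-user-ID bit on the file and whether the filesystem holding it is mounted with nosuid. The sketch below checks both. The helper names are hypothetical, the second path is only a placeholder for wherever the daemon is installed, and findmnt from util-linux is assumed to be available.

```shell
#!/bin/sh
# Succeed if the file has the set-user-ID bit set.
has_suid() {
    [ -u "$1" ]
}

# Succeed if a comma-separated mount option string contains "nosuid".
opts_have_nosuid() {
    case ",$1," in
        *,nosuid,*) return 0 ;;
        *) return 1 ;;
    esac
}

# Print a short report for one daemon binary.
check_daemon() {
    bin="$1"
    if [ ! -e "$bin" ]; then
        echo "$bin: not found"
        return 0
    fi
    if has_suid "$bin"; then
        echo "$bin: suid bit set"
    else
        echo "$bin: suid bit NOT set"
    fi
    # findmnt prints the options of the mount that holds the file.
    opts=$(findmnt -n -o OPTIONS --target "$bin" 2>/dev/null) || opts=""
    if opts_have_nosuid "$opts"; then
        echo "$bin: filesystem is mounted nosuid, the suid bit is ignored"
    fi
}

check_daemon /tmp/likwid/likwid-accessD
check_daemon /usr/sbin/likwid-accessD   # placeholder: the installed daemon
```

The owner matters as well: the suid bit only grants root privileges if the file is owned by root, which ls -la on the daemon shows.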
I did the above and it is still not working. As far as I know, NFSv4 supports suid-root binaries. It is already working on 2 of my other servers with a similar config but different OS versions.
Does likwid-accessD report anything in syslog?
What about likwid-accessD on the NFS? How are the permissions?
What is the output with -V 3? Did it find the right access daemon?
DEBUG - [access_client_startDaemon:185] Still waiting for socket /tmp/likwid-41856 for CPU 0...
DEBUG - [access_client_startDaemon:185] Still waiting for socket /tmp/likwid-41856 for CPU 0...
DEBUG - [access_client_startDaemon:185] Still waiting for socket /tmp/likwid-41856 for CPU 0...
DEBUG - [access_client_startDaemon:185] Still waiting for socket /tmp/likwid-41856 for CPU 0...
DEBUG - [access_client_startDaemon:185] Still waiting for socket /tmp/likwid-41856 for CPU 0...
ERROR - [./src/access_client.c:190] No such file or directory
Exiting due to timeout: The socket file at '/tmp/likwid-41856' could not be
opened within 10 seconds. Consult the error message above
this to find out why. If the error is 'no such file or directoy',
it usually means that likwid-accessD just failed to start.
Set DEBUG=true in config.mk and try again. The access daemon should print its UID and EUID into syslog.
Check the permissions of the daemon with ls -la <path_to_file> on a client system.
Was there a glibc update, so that you compiled LIKWID with a more recent glibc and are now trying to run it on installations with an "old" glibc? Can you try distinct local installations instead of one on the network FS?
Run likwid-accessD manually. It shouldn't do anything (it waits for connections for a few seconds and exits), but maybe it throws a segfault or another error.
Any updates?
I recompiled locally (no NFS share) and set DEBUG=true in config.mk.
I am still getting the same errors:
likwid-perfctr -C 0 -g L3 -V 3 hostname
--------------------------------------------------------------------------------
CPU name: Intel(R) Xeon(R) CPU E5-2650 0 @ 2.00GHz
CPU type: Intel Xeon SandyBridge EN/EP processor
CPU clock: 2.00 GHz
CPU family: 6
CPU model: 45
CPU short: sandybridgeEP
CPU stepping: 7
CPU features: FP ACPI MMX SSE SSE2 HTT TM RDTSCP MONITOR VMX EIST TM2 SSSE SSE4.1 SSE4.2 AES AVX SSE3
CPU arch: x86_64
--------------------------------------------------------------------------------
PERFMON version: 3
PERFMON number of counters: 4
PERFMON width of counters: 48
PERFMON number of fixed counters: 3
--------------------------------------------------------------------------------
[likwid-pin] Main PID -> hwthread 0 - OK
DEBUG - [HPMinit:98] Adjusting functions for x86 architecture in daemon mode
DEBUG - [access_client_startDaemon:137] Starting daemon /usr/sbin/likwid-accessD
DEBUG - [access_client_startDaemon:185] Still waiting for socket /tmp/likwid-28812 for CPU 0...
DEBUG - [access_client_startDaemon:185] Still waiting for socket /tmp/likwid-28812 for CPU 0...
DEBUG - [access_client_startDaemon:185] Still waiting for socket /tmp/likwid-28812 for CPU 0...
DEBUG - [access_client_startDaemon:185] Still waiting for socket /tmp/likwid-28812 for CPU 0...
DEBUG - [access_client_startDaemon:185] Still waiting for socket /tmp/likwid-28812 for CPU 0...
DEBUG - [access_client_startDaemon:185] Still waiting for socket /tmp/likwid-28812 for CPU 0...
DEBUG - [access_client_startDaemon:185] Still waiting for socket /tmp/likwid-28812 for CPU 0...
DEBUG - [access_client_startDaemon:185] Still waiting for socket /tmp/likwid-28812 for CPU 0...
DEBUG - [access_client_startDaemon:185] Still waiting for socket /tmp/likwid-28812 for CPU 0...
DEBUG - [access_client_startDaemon:185] Still waiting for socket /tmp/likwid-28812 for CPU 0...
DEBUG - [access_client_startDaemon:185] Still waiting for socket /tmp/likwid-28812 for CPU 0...
DEBUG - [access_client_startDaemon:185] Still waiting for socket /tmp/likwid-28812 for CPU 0...
DEBUG - [access_client_startDaemon:185] Still waiting for socket /tmp/likwid-28812 for CPU 0...
DEBUG - [access_client_startDaemon:197] Successfully opened socket /tmp/likwid-28812 to daemon for CPU 0
DEBUG - [HPMaddThread:143] Adding CPU 0 to access module
DEBUG - [access_client_check:505] Device check for dev 0 on CPU 0 with accessDaemon failed: no such pci device
DEBUG - [access_client_check:505] Device check for dev 0 on CPU 0 with accessDaemon failed: no such pci device
DEBUG - [access_client_check:505] Device check for dev 0 on CPU 0 with accessDaemon failed: no such pci device
sudo cat syslog | grep likwid
Mar 1 12:17:29 almaha kernel: [95143.296081] likwid-accessD[25489]: segfault at 19 ip 00007f3593702a6d sp 00007fffc81a5cb0 error 4 in libc-2.27.so[7f359366b000+1e7000]
Mar 1 12:18:15 almaha kernel: [95188.440470] likwid-accessD[25545]: segfault at 19 ip 00007f6c6c953a6d sp 00007ffc6f960ef0 error 4 in libc-2.27.so[7f6c6c8bc000+1e7000]
Mar 1 12:18:52 almaha kernel: [95226.069611] likwid-accessD[25582]: segfault at 19 ip 00007f85b4db7a6d sp 00007ffd36b5c3d0 error 4 in libc-2.27.so[7f85b4d20000+1e7000]
Mar 1 12:25:02 almaha kernel: [95596.118713] likwid-accessD[28763]: segfault at 19 ip 00007fda2fad5a6d sp 00007fffb56f7050 error 4 in libc-2.27.so[7fda2fa3e000+1e7000]
Mar 1 12:25:32 almaha kernel: [95625.929651] likwid-accessD[28800]: segfault at 19 ip 00007f5f2fa90a6d sp 00007fffdb072050 error 4 in libc-2.27.so[7f5f2f9f9000+1e7000]
Mar 1 12:25:56 almaha kernel: [95649.543198] likwid-accessD[28823]: segfault at 19 ip 00007fa57c7eaa6d sp 00007ffd809677e0 error 4 in libc-2.27.so[7fa57c753000+1e7000]
So the access daemon segfaults after the connection is established. Can you run the access daemon alone without LIKWID? Just call /usr/sbin/likwid-accessD, wait until it returns, and check the syslog for messages like "exiting due to timeout". This helps me to localize the problematic part.
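The kernel segfault lines in the syslog excerpt also carry a bit more information than the bare fact of the crash: subtracting the mapping base (the number before the + in the brackets) from the instruction pointer gives the offset inside libc where the fault happened. A small sketch of that arithmetic follows; segv_offset is a made-up helper name and the parsing assumes the line format shown in the excerpt.

```shell
#!/bin/sh
# Print the faulting offset inside the mapped object for one kernel
# "segfault at ... ip IP sp ... in OBJ[BASE+SIZE]" line: offset = IP - BASE.
segv_offset() {
    ip=$(printf '%s\n' "$1" | sed -n 's/.* ip \([0-9a-f]*\) .*/\1/p')
    base=$(printf '%s\n' "$1" | sed -n 's/.*\[\([0-9a-f]*\)+[0-9a-f]*]$/\1/p')
    if [ -z "$ip" ] || [ -z "$base" ]; then
        return 1
    fi
    printf '0x%x\n' $((0x$ip - 0x$base))
}

# First segfault line from the syslog excerpt.
line='Mar 1 12:17:29 almaha kernel: [95143.296081] likwid-accessD[25489]: segfault at 19 ip 00007f3593702a6d sp 00007fffc81a5cb0 error 4 in libc-2.27.so[7f359366b000+1e7000]'
segv_offset "$line"   # prints 0x97a6d
```

The first three lines of the excerpt give the same offset, 0x97a6d, so the daemon appears to crash at the same place in libc every time. Assuming the mapping starts at file offset 0 and debug symbols for libc are installed, a tool such as addr2line -f -e <path to libc> 0x97a6d may be able to name the function.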
Are all CPU IDs consecutively numbered? In your output of ls /dev/cpu/*/msr, I see only 15 files with a lot of missing IDs. And I have this comment in the likwid-accessD source: NOTICE: This assumes consecutive processor Ids!
The output of /proc/cpuinfo attached as a file would help.
The Intel E5-2650 has 8 cores. Does the machine have a single socket or multiple sockets? Is SMT enabled?
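The question about consecutive CPU IDs can also be answered mechanically. Note that ls sorts names as strings (10 comes before 2), and the earlier listing was labeled a sample, so a shortened paste can look like it has gaps. The sketch below sorts the IDs numerically and prints any that are missing from 0 up to the highest one found; both function names are invented for this example.

```shell
#!/bin/sh
# List the numeric CPU IDs that have an msr node, in numeric order.
msr_cpu_ids() {
    dir="${1:-/dev/cpu}"
    for f in "$dir"/*/msr; do
        if [ -e "$f" ]; then
            id=${f%/msr}       # strip the trailing /msr
            echo "${id##*/}"   # keep the last path component
        fi
    done | sort -n
}

# Print the IDs missing from 0..max; print nothing if they are consecutive.
missing_cpu_ids() {
    msr_cpu_ids "$1" | awk '
        { seen[$1] = 1; if ($1 + 0 > max) max = $1 + 0 }
        END {
            if (NR == 0) exit
            for (i = 0; i <= max; i++) if (!(i in seen)) print i
        }'
}

ids=$(msr_cpu_ids)
if [ -z "$ids" ]; then
    echo "no msr files found below /dev/cpu"
else
    missing=$(missing_cpu_ids)
    if [ -z "$missing" ]; then
        echo "msr CPU IDs are consecutive"
    else
        echo "missing msr CPU IDs:" $missing
    fi
fi
```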
I ran the access daemon alone and I am getting a timeout, as you suggested:
/dev/cpu$ sudo cat /var/log/syslog | grep access*
Mar 2 13:12:17 almaha accessD: AccessDaemon runs with UID 503649, eUID 503649
Mar 2 13:12:32 almaha accessD: exiting due to timeout - no client connected after 15 seconds.
This machine has 2 sockets and 8 cores per socket.
ls /dev/cpu/*/msr
/dev/cpu/0/msr /dev/cpu/13/msr /dev/cpu/17/msr /dev/cpu/20/msr /dev/cpu/24/msr /dev/cpu/28/msr /dev/cpu/31/msr /dev/cpu/6/msr
/dev/cpu/10/msr /dev/cpu/14/msr /dev/cpu/18/msr /dev/cpu/21/msr /dev/cpu/25/msr /dev/cpu/29/msr /dev/cpu/3/msr /dev/cpu/7/msr
/dev/cpu/11/msr /dev/cpu/15/msr /dev/cpu/19/msr /dev/cpu/22/msr /dev/cpu/26/msr /dev/cpu/2/msr /dev/cpu/4/msr /dev/cpu/8/msr
/dev/cpu/12/msr /dev/cpu/16/msr /dev/cpu/1/msr /dev/cpu/23/msr /dev/cpu/27/msr /dev/cpu/30/msr /dev/cpu/5/msr /dev/cpu/9/msr
sudo lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 32
On-line CPU(s) list: 0-31
Thread(s) per core: 2
Core(s) per socket: 8
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 45
Model name: Intel(R) Xeon(R) CPU E5-2650 0 @ 2.00GHz
Stepping: 7
CPU MHz: 1261.677
CPU max MHz: 2800.0000
CPU min MHz: 1200.0000
BogoMIPS: 4000.34
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 20480K
NUMA node0 CPU(s): 0-7,16-23
NUMA node1 CPU(s): 8-15,24-31
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx lahf_lm epb pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid xsaveopt dtherm ida arat pln pts md_clear flush_l1d
SMT is enabled
This all looks fine. All MSR files are present. It might be the PCI devices, but debugging the access daemon is hard because it daemonizes itself for security reasons. I was not able to use GDB on the access daemon because it forks itself twice while GDB follows only one child.
I created a test for you to dig down to where the problem in the daemon is. The attached tarball consists of an extracted access daemon and a client. The description of what to do is in the README file, or type make usage. You will need gdb to debug the issue, and you need to run it as root!
test-daemon-sandyep.zip
I am not sure if I am doing this correctly. I followed your steps, but I am getting errors when I start the daemon. Do I need to load something before this? Should I put this directory in a specific location? Sorry for my lack of knowledge in this area.
xxxx@xxxx:~/test-daemon-sandyep$ make daemon
cc -g -DLIKWIDSOCKETBASE=/tmp/likwid -I. accessDaemon.c -o access-daemon
xxxx@xxxx:~/test-daemon-sandyep$ make client
cc -g -DLIKWIDSOCKETBASE=/tmp/likwid connect_accessD.c -o access-client
xxxx@xxxx:~/test-daemon-sandyep$ make prepare_daemon
sudo chown root:root access-daemon
sudo chmod u+s access-daemon
xxxx@xxxx:~/test-daemon-sandyep$ make start_
start_client start_daemon start_debugger
xxxx@xxxx:~/test-daemon-sandyep$ make start_
start_client start_daemon start_debugger
xxxx@xxxx:~/test-daemon-sandyep$ make start_daemon
./access-daemon
exiting due to timeout - no client connected after 15 seconds.
Makefile:16: recipe for target 'start_daemon' failed
make: *** [start_daemon] Error 1
You configured all parts successfully. You started only the daemon with make start_daemon but didn't execute the other two parts.
You need three different shells (windows, tabs, whatever you call them), one of them with root permissions (sudo bash), all in the test-daemon-sandyep folder. Use this order:
Shell Window 1:
make start_daemon
Shell Window 2 (with root permissions):
make start_debugger
Shell Window 3:
make start_client
I did exactly that and I am getting this in the debugger tab:
xxx@xxxx:~/test-daemon-sandyep# make start_debugger
gdb --command=server.gdb -p
gdb: option '-p' requires an argument
Use `gdb --help' for a complete list of options.
Makefile:22: recipe for target 'start_debugger' failed
make: *** [start_debugger] Error 1
When I run gdb and type bt or backtrace, I get "No stack." How do I proceed?
That's unfortunate. I tried to make it as convenient as possible for you.
You need the PID of the access-daemon executable, so after make start_daemon, run pidof access-daemon to get it and start the debugger like this:
gdb --command=server.gdb -p ACCESS_DAEMON_PID
Afterwards, proceed with make start_client.
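The manual lookup can be wrapped so that gdb is never started without a PID, which is presumably what happened with make start_debugger. This is a sketch only: debugger_cmd is a hypothetical helper, and it deliberately refuses anything that is not a single number, so it also stops if pidof reports more than one process.

```shell
#!/bin/sh
# Build the gdb command line for a PID. Reject an empty or non-numeric
# value, since "gdb -p" without an argument fails immediately.
debugger_cmd() {
    case "$1" in
        ''|*[!0-9]*)
            echo "no single running access-daemon found" >&2
            return 1
            ;;
    esac
    echo "gdb --command=server.gdb -p $1"
}

# pidof prints nothing and fails when the daemon is not running.
pid=$(pidof access-daemon 2>/dev/null) || pid=""
if cmd=$(debugger_cmd "$pid"); then
    echo "run as root: $cmd"
fi
```

Run it in the second shell right after make start_daemon in the first one, because the daemon exits on its own when no client connects in time.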
Is this what we're looking for? On the first machine I keep getting this:
make start_daemon
./access-daemon
ERROR - [accessDaemon.c:2007] bind failed - Address already in use
Makefile:16: recipe for target 'start_daemon' failed
make: *** [start_daemon] Error 1
So I ran the steps below on another non-working machine, a Broadwell with 2 sockets and 14 cores per socket:
~/test-daemon-sandyep# gdb --command=server.gdb -p 26715
GNU gdb (Ubuntu 8.1.1-0ubuntu1) 8.1.1
Copyright (C) 2018 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word".
Attaching to process 26715
Reading symbols from /home/x_abouelna/test-daemon-sandyep/access-daemon...done.
Reading symbols from /lib/x86_64-linux-gnu/libc.so.6...Reading symbols from /usr/lib/debug//lib/x86_64-linux-gnu/libc-2.27.so...done.
done.
Reading symbols from /lib64/ld-linux-x86-64.so.2...Reading symbols from /usr/lib/debug//lib/x86_64-linux-gnu/ld-2.27.so...done.
done.
0x00007f623a220674 in __libc_accept (fd=3, addr=..., len=0x7fff0483ea5c) at ../sysdeps/unix/sysv/linux/accept.c:26
26 ../sysdeps/unix/sysv/linux/accept.c: No such file or directory.
[Inferior 1 (process 26715) exited normally]
(gdb) bt
No stack.
(gdb) backtrace
No stack.
In the first case, the daemon was already running and you tried it again. The daemon creates a socket file (/tmp/likwid-123456) and this file already exists. Wait until it vanishes (around 30 seconds) or delete it yourself and try again.
In the second case, there is no segfault in the daemon. The reason is that the test daemon is tailored for Intel SandyBridge and doesn't even know the register/device list for Broadwell. I could update the daemon but we should figure it out on SandyBridge first.
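For the "Address already in use" case, a quick look at what is left in /tmp can save a retry. The sketch below lists leftover socket files and only suggests removal when no access-daemon process is running. likwid_sockets is an invented name, the likwid-* pattern is taken from the socket paths shown in this thread, and find is assumed to support -maxdepth (GNU and BSD find do).

```shell
#!/bin/sh
# List socket files named likwid-* directly inside a directory (default: /tmp).
likwid_sockets() {
    find "${1:-/tmp}" -maxdepth 1 -type s -name 'likwid-*' 2>/dev/null
}

if pidof access-daemon >/dev/null 2>&1; then
    echo "access-daemon is still running; wait for it to exit"
else
    for s in $(likwid_sockets); do
        echo "stale socket: $s (remove with: rm $s)"
    done
fi
```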
So I deleted the temp file on the SandyBridge machine and re-ran the steps above. This is what I am getting in the debugger:
~/test-daemon-sandyep# gdb --command=server.gdb -p 7879
GNU gdb (Ubuntu 8.1.1-0ubuntu1) 8.1.1
Copyright (C) 2018 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word".
Attaching to process 7879
Reading symbols from /home/x_abouelna/test-daemon-sandyep/access-daemon...done.
Reading symbols from /lib/x86_64-linux-gnu/libc.so.6...Reading symbols from /usr/lib/debug//lib/x86_64-linux-gnu/libc-2.27.so...done.
done.
Reading symbols from /lib64/ld-linux-x86-64.so.2...Reading symbols from /usr/lib/debug//lib/x86_64-linux-gnu/ld-2.27.so...done.
done.
0x00007fb2e7f5f674 in __libc_accept (fd=3, addr=..., len=0x7ffedc2c411c) at ../sysdeps/unix/sysv/linux/accept.c:26
26 ../sysdeps/unix/sysv/linux/accept.c: No such file or directory.
[Inferior 1 (process 7879) exited normally]
This output looks right. But there is no error, no segfault or anything, that's unexpected. I'll send you a new test daemon...
Sorry for the delay. I lost track of the issue. Is it still a problem on these machines?
I had the same issue, and while I am still not really sure of its root cause, the system installation was borked and the one I installed via Spack worked without problems. Just leaving this here in case other people are also scratching their heads 😆
It depends on the access mode and the default Spack installation uses the perf_event backend. This mode does not require msr files and proper permissions on them but checks perf_events' paranoid level. There is a way to install through Spack in the other access modes, but it requires the user to run an auto-generated script afterwards with root privileges to set proper permissions for LIKWID. This script does not check/touch the msr files though.
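To see which paranoid level a machine currently uses, the kernel exposes it as a file. The sketch below only reads and prints it; paranoid_level is a made-up helper name, and which level a given LIKWID setup needs is not stated in this thread, so the script does not judge the value.

```shell
#!/bin/sh
# Read the perf_event paranoid level (default: the kernel's sysctl file).
paranoid_level() {
    cat "${1:-/proc/sys/kernel/perf_event_paranoid}" 2>/dev/null
}

level=$(paranoid_level) || level=""
if [ -z "$level" ]; then
    echo "perf_event_paranoid not readable (no perf_event support?)"
else
    echo "perf_event_paranoid = $level (lower values are more permissive)"
fi
```

Changing the value requires root, for example through sysctl kernel.perf_event_paranoid=<level>.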
My users keep getting "cannot get access to MSRs. Please check permissions to the MSRs" when running likwid-perfctr -g CACHES -m ls, although I ran sudo modprobe msr. The command only works with sudo, but I need normal users to be able to run it without sudo.