RIPE-NCC / ripe-atlas-software-probe

GNU General Public License v3.0
230 stars 57 forks source link

SIGSEGV to perd on openwrt mt7621 #74

Open Ansuel opened 1 year ago

Ansuel commented 1 year ago

I'm running a probe with openwrt mt7621 and I notice lots of these error. Wonder if you are interested in these crash. Fell free to ask any question/help on debugging this.

Tue Oct  4 17:46:50 2022 kern.info kernel: [415035.046566] do_page_fault(): sending SIGSEGV to perd for invalid read access from 633c556a
Tue Oct  4 17:46:50 2022 kern.info kernel: [415035.054990] epc = 77ebab54 in libc.so[77e3b000+aa000]
Tue Oct  4 17:46:50 2022 kern.info kernel: [415035.060184] ra  = 77ebc914 in libc.so[77e3b000+aa000]
michel-stam commented 1 year ago

Hi Ansuel,

Thanks for letting us know. Do you have any idea what measurement the unit was running at the time of the segfault?

Is it consistent or intermittent?

Regards,

Michel

Ansuel commented 1 year ago

I got this everytime i restart the atlas bins, so i can repro this easily. Also I notice sometime this gets triggered but not in normal testing, if you want I can provide my probe id if that can be useful.

michel-stam commented 1 year ago

Hi @Ansuel,

Can you check what the memory usage is for this system?

Also, I have noticed various other references to OpenWRT and processes segfaulting in this manner. What release of OpenWRT are you on?

Regards,

Michel

Ansuel commented 1 year ago

@michel-stam Hi normally there is plenty of space... for example for my router running for 3 days with atlas tools i have 350mb of free ram.

I normally run everything on master OpenWrt SNAPSHOT r20751-6e9613844c

Linux version 5.15.69 (ansuel@Ansuel-xps) (mipsel-openwrt-linux-musl-gcc (OpenWrt GCC 12.2.0 r20746-a67f484e67) 12.2.0, GNU ld (GNU Binutils) 2.38) #0 SMP Sun Sep 25 14:07:16 2022

by what i notice online that kind of error are related to memory not correctly freed or handled by the bin but i can be totally wrong.

But again i can repro the error with a simple /etc/init.d/atlas restart so we have plenty of way to investigate this.

andreagdipaolo commented 1 year ago

Same issue here, the 5040 probe is correctly recognised as online but not sending data. Router running for 3 days with 200M free RAM. Probe never delivered any result.

Target Platform ramips/mt7621 Firmware Version OpenWrt 22.03.0 r19685-512e76967f / LuCI openwrt-22.03 branch git-22.245.77528-487e58a

Sat Oct 15 19:54:22 2022 user.notice atlas-sw-probe: Stopping atlas sw probe
Sat Oct 15 19:54:22 2022 user.notice atlas-sw-probe: Kill all atlas processes
Sat Oct 15 19:54:22 2022 daemon.err ATLAS[32264]: Terminated
Sat Oct 15 19:54:22 2022 user.notice atlas-sw-probe: Kill ssh tunnel
Sat Oct 15 19:54:23 2022 local6.err perd[7825]: crond (busybox 1.26.2) started, log level 8
Sat Oct 15 19:54:23 2022 local6.err eperd[7827]: crond (busybox 1.26.2) started, log level 8
Sat Oct 15 19:54:23 2022 local6.err eperd[7827]: root: No such file or directory
Sat Oct 15 19:56:00 2022 kern.info kernel: [271248.494076] do_page_fault(): sending SIGSEGV to eperd for invalid read access from 634af430
Sat Oct 15 19:56:00 2022 kern.info kernel: [271248.502500] epc = 77dc8e3c in libc.so[77d4a000+a9000]
Sat Oct 15 19:56:00 2022 kern.info kernel: [271248.508113] ra  = 77dcab5c in libc.so[77d4a000+a9000]
Sat Oct 15 19:57:30 2022 daemon.err ATLAS[7748]: condmv: not moving, destination '/usr/libexec/atlas-probe-scripts/data/out/v6addr.txt' exists
Sat Oct 15 19:57:30 2022 daemon.err ATLAS[7748]: condmv: not moving, destination '/usr/libexec/atlas-probe-scripts/data/out/simpleping' exists
Sat Oct 15 19:57:33 2022 local6.err perd[7825]: USER root pid   0 cmd httppost -A 9015 --set-time 2 --delete-file --maxpostsize 9000000 --post-header /home/atlas/status/p_to_c_report_header --post-dir /home/atlas/data/out --post-footer /home/atlas/status/con_session_id.txt -O /home/atlas/data/new/main_sent.vol http://127.0.0.1:8080/?PROBE_ID=[redacted]&SESSION_ID=[redacted] >> /home/atlas/data/new/main
Sat Oct 15 19:57:33 2022 kern.info kernel: [271341.096549] do_page_fault(): sending SIGSEGV to perd for invalid read access from 634af48d
Sat Oct 15 19:57:33 2022 kern.info kernel: [271341.104977] epc = 77e00e08 in libc.so[77d82000+a9000]
Sat Oct 15 19:57:33 2022 kern.info kernel: [271341.110248] ra  = 77e02b5c in libc.so[77d82000+a9000]

Available for further testing.

michel-stam commented 1 year ago

Hi,

Thanks for the debug. It seems indeed eperd is having some issues.

Would you be able to test on a stable OpenWRT release? Just to make sure there's not a development effort that causes the problem? I'm not excluding at this point there's a bug, but at the same time I don't have any other platform with this problem (amongst which is an ath79, which is also a MIPS, 2 ARM systems and some x86_64 systems).

We are in the process of pushing 5080 to the GIT repo, which is delayed somewhat.

Regards,

Michel

andreagdipaolo commented 1 year ago

Would you be able to test on a stable OpenWRT release?

Hi Michel, I was using a stable release, 2022.03, I was not using a snapshot. Are you referring to me, to Ansuel or to both of us?

Yesterday I updated to 22.03.2 test release and I encounter the same issue. Do you mean we should downgrade to 2021.02?

Thanks Andrea

michel-stam commented 1 year ago

Hey Andrea,

Sorry I thought I was responding to Ansuel, but thank you for testing with a stable release of OpenWRT.

It seems you're using the public OpenWRT package, which has changed the internals somewhat from the version RIPE NCC develops in-house. I cannot guarantee its function, sorry. That being said, I'm not unwilling to help, I'll do what I can.

Looking at the logfile:

Sat Oct 15 19:54:23 2022 local6.err eperd[7827]: root: No such file or directory
Sat Oct 15 19:56:00 2022 kern.info kernel: [271248.494076] do_page_fault(): sending SIGSEGV to eperd for invalid read access from 634af430
...
Sat Oct 15 19:57:33 2022 local6.err perd[7825]: USER root pid   0 cmd httppost -A 9015 --set-time 2 --delete-file --maxpostsize 9000000 --post-header /home/atlas/status/p_to_c_report_header --post-dir /home/atlas/data/out --post-footer /home/atlas/status/con_session_id.txt -O /home/atlas/data/new/main_sent.vol http://127.0.0.1:8080/?PROBE_ID=[redacted]&SESSION_ID=[redacted] >> /home/atlas/data/new/main
Sat Oct 15 19:57:33 2022 kern.info kernel: [271341.096549] do_page_fault(): sending SIGSEGV to perd for invalid read access from 634af48d

eperd is complaining its "crontab" file 'root' does not exist. that may mean that the path that the data files are stored do not exist.

Can you check whether all the paths referenced by eperd/perd/eooqd exist? -P references the PID file and should be a writable path to a file -c references a directory that should be writable. eeoqd has one argument referencing a prefix, make sure that the anything prior to the last / points to an existing directory (/var/atlas-probe/crons/oneoff is gonna create files prefixed with one-off in the /var/atlas-probe/crons directory).

Another area to test is the "telnetd" process. I recall that on the public OpenWRT versions there were problems starting that particular process. Please make sure that its running.

This is a directory structure you'll typically find on a working atlas software probe in /var/atlas-probe/crons:

drwxr-xr-x.  2 atlas atlas   6 Sep 14 11:42 10
drwxr-xr-x.  2 atlas atlas   6 Sep 14 11:42 11
drwxr-xr-x.  2 atlas atlas   6 Sep 14 11:42 12
drwxr-xr-x.  2 atlas atlas   6 Sep 14 11:42 13
drwxr-xr-x.  2 atlas atlas   6 Sep 14 11:42 14
drwxr-xr-x.  2 atlas atlas   6 Sep 14 11:42 15
drwxr-xr-x.  2 atlas atlas   6 Sep 14 11:42 16
drwxr-xr-x.  2 atlas atlas   6 Sep 14 11:42 17
drwxr-xr-x.  2 atlas atlas   6 Sep 14 11:42 18
drwxr-xr-x.  2 atlas atlas   6 Sep 14 11:42 19
drwxr-xr-x.  2 atlas atlas  37 Oct 17 14:05 2
drwxr-xr-x.  2 atlas atlas   6 Sep 14 11:42 20
drwxr-xr-x.  2 atlas atlas   6 Sep 14 11:42 3
drwxr-xr-x.  2 atlas atlas   6 Sep 14 11:42 4
drwxr-xr-x.  2 atlas atlas   6 Sep 14 11:42 5
drwxr-xr-x.  2 atlas atlas   6 Sep 14 11:42 6
drwxr-xr-x.  2 atlas atlas  18 Oct 17 14:05 7
drwxr-xr-x.  2 atlas atlas   6 Sep 14 11:42 8
drwxr-xr-x.  2 atlas atlas   6 Sep 14 11:42 9
drwxr-xr-x.  2 atlas atlas  18 Oct 17 14:06 main
-rw-r--r--.  1 atlas atlas  95 Oct 17 15:09 oneoff.curr

I hope this helps, please let me know.

Best regards

Michel

Ansuel commented 1 year ago

Just to add some info... from master and 22.03 there are not many difference related to this package (but i understand you want info from stable build)

eperd is complaining its "crontab" file 'root' does not exist. that may mean that the path that the data files are stored do not exist.

could be that crond is trying to find a home directory for root but he is not finding it? no idea how it works internally but i agree that we should understand where that comes from...

Can you check whether all the paths referenced by eperd/perd/eooqd exist? -P references the PID file and should be a writable path to a file -c references a directory that should be writable. eeoqd has one argument referencing a prefix, make sure that the anything prior to the last / points to an existing directory (/var/atlas-probe/crons/oneoff is gonna create files prefixed with one-off in the /var/atlas-probe/crons directory).

from the scripts the base_dir is set to /usr/libexec/atlas-probe-scripts and it does correctly exist... from there cron dir and the run dir correctly exist and the pid are written to it... what i notice is that cron never had files in them... every directory is empty...

Another area to test is the "telnetd" process. I recall that on the public OpenWRT versions there were problems starting that particular process. Please make sure that its running.

do you have a way to test this? From what i can see the process should be running but i wonder if there is a communication problem... but thinking about it, if telentd wasn't working then the probe should not register at all right?


Actually now that i'm checking it i have file in cron directory... but i can clearly find the problem here...

this is from main

image

The cron values are absolute path and doesn't follow the BASE_DIR variable...

let me do a test but this seems a problem that should be fixed on your end as it doesn't make sense to provide a BASE_DIR variable where each internal script can use a different base_dir and then hardcode it in the cron that actually execute stuff...

But correct me if i'm wrong... anyway while at it I will try to make a symbolic link from that hardcoded path to the real one.

michel-stam commented 1 year ago

Hi Ansuel,

Difficult to say. The author of the package wrote his own scripts around the atlas startup process, which doesn't seem to work perfectly. The existing process in the RIPE NCC version isn't perfect either, and will be looked at at some point in the future. Not that that helps you right now.

The code indeed seems to reference /home/atlas (probe-busybox/include/libbb.h), which as you suggest could be a softlink to where you'd like to store the data. Its not ideal, but it is a quick fix until we have had the time to refactor this bit of code. I am hoping you should start to see improvements in this area as of 5090.

Ansuel commented 1 year ago

Main problem is that with the workaround in place still we have

Mon Oct 17 16:17:04 2022 kern.info kernel: [749105.311276] do_page_fault(): sending SIGSEGV to eperd for invalid read access from 634d63e0
Mon Oct 17 16:17:04 2022 kern.info kernel: [749105.319726] epc = 77dd6b88 in libc.so[77d57000+aa000]
Mon Oct 17 16:17:04 2022 kern.info kernel: [749105.324967] ra  = 77dd8914 in libc.so[77d57000+aa000]
Mon Oct 17 16:17:11 2022 local6.err perd[26702]: USER root pid   0 cmd httppost -A 9015 --set-time 2 --delete-file --maxpostsize 9000000 --post-header /home/atlas/status/p_to_c_report_header --post-dir /home/atlas/data/out --post-footer /home/atlas/status/con_session_id.txt -O /home/atlas/data/new/main_sent.vol http://127.0.0.1:8080/?PROBE_ID=[redact]&SESSION_ID=[redact] >> /home/atlas/data/new/main
Mon Oct 17 16:17:11 2022 kern.info kernel: [749112.063571] do_page_fault(): sending SIGSEGV to perd for invalid read access from 634d63e7
Mon Oct 17 16:17:11 2022 kern.info kernel: [749112.072000] epc = 77eabb54 in libc.so[77e2c000+aa000]
Mon Oct 17 16:17:11 2022 kern.info kernel: [749112.077229] ra  = 77ead914 in libc.so[77e2c000+aa000]

and now data/new/main is empty :(

The command works correctly tho (by manually running it)

root@Ansuel-Router:/usr/libexec/atlas-probe/bin# total size in dir: 8325
httppost: before getaddrinfo
httppost: before connect
httppost: sending request
posting file '/home/atlas/data/out/simpleping'
posting file '/home/atlas/data/out/v6addr.txt'
httppost: getting result
httppost: getting reply headers
httppost: got line 'Server: nginx'
httppost: got line 'Date: Mon, 17 Oct 2022 14:21:33 GMT'
httppost: got line 'Content-Type: text/plain'
httppost: got line 'Transfer-Encoding: chunked'
httppost: got line 'Connection: close'
httppost: writing output
httppost: chunked 1, content_length -1
httppost: got chunk line '3'
httppost: chunk data 'OK
'
httppost: got chunk line '0'
httppost: deleting files
unlinking file '/home/atlas/data/out/simpleping'
unlinking file '/home/atlas/data/out/v6addr.txt'
httppost: done
Ansuel commented 1 year ago

@michel-stam ok can confirm that both eperd and perd crash with the following command

these are the main cron program and because of this nothing is send as the periodical cron are not run... the process is not restored so the probe is leaved in this unstable state where it's registered but it does nothing. Fun thing is that as they crash, the pid is still there. (would be an idea to add some check if these 2 crucial program are actually running by checking the pid in the run dir actually exist in the system?)

./eperd -c /usr/libexec/atlas-probe-scripts/crons/7 -A 9
807 -P /usr/libexec/atlas-probe-scripts/run/perd-7.pid.vol -O /home/atlas/data/new/7 -i 7
./perd -c /usr/libexec/atlas-probe-scripts/crons/main -A
 9801 -P /usr/libexec/atlas-probe-scripts/run/perd-main.pid.vol
michel-stam commented 1 year ago

Hi Ansuel,

Thanks for the help, appreciate it.

We actually added a restart functionality in 5060, and fixed some issues in 5070 with that. However, if the process continuously crashes that isn't gonna help a lot, it will just restart it over and over.

Do you have the option of running the perd and eperd process through strace to see where the crash occurs? I'm half expecting some directory or path to not exist.

Regards,

Michel

Ansuel commented 1 year ago

Do you have the option of running the perd and eperd process through strace to see where the crash occurs? I'm half expecting some directory or path to not exist.

is a debug build needed (for the atlas busybox) or i can just run the command with strace?

Ansuel commented 1 year ago

@michel-stam this is strace

root@Ansuel-Router:/usr/libexec/atlas-probe/bin# strace ./perd -f -c /usr/libexec/atlas-probe-scripts/cro
ns/main -A 9801 -P /usr/libexec/atlas-probe-scripts/run/perd-main.pid.vol
execve("./perd", ["./perd", "-f", "-c", "/usr/libexec/atlas-probe-scripts"..., "-A", "9801", "-P", "/usr/                                                                                    libexec/atlas-probe-scripts"...], 0x7fdd5c4c /* 14 vars */) = 0
set_thread_area(0x77ef5df0)             = 0
set_tid_address(0x77eed69c)             = 28384
brk(NULL)                               = 0x1a46000
brk(0x1a48000)                          = 0x1a48000
mmap2(0x1a46000, 4096, PROT_NONE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x1a46000
open("/etc/ld-musl-mipsel-sf.path", O_RDONLY|O_LARGEFILE|O_CLOEXEC) = -1 ENOENT (No such file or director                                                                                    y)
open("/lib/libcrypto.so.1.1", O_RDONLY|O_LARGEFILE|O_CLOEXEC) = -1 ENOENT (No such file or directory)
open("/usr/local/lib/libcrypto.so.1.1", O_RDONLY|O_LARGEFILE|O_CLOEXEC) = -1 ENOENT (No such file or dire                                                                                    ctory)
open("/usr/lib/libcrypto.so.1.1", O_RDONLY|O_LARGEFILE|O_CLOEXEC) = 3
fcntl64(3, F_SETFD, FD_CLOEXEC)         = 0
statx(3, "", AT_STATX_SYNC_AS_STAT|AT_EMPTY_PATH, STATX_BASIC_STATS, {stx_mask=STATX_BASIC_STATS|STATX_MN                                                                                    T_ID, stx_attributes=0, stx_mode=S_IFREG|0644, stx_size=2185599, ...}) = 0
read(3, "\177ELF\1\1\1\0\0\0\0\0\0\0\0\0\3\0\10\0\1\0\0\0\0\0\0\0004\0\0\0"..., 936) = 936
mmap2(NULL, 2269184, PROT_READ|PROT_EXEC, MAP_PRIVATE, 3, 0) = 0x77c08000
mmap2(0x77e15000, 118784, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED, 3, 0x1fd000) = 0x77e15000
mmap2(0x77e2e000, 16384, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x77e2e000
close(3)                                = 0
open("/lib/libssl.so.1.1", O_RDONLY|O_LARGEFILE|O_CLOEXEC) = -1 ENOENT (No such file or directory)
open("/usr/local/lib/libssl.so.1.1", O_RDONLY|O_LARGEFILE|O_CLOEXEC) = -1 ENOENT (No such file or directo                                                                                    ry)
open("/usr/lib/libssl.so.1.1", O_RDONLY|O_LARGEFILE|O_CLOEXEC) = 3
fcntl64(3, F_SETFD, FD_CLOEXEC)         = 0
statx(3, "", AT_STATX_SYNC_AS_STAT|AT_EMPTY_PATH, STATX_BASIC_STATS, {stx_mask=STATX_BASIC_STATS|STATX_MN                                                                                    T_ID, stx_attributes=0, stx_mode=S_IFREG|0644, stx_size=499355, ...}) = 0
read(3, "\177ELF\1\1\1\0\0\0\0\0\0\0\0\0\3\0\10\0\1\0\0\0\0\0\0\0004\0\0\0"..., 936) = 936
mmap2(NULL, 565248, PROT_READ|PROT_EXEC, MAP_PRIVATE, 3, 0) = 0x77b7e000
mmap2(0x77c00000, 32768, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED, 3, 0x72000) = 0x77c00000
close(3)                                = 0
open("/lib/libgcc_s.so.1", O_RDONLY|O_LARGEFILE|O_CLOEXEC) = 3
fcntl64(3, F_SETFD, FD_CLOEXEC)         = 0
statx(3, "", AT_STATX_SYNC_AS_STAT|AT_EMPTY_PATH, STATX_BASIC_STATS, {stx_mask=STATX_BASIC_STATS|STATX_MN                                                                                    T_ID, stx_attributes=0, stx_mode=S_IFREG|0644, stx_size=82199, ...}) = 0
read(3, "\177ELF\1\1\1\0\0\0\0\0\0\0\0\0\3\0\10\0\1\0\0\0\0\0\0\0004\0\0\0"..., 936) = 936
mmap2(NULL, 151552, PROT_READ|PROT_EXEC, MAP_PRIVATE, 3, 0) = 0x77b59000
mmap2(0x77b7c000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED, 3, 0x13000) = 0x77b7c000
close(3)                                = 0
mprotect(0x77e15000, 86016, PROT_READ)  = 0
mprotect(0x77c00000, 20480, PROT_READ)  = 0
mprotect(0x77b7c000, 4096, PROT_READ)   = 0
mprotect(0x471000, 4096, PROT_READ)     = 0
getuid()                                = 0
chdir("/usr/libexec/atlas-probe-scripts/crons/main") = 0
socket(AF_UNIX, SOCK_DGRAM|SOCK_CLOEXEC, 0) = 3
connect(3, {sa_family=AF_UNIX, sun_path="/dev/log"}, 12) = 0
getpid()                                = 28384
sendto(3, "<179>Oct 17 14:53:37 perd[28384]"..., 78, 0, NULL, 0) = 78
open("/dev/urandom", O_RDONLY|O_LARGEFILE) = 4
read(4, "\275\271\5~", 4)               = 4
close(4)                                = 0
unlink("cron.update")                   = -1 ENOENT (No such file or directory)
chdir("/usr/libexec/atlas-probe-scripts/crons/main") = 0
open(".", O_RDONLY|O_LARGEFILE|O_CLOEXEC|O_DIRECTORY) = 4
fcntl64(4, F_SETFD, FD_CLOEXEC)         = 0
mmap2(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x77ee9000
getdents64(4, 0x77ee9038 /* 3 entries */, 2048) = 72
open("/etc/passwd", O_RDONLY|O_LARGEFILE|O_CLOEXEC) = 5
fcntl64(5, F_SETFD, FD_CLOEXEC)         = 0
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x77ee8000
fcntl64(5, F_SETFD, FD_CLOEXEC)         = 0
read(5, "root:x:0:0:root:/root:/bin/ash\nd"..., 1024) = 530
_llseek(5, -499, [31], SEEK_CUR)        = 0
close(5)                                = 0
munmap(0x77ee8000, 4096)                = 0
open("root", O_RDONLY|O_LARGEFILE)      = 5
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x77ee8000
statx(5, "", AT_STATX_SYNC_AS_STAT|AT_EMPTY_PATH, STATX_BASIC_STATS, {stx_mask=STATX_BASIC_STATS|STATX_MN                                                                                    T_ID, stx_attributes=0, stx_mode=S_IFREG|0644, stx_size=924, ...}) = 0
read(5, "60 25 2000000000 UNIFORM 20 http"..., 1024) = 924
read(5, "", 1024)                       = 0
close(5)                                = 0
munmap(0x77ee8000, 4096)                = 0
getdents64(4, 0x77ee9038 /* 0 entries */, 2048) = 0
close(4)                                = 0
munmap(0x77ee9000, 8192)                = 0
open("/usr/libexec/atlas-probe-scripts/run/perd-main.pid.vol", O_WRONLY|O_CREAT|O_TRUNC|O_LARGEFILE, 0666                                                                                    ) = 4
statx(4, "", AT_STATX_SYNC_AS_STAT|AT_EMPTY_PATH, STATX_BASIC_STATS, {stx_mask=STATX_BASIC_STATS|STATX_MN                                                                                    T_ID, stx_attributes=0, stx_mode=S_IFREG|0644, stx_size=0, ...}) = 0
getpid()                                = 28384
write(4, "28384\n", 6)                  = 6
close(4)                                = 0
nanosleep({tv_sec=10, tv_nsec=0}, {tv_sec=5, tv_nsec=549480575}) = ? ERESTART_RESTARTBLOCK (Interrupted by signal)
--- SIGWINCH {si_signo=SIGWINCH, si_code=SI_KERNEL} ---
restart_syscall(<... resuming interrupted nanosleep ...>) = 0
open("cron.update", O_RDONLY|O_LARGEFILE) = -1 ENOENT (No such file or directory)
unlink("cron.update")                   = -1 ENOENT (No such file or directory)
chdir("/usr/libexec/atlas-probe-scripts/crons/main") = 0
open(".", O_RDONLY|O_LARGEFILE|O_CLOEXEC|O_DIRECTORY) = 4
fcntl64(4, F_SETFD, FD_CLOEXEC)         = 0
mmap2(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x77ee9000
getdents64(4, 0x77ee9068 /* 3 entries */, 2048) = 72
open("/etc/passwd", O_RDONLY|O_LARGEFILE|O_CLOEXEC) = 5
fcntl64(5, F_SETFD, FD_CLOEXEC)         = 0
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x77ee8000
fcntl64(5, F_SETFD, FD_CLOEXEC)         = 0
read(5, "root:x:0:0:root:/root:/bin/ash\nd"..., 1024) = 530
_llseek(5, -499, [31], SEEK_CUR)        = 0
close(5)                                = 0
munmap(0x77ee8000, 4096)                = 0
open("root", O_RDONLY|O_LARGEFILE)      = 5
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x77ee8000
statx(5, "", AT_STATX_SYNC_AS_STAT|AT_EMPTY_PATH, STATX_BASIC_STATS, {stx_mask=STATX_BASIC_STATS|STATX_MNT_ID, stx_attributes=0, stx_mode=S_IFREG|0644, stx_size=924, ...}) = 0
read(5, "60 25 2000000000 UNIFORM 20 http"..., 1024) = 924
read(5, "", 1024)                       = 0
close(5)                                = 0
munmap(0x77ee8000, 4096)                = 0
getdents64(4, 0x77ee9068 /* 0 entries */, 2048) = 0
close(4)                                = 0
munmap(0x77ee9000, 8192)                = 0
nanosleep({tv_sec=17, tv_nsec=0}, 0x7fb11f10) = 0
statx(AT_FDCWD, "/home/atlas/data/out/main", AT_STATX_SYNC_AS_STAT, STATX_BASIC_STATS, 0x7fb10820) = -1 ENOENT (No such file or directory)
open("/home/atlas/data/new/main", O_WRONLY|O_CREAT|O_APPEND|O_LARGEFILE, 0666) = 4
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x77eea000
fcntl64(4, F_GETFL)                     = 0x2009 (flags O_WRONLY|O_APPEND|O_LARGEFILE)
ioctl(4, TIOCGWINSZ, 0x7fb10928)        = -1 ENOTTY (Not a tty)
--- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_MAPERR, si_addr=0x634d6c8c} ---
+++ killed by SIGSEGV +++
Segmentation fault
Ansuel commented 1 year ago

also running

strace ./condmv -A "RESULT 9901 ongoing" /home/atlas/data/new/main /home/atlas/data/out/main

I have

root@Ansuel-Router:/usr/libexec/atlas-probe/bin# strace ./condmv -A "RESULT 9901 ongoing" /home/atlas/data/new/main /home/atlas/data/out/main
execve("./condmv", ["./condmv", "-A", "RESULT 9901 ongoing", "/home/atlas/data/new/main", "/home/atlas/data/out/main"], 0x7fa7d1a0 /* 14 vars */) = 0
set_thread_area(0x77efbdf0)             = 0
set_tid_address(0x77ef369c)             = 28687
brk(NULL)                               = 0x2059000
brk(0x205b000)                          = 0x205b000
mmap2(0x2059000, 4096, PROT_NONE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x2059000
open("/etc/ld-musl-mipsel-sf.path", O_RDONLY|O_LARGEFILE|O_CLOEXEC) = -1 ENOENT (No such file or directory)
open("/lib/libcrypto.so.1.1", O_RDONLY|O_LARGEFILE|O_CLOEXEC) = -1 ENOENT (No such file or directory)
open("/usr/local/lib/libcrypto.so.1.1", O_RDONLY|O_LARGEFILE|O_CLOEXEC) = -1 ENOENT (No such file or directory)
open("/usr/lib/libcrypto.so.1.1", O_RDONLY|O_LARGEFILE|O_CLOEXEC) = 3
fcntl64(3, F_SETFD, FD_CLOEXEC)         = 0
statx(3, "", AT_STATX_SYNC_AS_STAT|AT_EMPTY_PATH, STATX_BASIC_STATS, {stx_mask=STATX_BASIC_STATS|STATX_MNT_ID, stx_attributes=0, stx_mode=S_IFREG|0644, stx_size=2185599, ...}) = 0
read(3, "\177ELF\1\1\1\0\0\0\0\0\0\0\0\0\3\0\10\0\1\0\0\0\0\0\0\0004\0\0\0"..., 936) = 936
mmap2(NULL, 2269184, PROT_READ|PROT_EXEC, MAP_PRIVATE, 3, 0) = 0x77c0e000
mmap2(0x77e1b000, 118784, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED, 3, 0x1fd000) = 0x77e1b000
mmap2(0x77e34000, 16384, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x77e34000
close(3)                                = 0
open("/lib/libssl.so.1.1", O_RDONLY|O_LARGEFILE|O_CLOEXEC) = -1 ENOENT (No such file or directory)
open("/usr/local/lib/libssl.so.1.1", O_RDONLY|O_LARGEFILE|O_CLOEXEC) = -1 ENOENT (No such file or directory)
open("/usr/lib/libssl.so.1.1", O_RDONLY|O_LARGEFILE|O_CLOEXEC) = 3
fcntl64(3, F_SETFD, FD_CLOEXEC)         = 0
statx(3, "", AT_STATX_SYNC_AS_STAT|AT_EMPTY_PATH, STATX_BASIC_STATS, {stx_mask=STATX_BASIC_STATS|STATX_MNT_ID, stx_attributes=0, stx_mode=S_IFREG|0644, stx_size=499355, ...}) = 0
read(3, "\177ELF\1\1\1\0\0\0\0\0\0\0\0\0\3\0\10\0\1\0\0\0\0\0\0\0004\0\0\0"..., 936) = 936
mmap2(NULL, 565248, PROT_READ|PROT_EXEC, MAP_PRIVATE, 3, 0) = 0x77b84000
mmap2(0x77c06000, 32768, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED, 3, 0x72000) = 0x77c06000
close(3)                                = 0
open("/lib/libgcc_s.so.1", O_RDONLY|O_LARGEFILE|O_CLOEXEC) = 3
fcntl64(3, F_SETFD, FD_CLOEXEC)         = 0
statx(3, "", AT_STATX_SYNC_AS_STAT|AT_EMPTY_PATH, STATX_BASIC_STATS, {stx_mask=STATX_BASIC_STATS|STATX_MNT_ID, stx_attributes=0, stx_mode=S_IFREG|0644, stx_size=82199, ...}) = 0
read(3, "\177ELF\1\1\1\0\0\0\0\0\0\0\0\0\3\0\10\0\1\0\0\0\0\0\0\0004\0\0\0"..., 936) = 936
mmap2(NULL, 151552, PROT_READ|PROT_EXEC, MAP_PRIVATE, 3, 0) = 0x77b5f000
mmap2(0x77b82000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED, 3, 0x13000) = 0x77b82000
close(3)                                = 0
mprotect(0x77e1b000, 86016, PROT_READ)  = 0
mprotect(0x77c06000, 20480, PROT_READ)  = 0
mprotect(0x77b82000, 4096, PROT_READ)   = 0
mprotect(0x471000, 4096, PROT_READ)     = 0
getuid()                                = 0
statx(AT_FDCWD, "/home/atlas/data/out/main", AT_STATX_SYNC_AS_STAT, STATX_BASIC_STATS, 0x7fb8de40) = -1 ENOENT (No such file or directory)
open("/home/atlas/data/new/main", O_WRONLY|O_CREAT|O_APPEND|O_LARGEFILE, 0666) = 3
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x77ef0000
fcntl64(3, F_GETFL)                     = 0x2009 (flags O_WRONLY|O_APPEND|O_LARGEFILE)
ioctl(3, TIOCGWINSZ, 0x7fb8df48)        = -1 ENOTTY (Not a tty)
--- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_MAPERR, si_addr=0x634d6eee} ---
+++ killed by SIGSEGV +++
Segmentation fault

why it seems to me a problem with condmv ?

Ansuel commented 1 year ago

Is it normal that the descriptor is 3 on writing file?? open("/home/atlas/data/new/main", O_WRONLY|O_CREAT|O_APPEND|O_LARGEFILE, 0666) = 3

Ansuel commented 1 year ago

Additional info:

the segfault happen when -A is used... without it condmv works correctly and the file is moved.

Ansuel commented 1 year ago

@michel-stam found the cause of the seg fault.... it's in mytime_t that changed type and should be casted with %llu instead of %lu... notice this by checking compilation warning... with this fixed, condmv doesn't crash...

preparing a pr for stable ripe-busybox...


this both fix perd and eperd as perd crash by using condmv from crond and eperd crash cause he has it's own condmv functions internally that are affected by the same problem.

michel-stam commented 1 year ago

All of these commands share the same codebase (a derivative of busybox), so it may be some common code.

As to your second question, the -A seems to add a record to the source file prior to renaming to the second file. That record contains the argument to -A. So this seems to be normal.

I've executed the command on the MIPS probe I have next to me:


... initial crud removed ...
open("/home/atlas/data/new/main", O_WRONLY|O_CREAT|O_APPEND|O_LARGEFILE, 0666) = 3
ioctl(3, TIOCNXCL, 0x7fe41890)          = -1 ENOTTY (Inappropriate ioctl for device)
brk(0x5f3000)                           = 0x5f3000
write(3, "RESULT 9901 ongoing 1666022532 /"..., 57) = 57
close(3)                                = 0
stat64(0x5f1008, 0x7fe418e0)            = 0
rename("/home/atlas/data/new/main", "/home/atlas/data/out/main") = 0
exit_group(0)                           = ?

As you can see the interesting happens after the open. The IOCTL is different. The code can be found here.

The stat makes sense, followed by some ioctl (which I presume is related to the C library), followed by the write (fprintf), followed by a close and a rename. This is uClibc btw.

In the case where it goes wrong, we're using musl, not uClibc. I've looked at the fdopen function in src/stdio/fdopen.c which is one of the few functions that calls TIOCGWINSZ. It is also called from within fopen( ) in src/stdio/fopen.c so that makes sense. Nothing much happens after the ioctl, so I think we can conclude fopen( ) returns.

I do not see a call to sys_write( ) or syscall( ), which I expect to be called from the fprintf( ) function, so my guess is that something goes wrong there. That would be around here, or an equivalent line in the musl libc src/stdio/__stdio_write.c.

So far I haven't seen any reason why this is the case, though.

Ansuel commented 1 year ago

@michel-stam it seems this target is very restrictive on the long int and long long int cast... the real culprit is this... and makes condmv crash... but checking other file we have wrong cast for time_t all over the code.

Now i'm testing this fix and the /home/atlas link and see if there are crashes...

Ansuel commented 1 year ago

Ok yes can confirm that with the custom busybox data are finally sent and nothing crash anymore

andreagdipaolo commented 1 year ago

After replacing the original busybox with the one provided by @Ansuel it's working fine.

Ansuel commented 1 year ago

@michel-stam i posted the pr that fix this and improve other small thing

michel-stam commented 1 year ago

Hi Ansuel,

I saw that the pr converts time to unsigned long long in some cases, and long long in others. Can you cast everything to unsigned long long?

I will take the patch in when reworked and test it here on a couple of platforms as well.

Regards,

Michel

Ansuel commented 1 year ago

@michel-stam sure! hope it's not a problem changing the %lu to %lld in some part... this is why i had time mixed stuff

Ansuel commented 1 year ago

@michel-stam OK i should have replaced the mixed cast everywhere with unsigned long long and %llu

michel-stam commented 1 year ago

Nice work @Ansuel,

I'll pull these the reworked commit into our git repo and start testing on our platforms.

Thanks!

Regards,

Michel

Ansuel commented 1 year ago

@michel-stam btw are these spike normal? image

andreagdipaolo commented 1 year ago

Mine are even crazier 😅

image

robert-kisteleki commented 1 year ago

Please note the graphs show the network traffic for the host, not only the part sent/received by the probe software, so those peaks likely belong to your (or your OS's) activity. A typical probe has some single-digit kbit/s traffic

Ansuel commented 1 year ago

@robert-kisteleki the txrx report is enabled by default just as an extra info. if the entire network traffic is reported then it makes sense to have these spike.