Hypfer / Valetudo

Cloud replacement for vacuum robots enabling local-only operation
https://valetudo.cloud
Apache License 2.0
6.38k stars 388 forks source link

Fix Nodejs getaddrinfo segfaults on second call #680

Closed Hypfer closed 3 years ago

Hypfer commented 3 years ago

If you're a C wizard looking to contribute to Valetudo, this issue is for you

The problem

Since we've moved to a statically linked nodejs 14.4.0 base binary, Valetudo (and everything else built with that) segfaults on every second getaddrinfo syscall. This also happens with a statically linked nodejs 12 binary which makes it very likely that the static linking is the culprit here.

Interestingly, everything works fine when doing DNS lookups by talking to the DNS Server directly which is why the segfaulting dns.lookup method is currently monkey-patched to instead call dns.resolve4 which doesn't crash the process.

Context

Valetudo is using pkg to package the js code + the nodejs runtime into a single binary for easy deployment and updates.

After issues appeared regarding different libc versions etc, the project moved from the official dynamically linked pkg base binaries to statically-linked ones. These are hackishly built using my fork of pkg-fetch

Minimal sample application

Attached to this issue is a minimal sample project which demonstrates this issue. There's also a binary inside which you can simply throw on your robot vacuum and observe it crashing:

const dns = require("dns");

console.log("Hello and welcome to my TED Talk");

setInterval(() => {
    dns.lookup("dontvacuum.me", (err, res) => {
        console.log("Lookup complete:", err, res);
    })
}, 1000)
[root@rockrobo ~]# /dev/shm/static_node_dns_segfault_minimal
Hello and welcome to my TED Talk
Lookup complete: null 18.25.131.166
Segmentation fault (core dumped)

static_node_dns_segfault_minimal.zip

Here's another crash log from valetudo running on a dreame vacuum

[   74.765078] valetudo: unhandled page fault (11) at 0x00000062, code 0x017
[   74.765089] pgd = c9004000
[   74.765093] [00000062] *pgd=49ef9835, *pte=00000000, *ppte=00000000
[   74.765109] CPU: 1 PID: 1372 Comm: valetudo Not tainted 4.9.118+ #3
[   74.765112] Hardware name: sun8iw15
[   74.765116] task: c1069180 task.stack: c9e62000
[   74.765123] PC is at 0xb67629e4
[   74.765126] LR is at 0xb67629cc
[   74.765132] pc : [<b67629e4>]    lr : [<b67629cc>]    psr: 600f0010
[   74.765132] sp : b37fd448  ip : b6778060  fp : b37fd878
[   74.765136] r10: b37fe814  r9 : b37fd878  r8 : b37fd878
[   74.765140] r7 : 00000002  r6 : b37fd878  r5 : 000003e8  r4 : b37fd860
[   74.765144] r3 : b37fd879  r2 : 00000062  r1 : 00000031  r0 : 00000000
[   74.765149] Flags: nZCv  IRQs on  FIQs on  Mode USER_32  ISA ARM  Segment user
[   74.765154] Control: 10c5387d  Table: 4900406a  DAC: 00000055
[   74.765161] CPU: 1 PID: 1372 Comm: valetudo Not tainted 4.9.118+ #3
[   74.765164] Hardware name: sun8iw15
[   74.765186] [<c010ec90>] (unwind_backtrace) from [<c010af08>] (show_stack+0x10/0x14)
[   74.765197] [<c010af08>] (show_stack) from [<c02ca308>] (dump_stack+0x70/0x8c)
[   74.765208] [<c02ca308>] (dump_stack) from [<c01110e0>] (__do_user_fault+0x44/0x88)
[   74.765217] [<c01110e0>] (__do_user_fault) from [<c01113a8>] (do_page_fault+0x284/0x2a8)
[   74.765226] [<c01113a8>] (do_page_fault) from [<c01012b8>] (do_DataAbort+0x38/0xb8)
[   74.765233] [<c01012b8>] (do_DataAbort) from [<c010bbdc>] (__dabt_usr+0x3c/0x40)
[   74.765237] Exception stack(0xc9e63fb0 to 0xc9e63ff8)
[   74.765243] 3fa0:                                     00000000 00000031 00000062 b37fd879
[   74.765250] 3fc0: b37fd860 000003e8 b37fd878 00000002 b37fd878 b37fd878 b37fe814 b37fd878
[   74.765256] 3fe0: b6778060 b37fd448 b67629cc b67629e4 600f0010 ffffffff

If there's more information I could provide to fix this, please tell me. As JS is all I do, this problem is far beyond me

ccoors commented 3 years ago

After some debugging I'd say it looks like this bug from 2009: https://sourceware.org/bugzilla/show_bug.cgi?id=10652

I was able to reproduce it on x86_64.

Thread 7 "valetudo" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffff57c2640 (LWP 422382)]
0x00007ffff46bb414 in __nss_readline () from /usr/lib/libc.so.6
(gdb) bt
#0  0x00007ffff46bb414 in __nss_readline () from /usr/lib/libc.so.6
#1  0x00007ffff57e44cd in ?? () from /usr/lib/libnss_files.so.2
#2  0x00007ffff57e56f4 in _nss_files_gethostbyname4_r () from /usr/lib/libnss_files.so.2
#3  0x00000000016eb022 in gaih_inet.constprop ()
#4  0x00000000016ebdd9 in getaddrinfo ()
#5  0x0000000000ebc381 in uv__getaddrinfo_work (w=0x21b7b60) at ../deps/uv/src/unix/getaddrinfo.c:106
#6  0x0000000000eb08b9 in worker (arg=0x0) at ../deps/uv/src/threadpool.c:122
#7  0x00000000016507c9 in start_thread ()
#8  0x00000000016f1723 in clone ()
(gdb) 

There will probably not be a fix for that any time soon. I tried building nodejs with musl, which I couldn't get to work in a reasonable amount of time. Nodejs only has experimental support for musl on x86.

Maybe there's some other way to do DNS lookups?

Hypfer commented 3 years ago

Oh. Well at least we now know what is going on there and it's not our fault so thats something.

Maybe there's some other way to do DNS lookups?

There is! https://github.com/Hypfer/Valetudo/blob/256b9076a3727e3f85bfc8412e296e8009a42052/lib/DnsHack.js

That's what this is doing. I just hoped that there was a better fix than monkey patching core libs.

ccoors commented 3 years ago

There is!

I meant something less hacky. The only option I see would be parsing nslookup output, but that feels even more hacky.