TritonDataCenter / manta-thoth

Thoth is a Manta-based system for core and crash dump management

TOOLS-2440 need a job-free thoth #181

Closed (jlevon closed 4 years ago)

jlevon commented 4 years ago

Hi all, I think this is ready for review. I haven't added testing notes yet, so please let me know if and when you'd like to review those. I'd also appreciate it if people could try this out on the client side:

npm install joyent/manta-thoth#TOOLS-2440

Note the README changes for the user-visible parts (mainly that crash-dump upload now needs a post-process step).

I have also used sdc-thoth-install on my lab rig to set that up, and checked that both the HN and my CN are uploading OK on the regular cron.

jlevon commented 4 years ago

I'd prefer not to add license headers as part of this, at least. It seems common in our node.js repos not to have the license in every file (e.g. node-manta is like this).

jlevon commented 4 years ago

I made a couple of additional fixes on top of code review comments. On my test system I hit a core dump with a stack too long for rethinkdb to index the JSON, which uncovered a couple of issues with the existing error handling.

I think this is ready for another code review pass. Thanks.

jlevon commented 4 years ago

For some reason I can't reply to your exact comment, but I'm not planning to remove the --abort flag right now. AIUI it doesn't break Mac, and I see no reason to try to support Linux explicitly.

jlevon commented 4 years ago

Is that clearer @bahamat ?

jlevon commented 4 years ago

Removed the last two FIXMEs. Don't run "make publish", as that will now overwrite the latest mainline thoth (it's not ideal...).

I can't test jobs atm as they seem to be down (no servers available).

jlevon commented 4 years ago

Re the MDB version thing: you seem to be running on a PI pre-dating d70f65dfb86dedc271c6eacf5767889026db880c (April 2019). In general that's not going to work out too well, especially for crash dumps. Just a particular downside of local debugging. Another reason for a shared debugging host.

Having said that, MDB module API version changes are rare.

jlevon commented 4 years ago

Thanks Trent, I took your patches and fixed up the other thing.

trentm commented 4 years ago

"you seem to be running on a PI pre-dating d70f65dfb86dedc271c6eacf5767889026db880c (April 2019)."

[root@5d4f7599-a991-6b35-dd44-d91936957a6b ~]# uname -v
joyent_20181206T011455Z

I guess this is my fault for hacking my builder zone onto the staging-1 headnode where there was capacity. That headnode happens to be the Triton server being used to test "min_platform" compatibility of components. Boo.
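As a rough illustration of the platform-age check discussed above, here is a minimal sketch that parses the `uname -v` stamp and compares it with a cutoff date. The helper names are hypothetical and not part of thoth, and the cutoff of 2019-04-01 is an assumption: the thread only says "April 2019" for that commit.

```javascript
// Hypothetical helpers, not part of thoth.

// Parse a platform stamp such as "joyent_20181206T011455Z" into a Date (UTC).
// Returns null if the string does not look like a platform stamp.
function parsePlatformStamp(unameV) {
    const m = /^joyent_(\d{4})(\d{2})(\d{2})T(\d{2})(\d{2})(\d{2})Z$/
        .exec(String(unameV).trim());
    if (!m) {
        return null;
    }
    return new Date(Date.UTC(+m[1], +m[2] - 1, +m[3], +m[4], +m[5], +m[6]));
}

// Assumed cutoff: the thread only gives "April 2019", so the first of the
// month is used here purely for illustration.
const ASSUMED_CUTOFF = new Date(Date.UTC(2019, 3, 1));

// True if the stamp parses and is older than the assumed cutoff.
function predatesCutoff(unameV) {
    const built = parsePlatformStamp(unameV);
    return built !== null && built.getTime() < ASSUMED_CUTOFF.getTime();
}
```

With the stamp from the output above, `predatesCutoff('joyent_20181206T011455Z')` is true, which matches the observation that this platform image is older than the change in question.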

trentm commented 4 years ago

I don't think this need hold up review, but I get a "spawn failed with return code ..." error when pressing Ctrl+D to exit thoth debug ... E.g.:

[root@trent-builder-1940-x86_64-20200221T185911 ~/joy/manta-thoth]# THOTH_NO_JOBS=1 thoth debug 6d0476cee1215862
thoth: downloading core.node.270503 to local cache
thoth: core.node.270503                  [=====================================================================================>] 100%   1.96GB  13.82MB/s  2m25s
mdb_v8 version: 1.4.1 (release, from 0cd139c)
V8 version: 4.5.103.53
Autoconfigured V8 support from target
C++ symbol demangling enabled
> ::jsstack
native: libc.so.1`_lwp_kill+0x15
native: libc.so.1`raise+0x2b
native: libc.so.1`abort+0x10e
native: libstdc++.so.6`__gnu_cxx::__verbose_terminate_handler+0x185
native: libstdc++.so.6`__cxxabiv1::__terminate+0x17
native: libstdc++.so.6`__cxxabiv1::__unexpected
        (1 internal frame elided)
        (1 internal frame elided)
native: libstdc++.so.6`operator new[]+0x1a
native: int node::StreamBase::WriteString<+0x18f
native: void node::StreamBase::JSMethod<node::StreamWrap, &+0xaa
        (1 internal frame elided)
js:     createWriteReq
js:     <anonymous> (as Socket._writeGeneric)
js:     writeOrBuffer
js:     <anonymous> (as OutgoingMessage._writeRaw)
js:     <anonymous> (as OutgoingMessage._send)
js:     <anonymous> (as OutgoingMessage.write)
        (1 internal frame elided)
js:     _cb
js:     formatText
js:     format
js:     send
        (1 internal frame elided)
js:     _sendMetrics
js:     _afterGetMetrics
        (1 internal frame elided)
js:     <anonymous> (as next)
        (1 internal frame elided)
js:     _onGetInfo
js:     getZoneInfo
js:     _confirmSanity
js:     <anonymous> (as <anon>)
js:     processImmediate
        (1 internal frame elided)
        (1 internal frame elided)
native: v8::internal::Execution::Call+0xff
native: v8::Function::Call+0xd7
native: v8::Function::Call+0x3c
native: node::MakeCallback+0xfa
native: node::CheckImmediate+0xa2
native: uv__run_check+0x74
native: uv_run+0x12f
native: node::Start+0x59d
native: main+0x42
native: _start+0x83
>

dump file kept: /var/tmp/thoth/cache/6d0476cee1215862048b12dec9eb3636/core.node.270503
thoth: Error: spawn failed with return code 134
    at ChildProcess.<anonymous> (/root/joy/manta-thoth/bin/thoth:2004:9)
    at ChildProcess.emit (events.js:311:20)
    at maybeClose (internal/child_process.js:1021:16)
    at Process.ChildProcess._handle.onexit (internal/child_process.js:286:5)

I don't get that error exit code if I just press Ctrl+D without running any dcmds. If I exit the mdb shell after running ::jsstack or ::stack (or, I imagine, other commands), then I get that error exit.
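For context on the number itself: by common shell convention, an exit status above 128 means the process was terminated by signal (status - 128), so 134 would correspond to signal 6, SIGABRT. The sketch below shows how a Node.js caller might describe such an exit. `describeExit` is a hypothetical helper, not thoth's actual code, and whether thoth sees a shell-style status or a signal name depends on how it spawns mdb, which is assumed here rather than known.

```javascript
const os = require('os');

// Hypothetical helper, not part of thoth: turn the (code, signal) pair that
// Node's child_process 'exit' event provides into a readable description.
function describeExit(code, signal) {
    if (signal) {
        // Node reports a signal name when the child itself was killed by it.
        return 'killed by ' + signal;
    }
    if (code === 0) {
        return 'exited cleanly';
    }
    if (code > 128) {
        // Shell convention (an assumption about the caller's setup):
        // status = 128 + signal number.
        const signo = code - 128;
        const name = Object.keys(os.constants.signals).find(function (k) {
            return os.constants.signals[k] === signo;
        });
        return 'exit code ' + code + ' (likely ' +
            (name || 'signal ' + signo) + ')';
    }
    return 'exit code ' + code;
}
```

A caller could use something like this to report an abort in the debugger as a message rather than an unhandled error, since the debugging session itself had already finished by that point.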

jlevon commented 4 years ago

Looks like you found an mdb or v8 bug:

$ mdb ~/core.node.270503 
Loading modules: [ libumem.so.1 libc.so.1 libnvpair.so.1 ld.so.1 ]
> ::load /home/gk/bad.v8.so
mdb_v8 version: 1.4.1 (release, from 0cd139c)
V8 version: 4.5.103.53
Autoconfigured V8 support from target
C++ symbol demangling enabled
> ::stack
libc.so.1`_lwp_kill+0x15(1, 6, 2df, fe9c5000, fe9c5000, 1)
libc.so.1`raise+0x2b(6)
libc.so.1`abort+0x10e()
libstdc++.so.6`__gnu_cxx::__verbose_terminate_handler+0x185(feea273b, feed5a9c, feea355b, feed5a9c, 9460d70, 94d9ac8)
libstdc++.so.6`__cxxabiv1::__terminate+0x17(feea6680, 1, 8043318, feea35f7, feea35e9, feed5a9c)
libstdc++.so.6`__cxxabiv1::__unexpected(9460d70, fef00600, 8043338, feea387f, feed5a9c, 34a4a1)
0xfeea38ae(9460d90, feef2620, feea1170, feea4e75, feed5a9c, 9727cf0)
0xfeea4ebc(34a4a1, 0, 0, 8047438)
libstdc++.so.6`operator new[]+0x1a(34a4a1, 8047480, 1, b9df4745, b, b9df4675)
int node::StreamBase::WriteString<+0x18f(9727cf0, 8047438, 947f000, 0, 8047488, 92fd55f2)
void node::StreamBase::JSMethod<node::StreamWrap, &+0xaa(8047438, 8f808255, 8047464, 8047484, 2, 0)
0x92f60a34(fd4e2049, 943c008, 8f808099, 8f808099, 8f86c7dd, 95dd5acd)
0x92fcd5e3(8f8651d5, b9df5701, fd4e2049, b9df573d, 8f808099, fce960f1)
0x92f806ef(fd4ad12d, 8f8651d5, b9df5701, 8f808231, fd4e2c3d, fd4ad12d)
0x92fcd977(ac1a8549, 8f8651d5, b9df5701, fd4eaa75, fd4e2c3d, 8f808099)
0x92f1d34d(8f808099, 8f808099, b9df5701, fd4ed42d, b9df5701, b9df56ed)
0xa7641a82(8f808099, 8f808099, 82208081, fd4ed42d, 95dedcb1, b656bd05)
0xa76478ba(8f808099, 8f808099, 82208081, fd4ed42d, 2, 95dedcb1)
0x8061a143(82208081, fd4ed42d, 95dedcb1, b9df4dfd, b9df4dd1, 804764c)
0x92f79652(82208081, 8f808089, 8f808099, 82208081, 8f824719, 0)
0xa7630743(b9df4dfd, 82208081, fd4ed42d, fd4e2cf5, fd4ed42d, a4e4110d)
0xa763132b(b9df4dfd, 82208081, fd4ed42d, 82208081, b65b1d4d, b9df4dfd)
0xa762ff65(8f808099, 8f808099, 82208081, fd4ed42d, 2, b65b1d4d)
0x8061a143(82208081, fd4ed42d, b65b1d4d, fd4f4a71, a4e87bb1, fd4f4a71)
0x92f78251(82208081, 8f808089, 8f808099, fd4f4a71, fd4f4d95, fd4f4d95)
0x92f77f24(8f808089, 8f808099, 4, fd4f4d95, 14, 8047768)
0x8061a143(fd4f4de5, 8f808089, 8f808099, fd4f4d95, fd4f67e5, fd4f49d9)
0x92f0fc5c(8f808099, 8f808089, 8f808099, 2, fd4f49d9, 14)
0x8061a143(8f808089, 8f808099, fd4f49d9, b9df488d, b9df488d, b9df486d)
0x92f77e64(b9df48b1, 8f808089, 8f808099, b9df488d, a4ea282d, b9df4921)
0x92f470d4(b9df488d, fcefaa21, abe0eee9, a4ea282d, fd4f66a5, fd4f66a5)
0x92f77bf8(fd4f49d9, fd4f4dc9, fd4f67e5, fd4f66a5, b9df47dd, b9df47ad)
0x92f10f8a(b9df4831, b9df47dd, ac1673b9, 8f808211, b9df4831, 8f808099)
0x92f10adf(8f86c735, ac1a7385, 8061a921, 10, 0, 8047898)
0x8061a9e1(0, 0, 2, 0, 947e010, 943c008)
0x8061999f(92f10660, ac1a7385, 8f86c735, 0, 0, 943c008)
v8::internal::Execution::Call+0xff(804794c, 943c008, 947e010, 9496cd8, 0, 0)
v8::Function::Call+0xd7(80479cc, 947e010, 947e028, 9496cd8, 0, 0)
v8::Function::Call+0x3c(8047a38, 947e010, 9496cd8, 0, 0, 943c008)
node::MakeCallback+0xfa(8047abc, 94d9ac8, 9496cd8, 947e010, 0, 0)
node::CheckImmediate+0xa2(94d9ad0, 0, 838c6bf, 17c5, 9434f98, 1dc859d6)
uv__run_check+0x74(9366e40, 0, 4, 0, 9366f3c, 9366e50)
uv_run+0x12f(9366e40, 1, fe956890, 8335a3e)
node::Start+0x59d(2, 8047cfc, 4, 400, 9355ff4, 8047ca8)
main+0x42(8047cbc, fe9d2388, 8047cf0, 8314543, 3, 8047cfc)
_start+0x83(3, 8047df0, 8047e4c, 8047e4c, 0, 8047e8b)
> $q

Abort (core dumped)
$ pfexec pstack $(ls -rt /cores/core.mdb* | tail -1 )
core '/cores/core.mdb.880473' of 880473:    mdb /home/gk/core.node.270503
 fee5d1b3 syscall  (3, fed52bcc, 0, fedec75a, fee2fc79, feffca40) + 13
 fee49db8 thr_sigsetmask (2, 8045310, 0) + 1f2
 fee49e33 sigprocmask (2, 8045310, 0) + 40
 fee2fce1 sigrelse (6) + 68
 fea743bc umem_do_abort () + 38
 fea7448e __umem_assert_failed (fea80e7d, fea810f7, 816d688)
 fea76a2c process_free (816d688, 1, 0) + 74
 fea76d93 umem_malloc_free (816d688) + 1a
 0809ecdf mdb_free (816d688, 38) + 20
 080973d8 strfree  (816d688) + 1d
 08084cf3 mdb_module_remove_walker (8164b70, 88200bd) + 65
 08084e45 mdb_module_unload_common (81ea440) + 125
 080853d1 mdb_module_unload (81ea440, 2) + e
 08084f5e mdb_module_unload_all (2) + 20
 0806802a mdb_destroy () + 35
 08081260 terminate (0) + 10
 080827a6 main     (80461fc, feed7528, 8046238) + 1177
 080646f7 _start_crt (2, 8046268, fefd0094, 0, 0, 0) + 96
 080645ca _start   (2, 8046538, 804653c, 0, 8046556, 8046562) + 1a

jlevon commented 4 years ago

Hi all, please take a look at the last change too: with it, we can now update directly from an older thoth. I tested this on my rig.