TritonDataCenter / node-manta

Node.js SDK for Manta

mlogin crashed: "AssertionError: value" #316

Open davepacheco opened 7 years ago

davepacheco commented 7 years ago

@pfmooney was using thoth debug to look at a kernel crash dump and ran into:

> ::stacks
mlogin: AssertionError: value
thoth: debugger exited with code 1
thoth: AssertionError: value

The job looks like it was this one (account uuid elided):

[root@5154ade9 (ops) ~]$ mrjob get b72d5cd2-e264-4b34-c4f7-f0ae7b87a852
       Job b72d5cd2-e264-4b34-c4f7-f0ae7b87a852
  Job name interactive compute job
      User patrick.mooney (...)
     State done
Supervisor eff884b4-f678-4069-8522-1bbe2e4fcb90
   Created 2017-07-28T17:28:08.900Z (4h01m31.216s ago)
      Done 2017-07-28T19:29:23.747Z (2h01m14.847s total)
  Archived 2017-07-28T19:29:26.897Z (2h00m13.220s ago)
  Progress 1 inputs read, 1 tasks dispatched, 1 tasks committed
   Results 1 outputs, 0 errors, 0 retries
   Pending 0 uncommitted done, 0 intermediate objects
   Phase 0 map

The job produced no error. The tail of the log was:

[2017-07-28T17:28:17.972Z]  INFO: medusa-agent/63636 on 9c1566b9-9f41-47f5-a108-d0ed79de08a6: started child process
[2017-07-28T19:29:19.402Z]  INFO: medusa-agent/63579 on 9c1566b9-9f41-47f5-a108-d0ed79de08a6: connection ended (code=NORMAL, reason="remote connection reset")

I think that's consistent with mlogin on the client having blown an assertion in the middle of the session, but that means we've got very little to go on. I can't figure out where this assertion (or even its message) might come from: I don't see obvious candidates in bin/mlogin or in the Medusa-related code in lib/client.js. It also doesn't seem reproducible. Patrick and I were both able to re-run thoth debug on that dump and run ::stacks, and it completed normally both times (which I think rules out something about the output of ::stacks). I've also tried sending fatal signals like SIGABRT to mdb from within the job, but that causes the session to terminate (mostly) gracefully.

Ideally, we'd like to catch this with DEBUG=1 in the environment or --abort-on-uncaught-exception enabled.