cosmonic-labs / netreap

A Cilium controller implementation for Nomad
https://netreap.io
Apache License 2.0

[BUG] Netreap exits with code 0 when it shouldn't #31

Open DevKhaverko opened 1 year ago

DevKhaverko commented 1 year ago

Describe the bug

After a failure in the node reaper or the endpoint reaper, netreap exits with code 0.

To Reproduce

Steps to reproduce the behavior:

  1. Run netreap as a system job
  2. At some point you may get the error: Got error message from node event channel: {"error" : invalid character 'e' looking for beginning of value"}
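The wording of the message in step 2 matches what Go's encoding/json reports when it is handed input that is not JSON and whose first byte is 'e'. What Nomad actually sent is not shown in this thread, so the payload below is an assumption, and decodeEvent is a hypothetical helper rather than netreap code. A minimal sketch that reproduces the same error text:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// decodeEvent is a hypothetical helper: it tries to parse one chunk of an
// event stream as a JSON object and returns the decoding error, if any.
func decodeEvent(payload []byte) error {
	var event map[string]any
	return json.Unmarshal(payload, &event)
}

func main() {
	// Assumed payload: a plain-text body (not JSON) that starts with 'e'.
	err := decodeEvent([]byte("event stream closed"))
	fmt.Println(err) // invalid character 'e' looking for beginning of value
}
```

If this reading is right, the stream delivered something other than a JSON frame, which would fit a proxy or server writing a plain-text error into the response.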

Expected behavior

Netreap exits with a non-zero code, so Nomad treats the allocation as failed rather than as a normal exit.

Environment (please complete the following information)

If you ran into this issue while developing a feature for Netreap:

v1.13.2

Additional context

I think Nomad sends a malformed event and that this is a Nomad bug, but netreap should still handle it correctly.

deverton-godaddy commented 1 year ago

Can you run with NETREAP_DEBUG set to 1 and see if you can reproduce the error? That should log a lot more context so we can try and track this down.

DevKhaverko commented 1 year ago

NETREAP_DEBUG was set to 1

DevKhaverko commented 1 year ago

The log comes from this line: zap.L().Debug("Got error message from node event channel", zap.Error(events.Err))

DevKhaverko commented 1 year ago

It's hard to debug why Nomad sometimes sends events with an error. Maybe netreap doesn't need to shut down when this error happens? What do you think?
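One way to picture this suggestion is a loop that skips an occasional error event but gives up after several in a row, so a permanently broken stream still ends in a reported failure. This is a sketch under assumptions, not the change proposed for netreap; Event, consume, maxConsecutive and errBadEvent are hypothetical names.

```go
package main

import (
	"errors"
	"fmt"
)

// errBadEvent stands in for a decoding error reported by the stream.
var errBadEvent = errors.New("invalid character 'e' looking for beginning of value")

// Event is a hypothetical, simplified stream message: either an error or a payload.
type Event struct {
	Err     error
	Payload string
}

// consume handles events until the channel closes. Error events are skipped,
// but maxConsecutive errors in a row are treated as a broken stream.
func consume(events <-chan Event, maxConsecutive int, handle func(string)) error {
	consecutive := 0
	for ev := range events {
		if ev.Err != nil {
			consecutive++
			if consecutive >= maxConsecutive {
				return fmt.Errorf("giving up after %d consecutive stream errors: %w", consecutive, ev.Err)
			}
			continue // tolerate an isolated bad event
		}
		consecutive = 0 // a good event resets the counter
		handle(ev.Payload)
	}
	return nil
}

func main() {
	events := make(chan Event, 3)
	events <- Event{Payload: "node-1 registered"}
	events <- Event{Err: errBadEvent}
	events <- Event{Payload: "node-2 registered"}
	close(events)

	err := consume(events, 3, func(p string) { fmt.Println("handled:", p) })
	fmt.Println("result:", err)
}
```

The threshold is a design choice: a value of 1 reproduces the exit-on-first-error behavior, while larger values trade faster failure detection for tolerance of transient errors.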

DevKhaverko commented 1 year ago

I've opened PR https://github.com/cosmonic/netreap/pull/34

deverton commented 9 months ago

What's your experience with this patch? When we tried it we found the reaper gets stuck in an infinite loop since the event stream seems to be broken at that point.

EDIT: Never mind, I think this is just a silly bug on my branch.