Intermittent 'unable to open supervise/ok: file does not exist' errors

gshively11 commented 9 years ago

I was using runit 1.7.2 and recently started getting 'unable to open supervise/ok: file does not exist' errors when the chef-client run reached a step in the logstash cookbook that tried to restart a logstash service configured to use runit. I'm not sure whether these errors started appearing when chef-client was updated to 12.4.

I'm now pinned at 1.5.18 and haven't seen the error anymore.

Running on CentOS 6.6

justizin commented 9 years ago

I'm pretty sure this is a race condition. I see it on Ubuntu, Debian, and CentOS, and I illustrate a minimal case with Ubuntu in detail below.

I'm also having this problem with both the zookeeper and kafka cookbooks. I thought I had solved it by freezing to 1.6.0, but that version also seems to hit this, if slightly less frequently.

I have a wrapper cookbook which spins up a triple of zookeeper / kafka in Vagrant at:

https://github.com/bitmonk/bitmonk_kafka

I also produced a minimal cookbook / vagrant / kitchen environment to illustrate the minimal case at:

https://github.com/bitmonk/bitmonk_runit_wtf/

The output of 'kitchen test' on my mac is in a gist at:

https://gist.github.com/bitmonk/c40729ba454aa94c44b7

I also ran a repeat 'vagrant provision' after the 'vagrant up' failed, which is how I get my zookeeper / kafka triple up, and captured the output in a gist. The command follows:

Justin-Ryans-MacBook-Pro:bitmonk_runit_wtf juryan$ while ! vagrant up; do
    vagrant provision; done 2>&1 | tee ~/runit_wtf.txt

https://gist.github.com/bitmonk/f7bd564f47495a23ad33
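The shell loop above retries without any upper bound. As a rough sketch only, the same idea with a cap on attempts could look like this in Ruby. The helper name `retry_until_success` is hypothetical and not part of Vagrant or any cookbook.

```ruby
# Hypothetical helper: call the block until it returns a truthy value,
# at most max_attempts times. Returns the number of the attempt that
# succeeded, or nil if every attempt failed.
def retry_until_success(max_attempts)
  1.upto(max_attempts) do |attempt|
    return attempt if yield(attempt)
  end
  nil
end
```

It could be used along the lines of `retry_until_success(5) { |n| system(n == 1 ? 'vagrant up' : 'vagrant provision') }`, which loosely mirrors the loop above while giving up after five tries.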

ashmckenzie commented 9 years ago

As a workaround for now, the following works:

runit_service "myservice" do
  # ...
  # Workaround: sleep before every sv call, which seems to give supervise/ok time to appear
  sv_bin 'sleep 5 && /usr/bin/sv'
  action :enable
end
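If the cause really is a race on the supervise/ok file, as suggested earlier in this thread, then polling for that file is one alternative to a fixed sleep. The following is a minimal sketch under that assumption. The helper name `wait_for_supervise_ok` and the example path are hypothetical, and this is not presented as what the cookbook itself does.

```ruby
# Hypothetical helper, not part of the runit cookbook: wait until
# <service_dir>/supervise/ok exists, or give up after `timeout` seconds.
# Returns true if the file appeared in time, false otherwise.
def wait_for_supervise_ok(service_dir, timeout: 10, interval: 0.1)
  ok_path = File.join(service_dir, 'supervise', 'ok')
  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + timeout
  until File.exist?(ok_path)
    return false if Process.clock_gettime(Process::CLOCK_MONOTONIC) >= deadline
    sleep interval
  end
  true
end
```

For example, `wait_for_supervise_ok('/etc/service/myservice')` would return as soon as the file shows up instead of always paying the full five seconds. The service directory here is an assumed location.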

justizin commented 9 years ago

I thought about something like that, but I have a feeling it isn't going to be accepted in upstream community cookbooks. :/

ashmckenzie commented 9 years ago

Haha you'd be surprised! ;)

I've only actually seen this hit us hard when doing our integration tests. I have the sv_bin command wrapped in an 'if testing' block so it's not all that bad, but it would be nice to get rid of it...
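The actual 'if testing' block isn't shown in this thread. A guess at its shape, with a hypothetical helper name and keyword arguments, might be:

```ruby
# Hypothetical helper: return the delayed sv command only for test runs,
# and the plain sv path otherwise.
def sv_bin_for(testing:, sv_path: '/usr/bin/sv', delay: 5)
  testing ? "sleep #{delay} && #{sv_path}" : sv_path
end
```

The result could then be passed to sv_bin in the runit_service resource, so that the delay never reaches non-test nodes. How the testing flag gets set (a node attribute, an environment variable, and so on) is left open here.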

cwjohnston commented 9 years ago

@gshively11 @ashmckenzie @bitmonk can you please try reproducing this issue with the current code in the develop branch? I expect this issue is fixed in that pre-release branch.

ashmckenzie commented 9 years ago

@cwjohnston I just re-ran our tests with the develop branch and success! Nice work :)

justizin commented 9 years ago

@ashmckenzie - this has been happening on community zookeeper and kafka cookbooks during 'vagrant up', which is particularly harsh when you're spinning up a triple. :)

@cwjohnston - Thanks, I'll give this a shot today! My example case fails on CentOS, but succeeds on Ubuntu - I'll try this with my actual zookeeper / kafka cookbooks today.

justizin commented 9 years ago

@cwjohnston - it's looking pretty good. Is there any reason you prefer to let the current release of 'runit' supersede what's in the 'develop' branch?

I'm trying to get dependent changes in upstream branches, and I prefer that, say, 'depends "runit"' in metadata.rb of a community cookbook be sufficient to pass tests on Ubuntu, Debian, and CentOS.

My kafka and zookeeper stuff are looking pretty OK right now.

Do you have any concerns about regression for other platforms or configurations?

My impression is basically that this code is not a problem for running systems. It just makes me run 'vagrant provision' a lot, which is a problem for sending this out to my team to encourage them to start logging to kafka.

I worked on one of the oldest legacy alpha sites for Chef at Wikia, so I understand why these problems occur and why fixes are difficult to propagate. I don't want to break anyone's staging or production to fix my Vagrant setup, but I'm trying to see to it that we all have all of those things working.

Let me know if there's any way I can help to move this along! I'm happy to help cut releases.

maraca commented 9 years ago

+1! Any chance to get this released?

punnie commented 9 years ago

Pinned at 1.5.18 as well. Using Ubuntu 14.04. Been hunting this bug for quite a while. Any chance to get the fix released?

Thanks for all the work here! Cheers! :wink:

xmik commented 9 years ago

@punnie, could you try using my branch from https://github.com/hw-cookbooks/runit/pull/148 ? I've been using it for a while in docker and it works fine. Or do you mean that you use some other solution from this thread and want to have it released?

justizin commented 9 years ago

I'm still pinned to the 'develop' branch, so would be nice to see what's in there released to supermarket with a version bump.

fishnix commented 9 years ago

:+1: Seeing this in 1.7.2, would love to see a release so I don't have to pin a SHA
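For anyone in the same position, the pinning options mentioned in this thread could be written in a Berksfile roughly as follows. This is a sketch: the git URL is inferred from the pull request link above and may not match where the repository lives now, and the commit SHA is a placeholder rather than a real ref.

```ruby
# Berksfile (sketch)
source 'https://supermarket.chef.io'

# Option 1: pin to the release reported as unaffected in this thread.
cookbook 'runit', '= 1.5.18'

# Option 2 (instead of option 1): track the unreleased fix on develop.
# cookbook 'runit', git: 'https://github.com/hw-cookbooks/runit.git', branch: 'develop'

# Option 3 (instead of option 1): pin an exact commit. Placeholder SHA.
# cookbook 'runit', git: 'https://github.com/hw-cookbooks/runit.git', ref: '<commit-sha>'
```

Only one of the three cookbook lines should be active at a time.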

vitalis commented 9 years ago

:+1: A release with the fix would be really great

justizin commented 9 years ago

@cwjohnston would love to offer my time to see a fresh release of the runit cookbook, esp since this issue is now pinned to issues in several other community cookbooks and seems to fairly universally impact folks spinning up new nodes with the latest release cookbook.

cwjohnston commented 9 years ago

Hi folks. @chrisroberts and I have scheduled time this week to work on these and other outstanding issues, with the aim of cutting a new release from develop very soon. Thanks for your patience!

cullenmcdermott commented 9 years ago

@cwjohnston has there been any movement on this?

TimSimmons commented 9 years ago

+1 Please release :)

itstayyab commented 9 years ago

+1

ghost commented 9 years ago

+1

punnie commented 9 years ago

@xmik thanks for the suggestion. I'm currently using version 1.5.x, which appears not to have this problem, or at least works for me™.

If you want I may check that out at a later date and report back, but for now I really need some development speed, and my current solution works. :wink:

xmik commented 9 years ago

@punnie, no rush as I am not in this context right now, but I would appreciate some feedback. I'm still using that branch and it works for me.

cwjohnston commented 9 years ago

Fix in #138, released in v1.7.4

patrickcat commented 7 years ago

This is still happening on Ubuntu 16 and 14, with Chef 12.14.x and 12.3.1.x. Why is its status 'fixed' here? Here is what I did:

download the chef server img
dpkg -i chef*.deb
chef-server-ctl reconfigure, and then the error happens. It's really unfriendly.

Running handlers:
[2017-04-27T04:48:29+00:00] ERROR: Running exception handlers
Running handlers complete
[2017-04-27T04:48:29+00:00] ERROR: Exception handlers complete
Chef Client failed. 2 resources updated in 01 minutes 23 seconds
[2017-04-27T04:48:29+00:00] FATAL: Stacktrace dumped to /opt/opscode/embedded/cookbooks/cache/chef-stacktrace.out
[2017-04-27T04:48:30+00:00] FATAL: Mixlib::ShellOut::ShellCommandFailed: execute[/opt/opscode/bin/private-chef-ctl start rabbitmq] (private-chef::rabbitmq line 105) had an error: Mixlib::ShellOut::ShellCommandFailed: Expected process to exit with [0], but received '1'
---- Begin output of /opt/opscode/bin/private-chef-ctl start rabbitmq ----
STDOUT: warning: rabbitmq: unable to open supervise/ok: file does not exist
STDERR:
---- End output of /opt/opscode/bin/private-chef-ctl start rabbitmq ----
Ran /opt/opscode/bin/private-chef-ctl start rabbitmq returned 1

chef-cookbooks / runit

Intermittent 'unable to open supervise/ok: file does not exist' errors #142