Closed holgero closed 7 years ago
@holgero if you run monit reload on the director node, does it put the state back to running? There is potentially a monit problem that makes it confused about the state.
I didn't try monit reload, but after I did a sv restart monit, monit indeed reported the state of all jobs (including the director) as running.
monit restart all or monit restart director also fixed the problem. However, the question remains: why did the director start take more than 30s?
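To make the check described above repeatable, here is a minimal sketch that filters monit summary output for processes whose reported state is not running. The helper name monit_not_running is hypothetical, and the sketch assumes summary lines of the form Process 'name' followed by the state, which may differ between monit versions.

```shell
# Hypothetical helper: reads `monit summary` output on stdin and prints
# "<name>: <state>" for every process whose state is not "running".
# Assumption: process lines look like
#   Process 'director'                  Execution failed
monit_not_running() {
  awk -F"'" '/^Process / {
    state = $3
    sub(/^[ \t]+/, "", state)   # trim leading whitespace
    sub(/[ \t]+$/, "", state)   # trim trailing whitespace
    if (state != "running") print $2 ": " state
  }'
}

# Possible usage on the director VM (commands taken from this thread):
#   monit summary | monit_not_running
#   monit reload                # or: monit restart director
#   monit summary | monit_not_running
```

If the second call prints nothing, monit has picked up the real state of the jobs again.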
I think I've seen the same behavior installing a BOSH Director via bosh-init to a t2.nano instance on AWS. I can try to replicate.
Update 6/2/2016: I tried deploying several times and was unable to replicate the failure (i.e. bosh-init deploy succeeded every time).
Current suspect: creation of certificates in the director_nginx takes too long due to too little entropy: https://github.com/cloudfoundry/bosh/blob/master/release/jobs/director/templates/nginx_ctl#L31-L36
Closing - feel free to reopen if you have more information or tested the entropy theory.
Just verified that this has nothing to do with entropy: even with user-provided certificates the Director job fails. monit reload works; afterwards the Director job is shown as running.
Increasing the VM size didn't work, and decreasing the number of workers also didn't work. Any further ideas, @cppforlife?
After debugging with @cppforlife we've most likely identified the culprit: /var/vcap/store/director, which contains the log, debug log, and result for all tasks executed on a director. The director job is reported as failed, although it eventually comes up correctly.
Solution: move creation and ownership changes to pre-start, which may take as long as it wants. This is already done by other releases, such as consul-release.
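As an illustration of the proposed solution, the following is a minimal sketch of what the directory handling in a pre-start script could look like. It is not the actual change made to the BOSH release. The function name prepare_store_dir is hypothetical, and the vcap:vcap owner is taken from the test report later in this thread.

```shell
# Hypothetical helper: create a directory (if missing) and hand it,
# recursively, to the given owner. On a store directory holding many
# task results the recursive chown can take a long time, which is why
# it is moved out of the monit-supervised start script.
prepare_store_dir() {
  dir="$1"
  owner="$2"
  mkdir -p "$dir" && chown -R "$owner" "$dir"
}

# In a pre-start script this might be invoked as (assumption):
#   prepare_store_dir /var/vcap/store/director vcap:vcap
```

The start script itself would then no longer touch ownership of the store directory, so the time monit waits for the director covers only the process start.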
Story in our backlog to get it fixed: https://www.pivotaltracker.com/story/show/136281459
@cppforlife @tylerschultz We just pushed a commit to develop to fix this.
We manually tested this during an update process. A file which we gave different chmod attributes before updating the director had vcap:vcap after the update. As chmod does not print any output to stdout, the pre-start.stdout.log is empty. Prior to the change, the chmod action didn't write anything into director.stdout.log either.
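Since the logs stay empty, the result has to be checked on the file system itself. A minimal sketch of such a check, assuming the expected owner is vcap:vcap as reported above; the helper name list_wrong_owner is hypothetical:

```shell
# Hypothetical helper: print every path under a directory that is NOT
# owned by the expected user and group. Empty output means the whole
# tree has the expected ownership.
list_wrong_owner() {
  dir="$1"
  user="$2"
  group="$3"
  find "$dir" \( ! -user "$user" -o ! -group "$group" \) -print
}

# Possible usage on the director VM after an update (assumption):
#   list_wrong_owner /var/vcap/store/director vcap vcap
```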
Could you create a new BOSH release from develop or create a hotfix with these changes? After that, the issue can be closed.
We'll create a 260.1 today.
On Dec 19, 2016, at 6:58 AM, Tom Kiemes notifications@github.com wrote:
@cppforlife @tylerschultz We just pushed a commit to develop to fix this.
We manually tested this during an update process. A file which we gave different chmod attributes before updating the director had vcap:vcap after the update. As chmod does not print any output to stdout, the pre-start.stdout.log is empty. Prior to the change, the chmod action didn't write anything into director.stdout.log either.
Could you create a new BOSH release from develop or create a hotfix with these changes? After that, the issue can be closed.
Fixed in v260.1.
I tried to update the stemcell (3212 to 3232.4) and ran bosh-init for that. The deployment went as usual until near the end but then it failed at the point where it waited for the instance to be running:
Waiting for instance 'bosh/0' to be running... Failed
But the bosh director responded just fine to all requests afterwards (bosh status or bosh deployments worked as expected). When I looked into the director VM itself, I saw that monit summary reported Execution failed for the bosh director although the process was running, listening on port 25555, and had the PID that stood in the pid-file under /var/vcap/sys/run/director/director.pid. In the monit log file (/var/vcap/monit/monit.log) I saw that the director was mentioned as failed about 30 seconds after it was started, but there was another entry about 5 seconds later that it was started successfully.
Here is the deployment manifest I used: