tpokorra closed this issue 8 years ago
Either the machine is released too early in https://github.com/SolidCharity/LightBuildServer/blob/master/lib/LightBuildServer.py#L190, or the two jobs are started concurrently in https://github.com/SolidCharity/LightBuildServer/blob/master/lib/LightBuildServer.py#L140?
A hanging build (https://github.com/SolidCharity/LightBuildServer/blob/master/lib/LightBuildServer.py#L163) is not the cause, because BuildingTimeout is 1000 seconds.
I have changed /etc/logrotate.d/lightbuildserver to rotate weekly and increased the file size, so that I can see in the log what happens at night.
Perhaps this is related to visiting the machines page, which triggers a new build if machines are available? Does it only happen when one machine is just being started up, and the next one overrides it?
The previous job A was stopped after a timeout, then another job B is started, and job C is started at the same time.
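One way to rule out B and C both getting the same machine would be to claim the machine with a single conditional UPDATE, so that only one of two concurrent requests can win. This is only a sketch: the table `machine`, its columns, the status strings and the helpers `make_demo_db` and `claim_machine` are made up for illustration and are not taken from LightBuildServer.py.

```python
import sqlite3


def make_demo_db():
    """Create a throwaway in-memory database with one build machine (hypothetical schema)."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE machine (name TEXT PRIMARY KEY, status TEXT, job TEXT)"
    )
    con.execute("INSERT INTO machine VALUES ('build03', 'AVAILABLE', NULL)")
    con.commit()
    return con


def claim_machine(con, machine_name, job_name):
    """Try to reserve the machine for a job; return True only if this call won.

    The WHERE clause makes the check and the update one atomic statement,
    so a second caller sees status 'BUILDING' and updates zero rows.
    """
    cur = con.execute(
        "UPDATE machine SET status = 'BUILDING', job = ? "
        "WHERE name = ? AND status = 'AVAILABLE'",
        (job_name, machine_name),
    )
    con.commit()
    return cur.rowcount == 1


if __name__ == "__main__":
    con = make_demo_db()
    print(claim_machine(con, "build03", "job B"))  # first claim
    print(claim_machine(con, "build03", "job C"))  # second claim on the same machine
```

With something like this, job C would have to stay in the queue until the machine is released again.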
CheckForHangingBuild: https://github.com/SolidCharity/LightBuildServer/blob/master/lib/LightBuildServer.py#L146
CheckForHangingBuild is called in ProcessBuildQueue: https://github.com/SolidCharity/LightBuildServer/blob/master/lib/LightBuildServer.py#L379
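For reference, a rough sketch of what such a timeout check amounts to, assuming it simply compares elapsed time against BuildingTimeout. The function `is_hanging` and its parameters are hypothetical; the real CheckForHangingBuild may measure from a different point (for example the last log output).

```python
import time

# Timeout value mentioned in this thread; the name of the constant is an assumption.
BUILDING_TIMEOUT_SECONDS = 1000


def is_hanging(build_started_at, now=None, timeout=BUILDING_TIMEOUT_SECONDS):
    """Return True if the build has been running longer than the timeout.

    build_started_at and now are Unix timestamps in seconds.
    """
    if now is None:
        now = time.time()
    return (now - build_started_at) > timeout


if __name__ == "__main__":
    start = time.time()
    # A build that has only run for 43 seconds is far below the timeout.
    print(is_hanging(start, now=start + 43))
    print(is_hanging(start, now=start + 1001))
```

A build that is killed after well under a minute cannot have been stopped by this kind of check, which is why the timeout is ruled out above.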
Maybe add a sleep after the docker restart?
There are too many calls to docker stop after a build times out:
And somehow it looks like two jobs are started at the same time:
[pid: 14239|app: 0|req: 8063/8063] 127.0.0.1 () {34 vars in 428 bytes} [Tue Nov 17 05:18:42 2015] GET /processbuildqueue => generated 0 bytes in 3608 msecs (HTTP/1.1 200) 2 headers in 0 bytes (0 switches on core 1)
[00:00:00] now running: ssh -f -o "StrictHostKeyChecking no" -p 22 -i /etc/lightbuildserver/container/container_rsa root@build03.lbs.solidcharity.com "export LANG=C; systemctl restart docker && sleep 60 2>&1; echo \$?"
[00:00:00] now running: ssh -f -o "StrictHostKeyChecking no" -p 22 -i /etc/lightbuildserver/container/container_rsa root@build03.lbs.solidcharity.com "export LANG=C; systemctl restart docker && sleep 60 2>&1; echo \$?"
That makes sense, because the machine was marked as available twice while it was still being stopped.
Adding a new state STOPPING should solve this problem...
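A minimal sketch of the STOPPING idea: the machine only becomes available again once the stop has really finished, and each transition is refused if the machine is not in the expected state. The names `MachineState`, `Machine`, `claim`, `begin_stop` and `finish_stop` are hypothetical, as are the AVAILABLE and BUILDING state names; only STOPPING comes from this thread.

```python
import threading
from enum import Enum


class MachineState(Enum):
    AVAILABLE = "AVAILABLE"
    BUILDING = "BUILDING"
    STOPPING = "STOPPING"


class Machine:
    """Hypothetical in-memory stand-in for one build machine record."""

    def __init__(self, name):
        self.name = name
        self.state = MachineState.AVAILABLE
        self._lock = threading.Lock()

    def _transition(self, expected, new):
        # Only change the state if it is what the caller expects;
        # otherwise report failure and leave the state untouched.
        with self._lock:
            if self.state is not expected:
                return False
            self.state = new
            return True

    def claim(self):
        """AVAILABLE -> BUILDING; fails while building or stopping."""
        return self._transition(MachineState.AVAILABLE, MachineState.BUILDING)

    def begin_stop(self):
        """BUILDING -> STOPPING; a second stop request is refused."""
        return self._transition(MachineState.BUILDING, MachineState.STOPPING)

    def finish_stop(self):
        """STOPPING -> AVAILABLE; can only succeed once per stop."""
        return self._transition(MachineState.STOPPING, MachineState.AVAILABLE)


if __name__ == "__main__":
    m = Machine("build03")
    m.claim()
    m.begin_stop()
    print(m.claim())        # refused: machine is still stopping
    print(m.finish_stop())  # machine becomes available exactly once
    print(m.finish_stop())  # second release is refused
```

In this sketch a repeated stop or release is a no-op, so a duplicate docker stop after a timeout could not make the machine available while the first stop is still running.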
This probably causes problems for both lxc and docker. E.g. https://lbs.solidcharity.com/logs/tbits.net/kolab-nightly-sync/updatecodeLBS/master/centos/7/amd64/212 was started at 05:19:03, but brutally stopped 43 seconds into the build, because https://lbs.solidcharity.com/logs/tbits.net/kolab-nightly/kolab-utils/master/centos/7/amd64/227 was started at 05:19:30.
Both jobs are shown in the previous jobs list with the same finished time: 05:19:56.
There is only one build machine configured in this example.