SolidCharity / LightBuildServer

LightBuildServer for building rpm and deb packages and running CI scripts, using linux containers
BSD 3-Clause "New" or "Revised" License

the next job is sometimes started, even if the current job is still running successfully #130

Closed: tpokorra closed this issue 8 years ago

tpokorra commented 8 years ago

This probably causes problems for both LXC and Docker. For example, https://lbs.solidcharity.com/logs/tbits.net/kolab-nightly-sync/updatecodeLBS/master/centos/7/amd64/212 was started at 05:19:03 but was abruptly stopped 43 seconds into the build, because https://lbs.solidcharity.com/logs/tbits.net/kolab-nightly/kolab-utils/master/centos/7/amd64/227 was started at 05:19:30.

Both jobs appear in the list of previous jobs with the same finish time: 05:19:56

there is only one build machine configured in this example

tpokorra commented 8 years ago

either the machine is released too early in https://github.com/SolidCharity/LightBuildServer/blob/master/lib/LightBuildServer.py#L190, or the two jobs are started concurrently in https://github.com/SolidCharity/LightBuildServer/blob/master/lib/LightBuildServer.py#L140?

A hanging build (https://github.com/SolidCharity/LightBuildServer/blob/master/lib/LightBuildServer.py#L163) is not the cause, because BuildingTimeout is 1000 seconds and the job was stopped only 43 seconds into the build.
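To make the timeout argument concrete, here is a minimal sketch of such a check. It is not the code behind the link above; the function name `is_hanging` and its parameters are made up for illustration, and only the 1000 second timeout and the 43 second runtime come from this thread.

```python
# Timeout value quoted in this thread (BuildingTimeout), in seconds.
BUILDING_TIMEOUT_SECONDS = 1000


def is_hanging(elapsed_seconds, timeout_seconds=BUILDING_TIMEOUT_SECONDS):
    """Hypothetical helper: a build counts as hanging only once it has
    been running for longer than the configured timeout."""
    return elapsed_seconds > timeout_seconds


# The job in this report was stopped 43 seconds into the build,
# so a timeout-based check should not have fired for it.
print(is_hanging(43))    # False
print(is_hanging(1001))  # True
```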

tpokorra commented 8 years ago

I have changed /etc/logrotate.d/lightbuildserver to rotate weekly and increased the file size limit, so that I can see in the log what happens at night.

tpokorra commented 8 years ago

Perhaps this is related to visiting the machines page, which triggers a new build if machines are available? Does it only happen when one machine is just being started up, and the next one overrides it?
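If two triggers (for example the machines page and the regular queue processing) can both see the same machine as free, one way to rule out the override is to make "check and take" a single atomic step. The following is only a sketch under that assumption: `MachinePool`, `claim` and `release` are hypothetical names, the state is kept in memory, and if the requests are served by separate processes the same idea would need a database transaction or a file lock instead of `threading.Lock`.

```python
import threading


class MachinePool:
    """Hypothetical in-memory pool, used only to illustrate an atomic claim."""

    def __init__(self, names):
        self._lock = threading.Lock()
        # Maps machine name to the job that owns it; None means free.
        self._owner = {name: None for name in names}

    def claim(self, job):
        """Atomically hand a free machine to job, or return None if all are busy."""
        with self._lock:
            for name, owner in self._owner.items():
                if owner is None:
                    self._owner[name] = job
                    return name
        return None

    def release(self, name, job):
        """Free the machine only if job is still its owner."""
        with self._lock:
            if self._owner.get(name) != job:
                return False
            self._owner[name] = None
            return True


# One build machine, two jobs arriving almost at the same time.
pool = MachinePool(["build03"])
first = pool.claim("job-212")
second = pool.claim("job-227")
print(first, second)  # build03 None
```

With a single machine configured, the second job gets `None` and has to stay in the queue instead of overriding the first one.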

tpokorra commented 8 years ago

Previous job A was stopped after a timeout, then another job B was started, and job C was started at the same time.

CheckForHangingBuild: https://github.com/SolidCharity/LightBuildServer/blob/master/lib/LightBuildServer.py#L146

CheckForHangingBuild is called in ProcessBuildQueue: https://github.com/SolidCharity/LightBuildServer/blob/master/lib/LightBuildServer.py#L379

Maybe add a sleep after the docker restart?

tpokorra commented 8 years ago

there are too many calls for docker stop after a build times out:

and somehow it looks like two jobs are started at the same time:

[pid: 14239|app: 0|req: 8063/8063] 127.0.0.1 () {34 vars in 428 bytes} [Tue Nov 17 05:18:42 2015] GET /processbuildqueue => generated 0 bytes in 3608 msecs (HTTP/1.1 200) 2 headers in 0 bytes (0 switches on core 1)
[00:00:00] now running: ssh -f -o "StrictHostKeyChecking no" -p 22 -i /etc/lightbuildserver/container/container_rsa root@build03.lbs.solidcharity.com "export LANG=C; systemctl restart docker && sleep 60 2>&1; echo \$?"
[00:00:00] now running: ssh -f -o "StrictHostKeyChecking no" -p 22 -i /etc/lightbuildserver/container/container_rsa root@build03.lbs.solidcharity.com "export LANG=C; systemctl restart docker && sleep 60 2>&1; echo \$?"

That makes sense: the machine was set to available twice while it was still being stopped.

Adding a new state STOPPING should solve this problem.
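A rough sketch of how a STOPPING state could prevent the double release. This is not the actual LBS code: the `Machine` class, its method names, and the AVAILABLE and BUILDING state names are assumptions for illustration; only STOPPING is the state proposed here.

```python
from enum import Enum


class State(Enum):
    AVAILABLE = "AVAILABLE"  # assumed name for a free machine
    BUILDING = "BUILDING"    # assumed name for a machine running a job
    STOPPING = "STOPPING"    # the new state proposed in this issue


class Machine:
    """Hypothetical machine record with guarded state transitions."""

    def __init__(self, name):
        self.name = name
        self.state = State.AVAILABLE

    def start_build(self):
        # A job may only start on a machine that is really free.
        if self.state is not State.AVAILABLE:
            return False
        self.state = State.BUILDING
        return True

    def begin_stop(self):
        # Only the first stop request wins; repeated requests are ignored.
        if self.state is not State.BUILDING:
            return False
        self.state = State.STOPPING
        return True

    def finish_stop(self):
        # The machine becomes available exactly once, after the stop is done.
        if self.state is not State.STOPPING:
            return False
        self.state = State.AVAILABLE
        return True


m = Machine("build03")
print(m.start_build())  # True
print(m.begin_stop())   # True
print(m.begin_stop())   # False, second stop request is ignored
print(m.start_build())  # False, machine is still stopping
print(m.finish_stop())  # True
print(m.start_build())  # True
```

While the machine is in STOPPING, a second stop request does nothing and no new job can be started, so the machine cannot be released twice.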