buildbot / buildbot-infra

Buildbot infrastructure

Buildbot infrastructure instability caused by time synchronization on latent workers #269

Closed pmisik closed 1 month ago

pmisik commented 8 months ago

Hi

I suspect there is instability in the Buildbot infrastructure caused by time synchronization on latent workers. On latent workers p12-pd-?? I'm seeing bizarre errors that seem to be time-sync related. It looks as if the time synchronization occurred during the execution of steps. The reason I suspect a time sync issue is that I'm randomly seeing these problems:

@p12tic what do you think?

verm commented 8 months ago

I just checked, and it seems that ntpd was started incorrectly on service3, but I don't see anything in the logs about time issues. When I restarted it, the adjustment was on the order of microseconds.

pmisik commented 8 months ago

I’m not sure if you use VMs for running worker machines. Since the time offset was significant (16497 seconds = 04:34:57), I wonder if this is one of the time synchronization issues you can run into on VM infrastructure (at least I've encountered them).

For example, https://buildbot.buildbot.net/#/builders/108/builds/2120 shows an interesting situation where the time was probably shifted twice.
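One way such a jump could be caught from inside a worker is to compare the wall clock with the monotonic clock around a step. The following is only a minimal sketch under assumptions: the helper names (`clock_drift`, `format_offset`, `run_step_with_clock_check`) and the 5-second threshold are hypothetical and not part of Buildbot, and it assumes the container runtime reports both clocks faithfully, which may not hold in the setup discussed here.

```python
import time
from datetime import timedelta


def clock_drift(wall_start, mono_start, wall_now, mono_now):
    """Seconds the wall clock moved beyond what the monotonic clock measured."""
    return (wall_now - wall_start) - (mono_now - mono_start)


def format_offset(seconds):
    """Render a second count as H:MM:SS for log messages."""
    return str(timedelta(seconds=int(seconds)))


def run_step_with_clock_check(step, threshold=5.0):
    """Run `step` (any callable) and report whether the wall clock jumped meanwhile.

    `threshold` is an arbitrary illustrative value, in seconds.
    """
    wall_start, mono_start = time.time(), time.monotonic()
    result = step()
    drift = clock_drift(wall_start, mono_start, time.time(), time.monotonic())
    return result, drift, abs(drift) > threshold
```

With the offset quoted above, `format_offset(16497)` gives `4:34:57`. A step whose drift exceeds the threshold would be a candidate for "time was adjusted while the step ran".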

p12tic commented 8 months ago

Interesting. These workers are on a machine I boot up when I want faster test execution. Recently I migrated them to podman containers using the gVisor container runtime. Probably gVisor doesn't fake syscalls well enough.

verm commented 8 months ago

Just checking in: this isn't an issue with time on the master? It sounds like it's not, but I want to make sure. If there's anything I need to do, let me know.

p12tic commented 8 months ago

@verm There are no issues with time on the master. For any issues on p12-* workers, the worker setup is the first suspect.

verm commented 8 months ago

@p12tic okay great!

pmisik commented 7 months ago

Now it looks like the p12-pd-? workers have run out of disk space on /home, judging by errors like:

error Error: ENOSPC: no space left on device, mkdir '/home/buildbot/...
https://buildbot.buildbot.net/#/builders/126/builds/128
https://buildbot.buildbot.net/#/builders/122/builds/1120

WARNING: Building wheel for buildbot failed: [Errno 28] No space left on device: '/home/buildbot/.cache/pip/wheels/62'
https://buildbot.buildbot.net/#/builders/127/builds/132

This applies at least to
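A worker could check for this condition before starting a build. The snippet below is a rough sketch, not something the workers are known to run: `has_enough_space`, `free_fraction`, and the 10% threshold are hypothetical, and `/home` is used as the default path only because that is where the errors above occurred.

```python
import shutil


def free_fraction(usage):
    """Fraction of the filesystem that is still free (0.0 to 1.0)."""
    if usage.total == 0:
        # Some pseudo-filesystems report a zero size; treat them as full.
        return 0.0
    return usage.free / usage.total


def has_enough_space(path="/home", min_free_fraction=0.10):
    """Return True if the filesystem holding `path` has enough free space.

    `min_free_fraction` is an arbitrary illustrative threshold.
    """
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return free_fraction(usage) >= min_free_fraction
```

For example, `has_enough_space("/home")` returning `False` would be a signal to clean caches such as `~/.cache/pip` before ENOSPC errors start to appear.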

p12tic commented 1 month ago

This is no longer a problem, closing.