No heart beat sent to controller when worker is pulling an image

cirruslabs / orchard

Orchestrator for running Tart Virtual Machines on a cluster of Apple Silicon devices

No heart beat sent to controller when worker is pulling an image #161

Closed eecsmap closed 3 months ago

eecsmap commented 3 months ago

I have Jenkins dynamically allocate VM instances on demand. When a VM is allocated to a worker and its image is not in the worker's cache, the worker starts pulling the image, which is fine. However, since the image is fairly large (~50 GB), the pull takes more than 3 minutes. During this time, the VM is marked as pending and the worker stops sending heartbeats. After 3 minutes, the worker is considered disconnected and the VM is marked as failed.

So, can we send the heartbeat from a separate thread and add a "pulling" status for VMs?
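To illustrate the request, here is a minimal sketch of a heartbeat that keeps running on a background thread while a long, blocking task (such as an image pull) is in progress. It is not Orchard's actual implementation: `run_with_heartbeat`, `slow_pull`, the `"pulling"` status value, and the intervals are all hypothetical names and numbers chosen for the example.

```python
import threading
import time


def run_with_heartbeat(task, send_heartbeat, interval_s):
    """Run task() while a background thread calls send_heartbeat() periodically.

    All names here are hypothetical; this only illustrates the idea of
    decoupling the heartbeat from a blocking image pull.
    """
    stop = threading.Event()

    def beat():
        # Event.wait() returns False on timeout (send another heartbeat)
        # and True once stop is set (exit the loop).
        while not stop.wait(interval_s):
            send_heartbeat()

    thread = threading.Thread(target=beat, daemon=True)
    thread.start()
    try:
        return task()
    finally:
        # Always stop the heartbeat thread, even if the task raises.
        stop.set()
        thread.join()


def demo():
    beats = []
    status = {"vm": "pulling"}  # hypothetical status, as suggested above

    def slow_pull():
        time.sleep(0.3)  # stand-in for a multi-minute image pull
        status["vm"] = "running"
        return "pulled"

    result = run_with_heartbeat(
        slow_pull, lambda: beats.append(time.monotonic()), 0.05
    )
    return result, len(beats), status["vm"]


if __name__ == "__main__":
    print(demo())
```

With this structure the pull can take arbitrarily long without the controller's timeout firing, because the heartbeat no longer waits for the pull to finish.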

ruimarinho commented 3 months ago

I had a small health check on the client side and noticed I was restarting the orchard worker due to the lack of a heartbeat. Turns out it was due to pulling an updated image layer. This would be a welcome improvement!

edigaryev commented 3 months ago

Please check out the new 0.16.1 version that was just released. It now features asynchronous VM creation, which should help with the issue you're experiencing.

eecsmap commented 3 months ago

I updated the controller to 0.16.1:

orchard --version
orchard version 0.16.1-510a259

I then created a VM named test_async, yet the heartbeat of worker mac-studio-004.local is still paused while the VM is pending.

orchard list vms
Name        Created        Image                                         Status   Restart policy          Assigned worker
test_async  2 minutes ago  oci-registry-dev-local/macos-ventura-vanilla  pending  OnFailure (0 restarts)  mac-studio-004.local
dev@macstudio001 ~ % orchard list workers
Name                  Last seen      Scheduling paused
mac-studio-001.local  1 second ago   false
mac-studio-002.local  3 seconds ago  false
mac-studio-003.local  2 seconds ago  false
mac-studio-004.local  2 minutes ago  false

edigaryev commented 3 months ago

> I updated the controller to 0.16.1

Could you please update the worker too?

We should probably make this clearer in the release notes.

eecsmap commented 3 months ago

Works! Thanks.