SCIInstitute / shapeworks-cloud

A web version of ShapeWorks Studio
https://www.shapeworks-cloud.org/#/
Apache License 2.0

Scaling GPU workers #357

Closed by annehaley 5 months ago

annehaley commented 7 months ago

This PR is scoped to the following tasks:

EDIT: After merging, testing, and making additional changes (linked at the bottom of this PR), the following video was captured to exhibit the new behavior.

https://github.com/girder/shapeworks-cloud/assets/44912689/19fe6dfb-2473-4115-afc7-601a4c1dc0ef

annehaley commented 7 months ago

@manthey I was able to mock this successfully after our last discussion. Could you please test this out?

After running docker compose up, you should see 5 mock "deepssm" tasks requested every 90 seconds, with each task taking 20 seconds. Every 10 seconds, the worker management task fires to provision, start, or stop up to 3 workers. If you can, try changing these values and let me know if you see any unexpected behavior.

If everything works as expected, we can provision the real AWS workers and I can modify the management task. In that version, the management task will not attempt to provision workers that do not already exist; it will only start and stop them.
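For anyone reviewing the scaling behavior, here is a minimal sketch of what a single tick of the management task could decide. It is not the code in this PR. The policy (one worker per pending task, capped at the maximum) and the names plan_workers, Plan, and allow_provision are assumptions made for illustration. Setting allow_provision to False approximates the AWS version described above, which only starts and stops existing workers.

```python
from dataclasses import dataclass

MAX_WORKERS = 3  # "up to 3 workers" in the mock setup


@dataclass
class Plan:
    provision: int  # workers to create because too few exist
    start: int      # existing stopped workers to start
    stop: int       # running workers to stop


def plan_workers(pending_tasks: int, running: int, stopped: int,
                 max_workers: int = MAX_WORKERS,
                 allow_provision: bool = True) -> Plan:
    """Decide what one management tick should do (hypothetical policy).

    Assumes one worker per pending task, capped at max_workers.
    """
    desired = min(max(pending_tasks, 0), max_workers)
    if desired > running:
        needed = desired - running
        # Prefer starting workers that already exist.
        start = min(needed, stopped)
        provision = 0
        if allow_provision:
            # Never let the total number of workers exceed the cap.
            room = max(max_workers - running - stopped, 0)
            provision = min(needed - start, room)
        return Plan(provision=provision, start=start, stop=0)
    # Enough (or too many) workers are running: stop the surplus.
    return Plan(provision=0, start=0, stop=running - desired)
```

With the mock values, a tick that sees 5 pending tasks and no workers would plan to provision 3; once the queue drains, a later tick would plan to stop all running workers.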

annehaley commented 6 months ago

I've gone as far as I can without merging. The Ansible script only deploys master to the worker, so the worker does not currently have the startup script from this branch. We'll have to test it after merging and possibly continue fixing things in another PR.

annehaley commented 5 months ago

@manthey I added comments in a few places to explain the things you suggested. Also, after returning to this branch and using a new build, I realized I needed a guard in the inspect_queue method for calls made before the gpu queue exists, so I added a try-except clause that explicitly uses a pyrabbit class.
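A rough sketch of the kind of guard described above, for readers without the diff open. It is not the code in this PR: QueueNotFound and FakeBrokerClient are hypothetical stand-ins so the example runs without RabbitMQ or pyrabbit installed, and the get_queue_depth method name on the fake client is an assumption. The actual change catches a pyrabbit class instead.

```python
class QueueNotFound(Exception):
    """Hypothetical stand-in for the broker client's missing-queue error."""


class FakeBrokerClient:
    """Hypothetical in-memory client; maps queue names to message counts."""

    def __init__(self, depths):
        self._depths = dict(depths)

    def get_queue_depth(self, vhost, name):
        try:
            return self._depths[name]
        except KeyError:
            raise QueueNotFound(name) from None


def inspect_queue(client, name="gpu", vhost="/"):
    """Return the number of waiting messages, or 0 if the queue is missing."""
    try:
        return client.get_queue_depth(vhost, name)
    except QueueNotFound:
        # Catch only the specific missing-queue error so that other
        # failures (for example, a connection problem) still surface.
        return 0
```

Catching one named exception class rather than using a bare except keeps the early-startup case quiet without hiding unrelated errors.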

annehaley commented 5 months ago

After merging and testing the new deployment action, these commits were made to master to ensure the startup script runs each time the worker boots: https://github.com/girder/shapeworks-cloud/commits/master/?since=2024-03-13&until=2024-03-14

annehaley commented 5 months ago

Additional changes after more testing:

https://github.com/girder/shapeworks-cloud/commit/67b591f60ed7d891bae890610267a747ceaea186
https://github.com/girder/shapeworks-cloud/commit/a47f575b8c04761b6036b9f3943548437054b17d
https://github.com/girder/shapeworks-cloud/commit/5428b2c00b225f0647facc2ff73a41c2c05461e0
https://github.com/girder/shapeworks-cloud/commit/56ebafa9963eb6958154a7438e0f5599664a8341
https://github.com/girder/shapeworks-cloud/commit/f59ce71cbdbe50e55a6b8171b6c7b85e56b32273
https://github.com/girder/shapeworks-cloud/commit/4f1ce4fcf66cf54221fc407374fa60b10d1b777a