Expanded cloud functionality

bcipriano commented 2 years ago

At the moment OpenCue is agnostic as to where resources run -- if RQD can run on a machine and that machine is networked to be able to reach the Cuebot, OpenCue doesn't really care where that machine is running. However, the TSC is currently discussing ways that cloud functionality can be expanded to be easier/more helpful to cloud users.

Major use cases

Easily deploy OpenCue in the cloud.
- Enable a "cloud-first" type of deployment, where the database, Cuebot, and RQD can all be easily deployed to the cloud.
- The solution here should be cloud-agnostic -- ideally a single solution with flavors for each major cloud provider, and easy ways to extend the solution to other clouds.
- The solution could provide a good alternative to users just getting started with OpenCue, similar to the sandbox.
Easily scale RQD workers in the cloud. Help users make informed decisions about how many workers they need to complete their current OpenCue workload.
- This is distinct from the first case -- users may have their RQD pools configured in many different ways, and our solution here should not assume they are using any specific tech.

Easily deploy OpenCue in the cloud

Ideas:

Utilize Terraform for this. Have a set of standard Terraform scripts that can plug into the cloud provider of choice.
Kubernetes could be a good candidate here, as all the major clouds support it and it provides a good path towards a scalable, production-ready deployment.
Which filesystem is used is a question here. OpenCue generally doesn't care, but RQD and the Python components need access to a single shared filesystem. This may vary based on which cloud is in use.
CueGUI currently requires filesystem access for displaying logs. If we can break this dependency we'll have an expanded set of choices of tech, less care as to which filesystem is in use, and the work will be generally less complex. (From https://github.com/AcademySoftwareFoundation/OpenCue/issues/1097#issuecomment-1057567921)
- A good start here would be to eliminate the assumption that logs and assets be stored within the same filesystem.
- We could add an HTTP endpoint to Cuebot or some new process, which could be queried to fetch log contents. This would require Cuebot to have filesystem access, but this is still an improvement over requiring the same of the client-side Python tools. gRPC is also an option -- streaming could be a great fit for logs.
- We could support third-party logging solutions like Azure Monitor or GCP Cloud Logging.
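As a rough illustration of the HTTP endpoint idea, here is a minimal sketch using only the Python standard library. It is not part of Cuebot or any existing OpenCue component. The `LOG_ROOT` variable, the URL layout (the log path relative to the log root, with an optional `offset` query parameter for tailing), and the function names are all assumptions made for this example.

```python
import os
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer
from urllib.parse import parse_qs, unquote, urlparse

# Hypothetical setting: the directory on the server where frame logs are written.
LOG_ROOT = os.environ.get("LOG_ROOT", "/tmp/opencue-logs")


def read_log_chunk(log_root, relative_path, offset=0, max_bytes=65536):
    """Return up to max_bytes of a log file, starting at byte offset.

    Raises ValueError if the requested path resolves outside log_root.
    """
    root = os.path.realpath(log_root)
    target = os.path.realpath(os.path.join(root, relative_path.lstrip("/")))
    if os.path.commonpath([root, target]) != root:
        raise ValueError("path escapes log root")
    with open(target, "rb") as f:
        f.seek(offset)
        return f.read(max_bytes)


class LogHandler(BaseHTTPRequestHandler):
    """Serves GET /<path under LOG_ROOT>?offset=<bytes> as plain text."""

    def do_GET(self):
        url = urlparse(self.path)
        query = parse_qs(url.query)
        try:
            offset = max(int(query.get("offset", ["0"])[0]), 0)
            body = read_log_chunk(LOG_ROOT, unquote(url.path), offset)
        except (ValueError, OSError):
            # Bad offset, path outside the log root, or unreadable file.
            self.send_error(404)
            return
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8080):
    """Run the log server until interrupted."""
    ThreadingHTTPServer(("", port), LogHandler).serve_forever()
```

A client such as CueGUI could poll with an increasing `offset` to tail a running frame's log without mounting the render filesystem. Authentication and gRPC streaming are left out of this sketch.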

Easily scale RQD workers in the cloud

Ideas:

Don't worry about the specifics (command line/API call) by which workers are added to or removed from the pool. This will vary widely depending on the clouds/tech in use, and is ultimately of low value to users -- it's not hard to add workers, but it's hard to know how many workers should be added.
Provide a hook / API method at the Cuebot level that will help make this decision. Namely, report how much work is in the queue or how many workers are needed to complete that work.
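To make the "how many workers are needed" question concrete, here is a back-of-the-envelope sketch of the kind of calculation such a hook might expose. Everything in it is an assumption for illustration: the `PendingWork` fields, the idea of a target completion window, and the `workers_needed` function are hypothetical and do not correspond to an existing Cuebot API.

```python
import math
from dataclasses import dataclass


@dataclass
class PendingWork:
    """Hypothetical summary of one layer's waiting frames."""
    waiting_frames: int
    cores_per_frame: float
    avg_frame_seconds: float


def workers_needed(work, cores_per_worker, target_seconds, current_workers=0):
    """Estimate how many additional workers would clear the queue in time.

    Sums the outstanding core-seconds, divides by the target window to get
    the number of cores that must run in parallel, then converts cores to
    whole workers. Ignores packing granularity, startup time, and new
    submissions, so treat the result as a lower bound.
    """
    core_seconds = sum(
        w.waiting_frames * w.cores_per_frame * w.avg_frame_seconds for w in work
    )
    cores = core_seconds / target_seconds
    total_workers = math.ceil(cores / cores_per_worker)
    return max(total_workers - current_workers, 0)


queue = [
    PendingWork(waiting_frames=100, cores_per_frame=2, avg_frame_seconds=600),
    PendingWork(waiting_frames=50, cores_per_frame=4, avg_frame_seconds=1200),
]
# 360,000 core-seconds in one hour needs 100 cores, i.e. 7 workers of 16 cores.
extra = workers_needed(queue, cores_per_worker=16, target_seconds=3600,
                       current_workers=2)
```

With two workers already running, `extra` is 5. The autoscaling mechanism itself (instance groups, Kubernetes, and so on) stays outside OpenCue and only consumes this number.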

DiegoTavares commented 2 years ago

For logging, it would be interesting to add an HTTP endpoint to enable decoupling the GUI and the filesystem used by RQD.

bcipriano commented 2 years ago

> For logging, it would be interesting to add an HTTP endpoint to enable decoupling the GUI and the filesystem used by RQD.

Great idea -- I've incorporated this into the ideas list.

malkia commented 1 year ago

We have a possible use case. We have cuebot on GCP and a few Windows render workers there too. Initial tests indicate that this works fine, and scaling to more should not be an issue. The machines and cuebot can see and connect to each other without a problem.

The issue arises when a local worker (right now only for development purposes) joins. It's able to talk to the cuebot and registers itself, but cuebot can't talk back (due to our global firewall settings, which we don't want to change).

I was wondering how people have solved this? A proxy maybe? It also made me wonder whether, for this special case, it would be worth creating an additional gRPC bidi stream (both ways), so that the worker can talk to the cuebot and the cuebot can talk back and serve jobs. Understandably this would mean more pressure on the cuebot to keep that connection open, so it would only be useful if there are a handful of such machines.

Anyone else run into something like this?
