aws-deadline / deadline-cloud-worker-agent

The AWS Deadline Cloud worker agent can be used to run a worker in an AWS Deadline Cloud fleet.
Apache License 2.0

Bug: Queue User should not be allowed to be Admin or the same as the agent user #488

Open yuanmich2 opened 2 days ago

yuanmich2 commented 2 days ago

Describe Behaviour

Right now, you can specify an admin queue user, which is a security risk. I fell into this trap when I accidentally specified the queue user as the agent user by passing it to the --user flag when running the install-deadline-worker command.

Expected Behaviour

You should have to deliberately want to render with admin privileges and accept the security risk before being allowed to do so.

Renders should throw an error if trying to run as an admin user. If the queue user is the same as the agent user, the error should point out that this is the root cause. The agent user is always forced to be admin, so even if you weren't trying to run your render with admin privileges but misunderstood the --user flag, you'd end up setting the agent and queue user to be the same user and running with admin privileges.

Current Behaviour

You just render with a bad security posture because you have admin privileges.

Reproduction Steps

Set up a worker. Use the --user and --grant-required-permissions flags on the queue user when running install-deadline-worker.

Possible Solution

Add better warning and error messages to prevent accidentally setting it up this way like I did.

Have a config option that explicitly enables rendering as admin. We can put a comment in the config file that outlines the potential risk.
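The check proposed above could be sketched roughly as follows. This is a minimal illustration, not the worker agent's actual code: `check_queue_user`, `QueueUserError`, and the `allow_admin` parameter are hypothetical names, and whether the queue user is an administrator is passed in as a plain boolean rather than looked up from Windows.

```python
class QueueUserError(Exception):
    """Hypothetical error raised when jobs would run with admin privileges."""


def check_queue_user(
    queue_user: str,
    agent_user: str,
    queue_user_is_admin: bool,
    allow_admin: bool = False,
) -> None:
    """Raise QueueUserError unless the queue user is safe or the risk was accepted."""
    # Windows account names are case-insensitive, so compare case-folded names.
    same_as_agent = queue_user.casefold() == agent_user.casefold()

    # On Windows the agent user is always an administrator, so reusing it
    # for jobs implies admin privileges even if nobody asked for them.
    runs_as_admin = same_as_agent or queue_user_is_admin
    if not runs_as_admin or allow_admin:
        return

    if same_as_agent:
        raise QueueUserError(
            f"Queue user '{queue_user}' is the same as the agent user, which is "
            "always an administrator. Did you mean to pass a different account?"
        )
    raise QueueUserError(
        f"Queue user '{queue_user}' is an administrator. Jobs would run with "
        "admin privileges."
    )
```

Calling `check_queue_user("deadline-worker", "deadline-worker", False)` raises an error that names the shared account as the root cause, while passing `allow_admin=True` models the deliberate opt-in.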

Package Version

deadline-worker-agent 0.27.4

Language Version

python 3.12

Dependencies

No response

Operating System

Windows

Other information

No response

jusiskin commented 1 day ago

Thanks for reporting your issue with the worker agent.

It sounds like the --user argument of install-deadline-worker was misunderstood to be the user account that jobs will run as. In reality, the --user argument specifies the user that the worker agent runs as. When you run install-deadline-worker --help on a Windows host today, you will see:

  --user USER           The username of the AWS Deadline Cloud Worker Agent user. Defaults to "deadline-worker".

Perhaps there is room in that help text to elaborate further or make readers aware that queue users are the recommended setup.

While the worker agent user CAN also be used for running jobs, it is not recommended (see security best-practice documentation in the Deadline Cloud user guide):

  • Don’t set the queue jobRunAsUser to the name of the OS user that the worker agent runs as.

The intended setup is documented in the "Worker host setup" topic of the Deadline Cloud user guide:

The Deadline Cloud worker agent should run as a dedicated agent-specific user on the host. You should configure the jobRunAsUser property of Deadline Cloud queues so that workers will run the queue jobs as a specific operating system user and group.

These are just best-practices and we recognize that there may be situations where people may choose to deviate from best-practices for business/technical reasons. We believe it is important to provide escape hatches for these use cases.

As you have noted, the security considerations are different on Windows where the worker agent user account must be an administrator. For this reason, we made this configuration only possible when the person setting up a Deadline Cloud queue takes explicit action to configure the queue to run as the agent user.

Background

Today, there are two ways to set up the worker agent on a Windows host that can result in the jobs being run as an administrator user. I've provided some explanation on how the worker agent and Deadline Cloud must be configured to achieve this:

Running jobs as the agent user

The operating system user that the worker agent program runs as must be an administrator. This is because Windows will only permit the creation of subprocesses that run as a different user if the originating process is running as an admin user.

The worker agent makes it difficult (but not impossible) to set up the worker to run jobs as this same agent user. The worker agent:

The only way to set up a farm to run jobs as the agent user is to configure the queue jobRunAsUser.windows (ref docs) to refer to the same operating system user that the agent runs as. The user guide provides security best-practice documentation advising against this, but we wanted to provide an escape hatch for those who choose to accept this risk and need to run jobs as this same user.

Running jobs as an admin queue user

Customers can configure their queue jobRunAsUser.windows to point to a different operating system user that is an administrator. While this is less secure than a non-administrator user, we still want to leave an escape hatch for customers to accomplish their goals. The "Worker hosts" section of the "Security Best Practices for Deadline Cloud" topic of the user guide states the best-practice here:

Grant queue users least-privileged OS permissions required for the intended queue workloads. Ensure that they don't have filesystem write permissions to worker agent program files or other shared software.


With all of this context in mind, I'm wondering if you have any suggestions on what could be improved here. Is there a point in the onboarding journey (documentation / installer / console) where we could best intervene or guide people towards the proper setup? Keeping in mind we do want to keep an escape hatch for those who understand and accept the risks and have a justified need to run their jobs as an administrator.

yuanmich2 commented 1 day ago

Make the escape hatch obvious.

You actually did a good job of outlining a bunch of reasons why I think the solution is not obvious right now. For example, the run_jobs_as_agent_user config option doesn't actually allow you to run as the agent user; it just makes your job fail on Windows, which isn't very helpful to anyone. That option causes the worker to raise an error about admin permissions, but paradoxically, if you had just specified an admin user with the --user flag, you get no error. In my original post, I opened the escape hatch without knowing I was opening an escape hatch because I had agent and queue users confused.

A config option named something like "allow_admin_users_to_render" would go a long way. We could include it in the auto-generated config file with a big comment that explains the risk very clearly.
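To make that concrete, here is a small sketch of how an installer might emit such an option with its warning comment. Everything here is an assumption for illustration: the key name `allow_admin_users_to_render` is only the one suggested above, `render_admin_option` is a hypothetical helper, and the comment wording is a placeholder rather than approved guidance.

```python
def render_admin_option(allow: bool = False) -> str:
    """Return a config-file fragment for a hypothetical admin opt-in key."""
    value = "true" if allow else "false"
    lines = [
        "# WARNING: SECURITY RISK",
        "# Setting this to true lets jobs run as an administrator account,",
        "# including the worker agent user itself. Job code would then have",
        "# full control of the worker host. Leave this set to false unless",
        "# you understand and accept that risk.",
        f"allow_admin_users_to_render = {value}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Show the fragment as it would appear in a generated config file.
    print(render_admin_option(), end="")
```

Defaulting the value to false means the escape hatch stays closed until someone edits the file and reads the warning on the way.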

It might also be worth changing the flag from --user to --agent-user or something to prevent the confusion I had between agent and queue users in the future, but I think as long as such a confusion doesn't open up security holes, it's not that big of a deal.
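One way the rename could work without breaking existing scripts is to accept both spellings for the same value. The sketch below uses Python's standard argparse module with a stand-in parser; it is not the real install-deadline-worker code, and the `--agent-user` flag and help wording are hypothetical.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a stand-in parser that accepts --agent-user and the older --user."""
    parser = argparse.ArgumentParser(prog="install-deadline-worker")
    parser.add_argument(
        "--agent-user",
        "--user",  # kept as an alias so existing invocations still work
        dest="agent_user",
        default="deadline-worker",
        help=(
            "The OS user that the worker agent itself runs as. This is NOT "
            "the user that jobs run as; configure that on the queue."
        ),
    )
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(["--user", "render-agent"])
    print(args.agent_user)  # prints "render-agent"
```

Listing `--agent-user` first makes it the spelling shown first in the help output, while the alias keeps the old flag working.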