aws / aws-rfdk

The Render Farm Deployment Kit on AWS is a library for use with the AWS Cloud Development Kit that helps you define your render farm cloud infrastructure as code.
https://docs.aws.amazon.com/rfdk/index.html
Apache License 2.0

Worker configuration doesn't work with Deadline 10.1.11 #237

Closed horsmand closed 3 years ago

horsmand commented 3 years ago

The RFDK integration tests fail to assign workers to pools and groups in the worker fleet tests when run against Deadline 10.1.11.5. In this Deadline release, the worker AMIs were changed so that workers no longer auto-start, which is the likely reason the RFDK group/pool setup code isn't working.

Reproduction Steps

I saw these failures on my branch (up to date with 5f2ce7f796b0425b966c3ba4050a13ff618132a4 and not missing any major code changes), where I'm enabling the integration tests to run from a CodeBuild project. I've split the worker fleet tests into two groups, workers with HTTP connections and workers with HTTPS connections, so these are only the results for the HTTP tests. The same tests passed with Deadline 10.1.10.6, so I have reason to believe this is a compatibility issue with 10.1.11.5 and that we will see the same failures on the release or mainline branches. I haven't modified the tests or their setup at all from those in mainline.

Attempt a run of the integration tests with the following configuration in integ/test-config.sh:

export USER_ACCEPTS_SSPL_FOR_RFDK_TESTS=true
export DEADLINE_VERSION='10.1.11.5'
export SKIP_deadline_01_repository_TEST=true
export SKIP_deadline_02_renderQueue_TEST=true

Error Log

Here are my test results, with some output from the first failing tests (WF-1-2 and WF-1-3). The other failures looked similar.

Deadline WorkerFleet tests (Linux Worker HTTP mode)
Worker node tests
✓ WF-1-1: Workers can be attached to the Render Queue (3221 ms)
✕ WF-1-2: Workers can be added to groups, pools and regions (7439 ms)
✕ WF-1-3: Workers can be assigned jobs submitted to a group (23670 ms)
✕ WF-1-4: Workers can be assigned jobs submitted to a pool (21612 ms)
Deadline WorkerFleet tests (Windows Worker HTTP mode)
Worker node tests
✓ WF-2-1: Workers can be attached to the Render Queue (3213 ms)
✕ WF-2-2: Workers can be added to groups, pools and regions (4231 ms)
✕ WF-2-3: Workers can be assigned jobs submitted to a group (22657 ms)
✕ WF-2-4: Workers can be assigned jobs submitted to a pool (22609 ms)

● Deadline WorkerFleet tests (Linux Worker HTTP mode) › Worker node tests › WF-1-2: Workers can be added to groups, pools and regions

expect(received).toMatch(expected)

Expected pattern: /testpool\ntestgroup\ntestregion/
Received string:  "testpool
testgroup
none
"

124 |       return awaitSsmCommand(bastionId, params).then( response => {
125 |         var responseOutput = response.output;
> 126 |         expect(responseOutput).toMatch(/testpool\ntestgroup\ntestregion/);
|                                ^
127 |       });
128 |     });
129 |

at awaitSsmCommand_1.default.then.response (components/deadline/deadline_03_workerFleetHttp/test/deadline_03_workerFleetHttp.test.ts:126:32)

● Deadline WorkerFleet tests (Linux Worker HTTP mode) › Worker node tests › WF-1-3: Workers can be assigned jobs submitted to a group

thrown: Object {
"Name": "aws:runShellScript",
"Output": "
----------ERROR-------
failed to run commands: exit status 1",
"OutputS3BucketName": "",
"OutputS3KeyPrefix": "",
"OutputS3Region": "us-west-2",
"ResponseCode": 1,
"ResponseFinishDateTime": 2020-11-17T03:17:56.207Z,
"ResponseStartDateTime": 2020-11-17T03:17:34.875Z,
"StandardErrorUrl": "",
"StandardOutputUrl": "",
"Status": "Failed",
"StatusDetails": "Failed",
}

134 |
135 |     // eslint-disable-next-line @typescript-eslint/no-shadow
> 136 |     test.each(setConfigs)(`WF-${id}-%i: Workers can be assigned jobs submitted to a %s`, async (_, name, arg) => {
|                          ^
137 |       /**********************************************************************************************************
138 |        * TestID:          WF-3, WF-4
139 |        * Description:     Confirm that jobs sent to a specified group/pool/region are routed to a worker in that set

at new Spec (../node_modules/jest-jasmine2/build/jasmine/Spec.js:116:22)
at Array.forEach (<anonymous>)
at Suite.describe (components/deadline/deadline_03_workerFleetHttp/test/deadline_03_workerFleetHttp.test.ts:136:26)
at Object.<anonymous>.describe.each (components/deadline/deadline_03_workerFleetHttp/test/deadline_03_workerFleetHttp.test.ts:58:3)
at Array.forEach (<anonymous>)
at Object.<anonymous> (components/deadline/deadline_03_workerFleetHttp/test/deadline_03_workerFleetHttp.test.ts:57:25)


Test Suites: 1 failed, 1 total
Tests:       6 failed, 2 passed, 8 total
Snapshots:   0 total
Time:        124.712 s
Ran all test suites matching /deadline_03_workerFleetHttp.test/i.
Test results written to: .e2etemp/deadline_03_workerFleetHttp.json
error Command failed with exit code 1.

Below is the portion of a worker's cloud init log where configureWorker.sh gets run. I believe the line WORKER_NAMES=() shows an issue; there should have been a worker in this list and the lines after that are trying to apply the group testgroup to the empty list.

2020-11-16T20:45:37.136-06:00   + '[' -z '' ']'
2020-11-16T20:45:37.136-06:00   + echo 'INFO: WORKER_REGION not provided'
2020-11-16T20:45:37.136-06:00   INFO: WORKER_REGION not provided
2020-11-16T20:45:37.136-06:00   ++ hostname -s
2020-11-16T20:45:37.136-06:00   + WORKER_NAME_PREFIX=ip-10-0-193-155
2020-11-16T20:45:37.136-06:00   + WORKER_NAMES=()
2020-11-16T20:45:37.136-06:00   + shopt -s dotglob
2020-11-16T20:45:37.136-06:00   + for file in '/var/lib/Thinkbox/Deadline10/slaves/*'
2020-11-16T20:45:37.136-06:00   + file='*'
2020-11-16T20:45:37.136-06:00   + workerSuffix='*'
2020-11-16T20:45:37.136-06:00   + '[' -z '*' ']'
2020-11-16T20:45:37.136-06:00   + WORKER_NAMES+=("$WORKER_NAME_PREFIX"-$workerSuffix)
2020-11-16T20:45:37.136-06:00   + shopt -u dotglob
2020-11-16T20:45:37.136-06:00   + '[' 1 -gt 0 ']'
2020-11-16T20:45:37.136-06:00   + for group in '"${WORKER_GROUPS[@]}"'
2020-11-16T20:45:37.136-06:00   + existingGroups=($("$DEADLINE_COMMAND" -GetGroupNames))
2020-11-16T20:45:37.886-06:00   ++ /opt/Thinkbox/Deadline10/bin/deadlinecommand -GetGroupNames
2020-11-16T20:45:37.886-06:00   + [[ ! none =~ testgroup ]]
2020-11-16T20:45:38.887-06:00   + /opt/Thinkbox/Deadline10/bin/deadlinecommand -AddGroup testgroup
2020-11-16T20:45:38.887-06:00   Group testgroup added
2020-11-16T20:45:38.887-06:00   Successfully added group: testgroup
2020-11-16T20:45:38.887-06:00   ++ IFS=,
2020-11-16T20:45:38.887-06:00   ++ echo 'ip-10-0-193-155-*'
2020-11-16T20:45:38.887-06:00   ++ IFS=,
2020-11-16T20:45:38.887-06:00   ++ echo testgroup
2020-11-16T20:45:40.891-06:00   + /opt/Thinkbox/Deadline10/bin/deadlinecommand -SetGroupsForSlave 'ip-10-0-193-155-*' testgroup
2020-11-16T20:45:40.891-06:00   Set groups to testgroup
2020-11-16T20:45:40.891-06:00   + '[' 0 -gt 0 ']'
2020-11-16T20:45:40.891-06:00   + service --status-all
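In the trace above, the loop variable takes the literal value `*`. That is what bash does when a glob matches nothing and `nullglob` is off, so the script ends up with the single bogus name `ip-10-0-193-155-*`. Here is a minimal sketch of a guard against that. The helper name `list_worker_names` and its prefix-suffix naming are hypothetical illustrations based on the trace, not the actual configureWorker.sh code or the fix that was eventually merged.

```shell
#!/usr/bin/env bash
# Hypothetical helper (an assumption for illustration, not RFDK code).
# Prints one "<prefix>-<file name>" line per entry in the given directory,
# and prints nothing when the directory is empty or missing.
list_worker_names() {
  local dir="$1" prefix="$2"
  local names=() file restore

  # Remember the current glob options so they can be restored afterwards.
  restore="$(shopt -p nullglob dotglob || true)"

  # nullglob: an unmatched pattern expands to nothing instead of itself.
  # dotglob:  include hidden files, as the traced script does.
  shopt -s nullglob dotglob
  for file in "$dir"/*; do
    names+=("${prefix}-$(basename "$file")")
  done
  eval "$restore"

  if [ "${#names[@]}" -gt 0 ]; then
    printf '%s\n' "${names[@]}"
  fi
}
```

With this shape, a caller can treat empty output as "no worker has registered yet" rather than passing a name containing `*` to `deadlinecommand`.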

Environment

Other

Next steps are to confirm that the issue is caused by workers no longer starting automatically, then assess how to update the UserData script so that worker configuration handles this.
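One possible direction is for the UserData script to wait until a worker has registered before assigning groups and pools. This is only a sketch under two assumptions: that a started worker creates an entry under the `slaves` directory seen in the trace, and that something else has already started the worker. The function name `wait_for_worker_files` and its default timeout and interval are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper (an assumption for illustration, not RFDK code).
# Polls a directory until it contains at least one entry.
# Returns 0 once an entry exists, 1 if the timeout (in seconds) elapses first.
wait_for_worker_files() {
  local dir="$1" timeout="${2:-300}" interval="${3:-5}"
  local waited=0

  while true; do
    # ls -A lists hidden entries too and prints nothing for an empty directory.
    if [ -d "$dir" ] && [ -n "$(ls -A "$dir" 2>/dev/null)" ]; then
      return 0
    fi
    if [ "$waited" -ge "$timeout" ]; then
      return 1
    fi
    sleep "$interval"
    waited=$(( waited + interval ))
  done
}

# Example use in a UserData-style script (path taken from the trace above):
#   if ! wait_for_worker_files /var/lib/Thinkbox/Deadline10/slaves 300 5; then
#     echo "ERROR: no worker registered before the timeout" >&2
#     exit 1
#   fi
```

Failing loudly on timeout seems preferable to continuing, because the trace shows the current script reporting success even though no real worker was configured.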


This is a 🐛 Bug Report

jusiskin commented 3 years ago

This was resolved in #248