Closed keyonghan closed 3 years ago
The devices were rebooted a number of times after the python3 update. So probably nothing related to being rebooted. Possibly something timedout/changed since the update?
Rebooted mac-5, and its currently processing tasks, waiting to see result.
mac-5 still failing:
[W2021-08-19T12:13:36.607493-07:00 672 0 client.go:306] RPC failed transiently. Will retry in 16s {"error":"rpc error: code = Internal desc = prpc: when sending request: Post \"https://logs.chromium.org/prpc/logdog.Registration/RegisterPrefix\": local auth - RPC code 4: HTTP 400 from the token server:\nfailed to mint token for \"flutter-staging-builder@chops-service-accounts.iam.gserviceaccount.com\" - rpc error: code = NotFound desc = Requested entity was not found.\n", "host":"logs.chromium.org", "method":"RegisterPrefix", "service":"logdog.Registration", "sleepTime":"16s"}
[W2021-08-19T12:13:52.684278-07:00 672 0 auth.go:1399] Failed to mint a token: local auth - RPC code 4: HTTP 400 from the token server:
failed to mint token for "flutter-staging-builder@chops-service-accounts.iam.gserviceaccount.com" - rpc error: code = NotFound desc = Requested entity was not found.
{"key":"luci_ctx/8d21dcf72e39204ad5de41562e12b383c5bc5cae1d39d7617ec7dc4bb6b9faca", "scopes":"https://www.googleapis.com/auth/userinfo.email"}
[W2021-08-19T12:13:52.684637-07:00 672 0 client.go:350] RPC failed permanently: rpc error: code = Internal desc = prpc: when sending request: Post "https://logs.chromium.org/prpc/logdog.Registration/RegisterPrefix": local auth - RPC code 4: HTTP 400 from the token server:
failed to mint token for "flutter-staging-builder@chops-service-accounts.iam.gserviceaccount.com" - rpc error: code = NotFound desc = Requested entity was not found.
{"error":"rpc error: code = Internal desc = prpc: when sending request: Post \"https://logs.chromium.org/prpc/logdog.Registration/RegisterPrefix\": local auth - RPC code 4: HTTP 400 from the token server:\nfailed to mint token for \"flutter-staging-builder@chops-service-accounts.iam.gserviceaccount.com\" - rpc error: code = NotFound desc = Requested entity was not found.\n", "host":"logs.chromium.org", "method":"RegisterPrefix", "service":"logdog.Registration"}
[E2021-08-19T12:13:52.685467-07:00 672 0 output.go:129] Failed to register prefix with Coordinator service. {"error":"rpc error: code = Internal desc = prpc: when sending request: Post \"https://logs.chromium.org/prpc/logdog.Registration/RegisterPrefix\": local auth - RPC code 4: HTTP 400 from the token server:\nfailed to mint token for \"flutter-staging-builder@chops-service-accounts.iam.gserviceaccount.com\" - rpc error: code = NotFound desc = Requested entity was not found.\n"}
[E2021-08-19T12:13:52.685825-07:00 672 0 main.go:76] could not create logdog output: rpc error: code = Internal desc = prpc: when sending request: Post "https://logs.chromium.org/prpc/logdog.Registration/RegisterPrefix": local auth - RPC code 4: HTTP 400 from the token server:
failed to mint token for "flutter-staging-builder@chops-service-accounts.iam.gserviceaccount.com" - rpc error: code = NotFound desc = Requested entity was not found.
These errors seem to be due to changes being made to service accounts: (i.e. https://chrome-internal.googlesource.com/infradata/config/+/a1e67dc1f70e47bb6cc22cc2053b7668caa0af63)
Once those steps are completed, this should work correctly.
[ERROR ] The Salt Master has cached the public key for this node, this salt minion will wait for 10 seconds before attempting to re-authenticate
Minion failed to authenticate with the master, has the minion key been accepted?
The machine keys didn't match the keys in the salt server. Update the keys and rebooted the bot and they are now back online.
@yusufm any updates on the mac machines failing to run salt states?
The salt keys are correctly accepted:
salt-dev:/home/mohsinally# salt-key list
Accepted Keys:
...
flutter-devicelab-mac-4
flutter-devicelab-mac-5
I'm going to dig into the logs before trying to restart the bots (which usually ends up fixing this issue). Potentially its related to a certificate expiry issue, which is causing the bots to not be able to phone home.
Looks like the handshake issue (mac-4):
98 2021-08-23 18:51:33.264 W: Unable to open url https://chromium-swarm.appspot.com/swarming/api/v1/bot/handshake on attempt 0: ('Connection aborted.', ConnectionResetError(54, 'Connection reset by peer'))
98 2021-08-23 18:51:34.915 W: Retrying request https://chromium-swarm.appspot.com/swarming/api/v1/bot/handshake, attempt 1/30...
98 2021-08-23 18:51:34.917 D: Starting new HTTPS connection (912): chromium-swarm.appspot.com:443
98 2021-08-23 18:51:35.261 D: https://chromium-swarm.appspot.com:443 "POST /swarming/api/v1/bot/handshake HTTP/1.1" 403 295
98 2021-08-23 18:51:35.262 W: Got a reply with HTTP status code 403 for https://chromium-swarm.appspot.com/swarming/api/v1/bot/handshake on attempt 1: 403 Client Error: Forbidden for url: https://chromium-swarm.appspot.com/swarming/api/v1/bot/handshake
98 2021-08-23 18:51:35.262 E: Request to https://chromium-swarm.appspot.com/swarming/api/v1/bot/handshake failed with HTTP status code 403: 403 Client Error: Forbidden for url: https://chromium-swarm.appspot.com/swarming/api/v1/bot/handshake
98 2021-08-23 18:51:35.263 E: Failed to contact for handshake, retrying in 300 sec...
The handshake issue is because the salt updates are failing before the all the services are created/started. What do you get when you run salt manually on the machine?
As per offline conversation, it looks like the hostname that is being reported when sending in the auth token is incorrect. This is possibly because the swarming bot caches the hostname right at startup, before dhcp has a chance to update it. Trying to set a hostname on the mac with these commands:
flutter@flutter-devicelab-mac-4 ~ % sudo scutil --set HostName flutter-devicelab-mac-4
flutter@flutter-devicelab-mac-4 ~ % sudo scutil --set LocalHostName flutter-devicelab-mac-4
flutter@flutter-devicelab-mac-4 ~ % sudo scutil --set ComputerName "flutter-devicelab-mac-4"
mac-4 has been running successfully so far, since running the scutil commands above.
Making the same changes on mac-5 and rebooting it.
Will monitor the hosts for another couple days to see if this resolves the problem.
TODO: on success, add commands to mac provision scripts.
Updating the fqdn.sh script to not update if the fqdn is already as expected. If the problem we are seeing is due to some intermediate bad state while the hostname is being set at restart, this change should help.
The fqdn CL to only update the fqdn if necessary did not seem to work when manually forcing a reboot of the host.
Requesting logs to see if the same error is showing up.
Restarting the swarming bot fixes the issue:
sudo launchctl unload /Library/LaunchDaemons/org.swarm.bot.plist && sudo launchctl load -w /Library/LaunchDaemons/org.swarm.bot.plist
Mac4 is missing again.
The reboots are temporary, they seem to go bad on next reboot. We need the logs from the server side to determine if they are still reporting hostname as "1", which would be causing the auth to fail.
Two salt states are failing:
2021-09-08 12:26:47,956 [salt.state :321 ][ERROR ][51487] An exception occurred in this state: Traceback (most recent call last):
File "/opt/salt/lib/python3.7/site-packages/salt-3002.1-py3.7.egg/salt/state.py", line 2154, in call
*cdata["args"], **cdata["kwargs"]
File "/opt/salt/lib/python3.7/site-packages/salt-3002.1-py3.7.egg/salt/loader.py", line 2188, in wrapper
return f(*args, **kwargs)
File "/opt/salt/lib/python3.7/site-packages/salt-3002.1-py3.7.egg/salt/states/file.py", line 3218, in managed
tmp_filename = salt.utils.files.mkstemp(suffix=tmp_ext, dir=tmp_dir)
File "/opt/salt/lib/python3.7/site-packages/salt-3002.1-py3.7.egg/salt/utils/files.py", line 106, in mkstemp
fd_, f_path = tempfile.mkstemp(*args, **kwargs)
File "/opt/salt/lib/python3.7/tempfile.py", line 340, in mkstemp
return _mkstemp_inner(dir, prefix, suffix, flags, output_type)
File "/opt/salt/lib/python3.7/tempfile.py", line 258, in _mkstemp_inner
fd = _os.open(file, flags, 0o600)
OSError: [Errno 30] Read-only file system: '__salt.tmp.kiayaqaa'
2021-09-08 12:26:49,434 [salt.state :321 ][ERROR ][51487] State 'pkg.installed' was not found in SLS 'machine_tokend'
Reason: 'pkg' __virtual__ returned False: pkg module could not be loaded
2021-09-08 12:26:49,625 [salt.loaded_4548121040.int.module.cmdmod:845 ][ERROR ][51487] Command '['/usr/local/bin/cipd', 'puppet-check-updates', '-root', '/opt/luci/machine_tokend/', '-ensure-file', '/tmp/cipd-clr22uch']' failed with return code: 5
2021-09-08 12:26:49,626 [salt.loaded_4548121040.int.module.cmdmod:851 ][ERROR ][51487] retcode: 5
2021-09-08 12:26:52,522 [salt.loaded_4548121040.int.module.cmdmod:845 ][ERROR ][51487] Command '['/usr/local/bin/cipd', 'puppet-check-updates', '-root', '/opt/flutter/device_doctor', '-ensure-file', '/tmp/cipd-j4wt2ura']' failed with return code: 5
2021-09-08 12:26:52,523 [salt.loaded_4548121040.int.module.cmdmod:851 ][ERROR ][51487] retcode: 5
2021-09-08 12:26:52,983 [salt.loaded_4548121040.int.module.cmdmod:845 ][ERROR ][51487] Command '['/usr/local/bin/cipd', 'puppet-check-updates', '-root', '/opt/flutter/device_doctor_staging', '-ensure-file', '/tmp/cipd-pkg7_rqs']' failed with return code: 5
2021-09-08 12:26:52,984 [salt.loaded_4548121040.int.module.cmdmod:851 ][ERROR ][51487] retcode: 5
2021-09-08 12:26:57,457 [salt.beacons :141 ][WARNING ][553] Unable to process beacon wtmp
2021-09-08 12:26:57,777 [salt.beacons :141 ][WARNING ][553] Unable to process beacon wtmp
2021-09-08 12:26:57,979 [salt.state :321 ][ERROR ][51487] State 'pkg.installed' was not found in SLS 'git/package'
Reason: 'pkg' __virtual__ returned False: pkg module could not be loaded
2021-09-08 12:26:49,434 [salt.state :321 ][ERROR ][51487] State 'pkg.installed' was not found in SLS 'machine_tokend'
Reason: 'pkg' __virtual__ returned False: pkg module could not be loaded
2021-09-08 12:26:49,625 [salt.loaded_4548121040.int.module.cmdmod:845 ][ERROR ][51487] Command '['/usr/local/bin/cipd', 'puppet-check-updates', '-root', '/opt/luci/machine_tokend/', '-ensure-file', '/tmp/cipd-clr22uch']' failed with return code: 5
2021-09-08 12:26:49,626 [salt.loaded_4548121040.int.module.cmdmod:851 ][ERROR ][51487] retcode: 5
2021-09-08 12:26:52,522 [salt.loaded_4548121040.int.module.cmdmod:845 ][ERROR ][51487] Command '['/usr/local/bin/cipd', 'puppet-check-updates', '-root', '/opt/flutter/device_doctor', '-ensure-file', '/tmp/cipd-j4wt2ura']' failed with return code: 5
2021-09-08 12:26:52,523 [salt.loaded_4548121040.int.module.cmdmod:851 ][ERROR ][51487] retcode: 5
2021-09-08 12:26:52,983 [salt.loaded_4548121040.int.module.cmdmod:845 ][ERROR ][51487] Command '['/usr/local/bin/cipd', 'puppet-check-updates', '-root', '/opt/flutter/device_doctor_staging', '-ensure-file', '/tmp/cipd-pkg7_rqs']' failed with return code: 5
2021-09-08 12:26:52,984 [salt.loaded_4548121040.int.module.cmdmod:851 ][ERROR ][51487] retcode: 5
2021-09-08 12:26:57,457 [salt.beacons :141 ][WARNING ][553] Unable to process beacon wtmp
2021-09-08 12:26:57,777 [salt.beacons :141 ][WARNING ][553] Unable to process beacon wtmp
2021-09-08 12:26:57,979 [salt.state :321 ][ERROR ][51487] State 'pkg.installed' was not found in SLS 'git/package'
Reason: 'pkg' __virtual__ returned False: pkg module could not be loaded
And
2021-09-09 02:09:17,314 [tornado.application:356 ][ERROR ][553] Future <salt.ext.tornado.concurrent.Future object at 0x10dc3a990> exception was never retrieved: Traceback (most recent call last):
File "/opt/salt/lib/python3.7/site-packages/salt-3002.1-py3.7.egg/salt/transport/mixins/auth.py", line 70, in _decode_payload
payload["load"] = self.auth.crypticle.loads(payload["load"])
File "/opt/salt/lib/python3.7/site-packages/salt-3002.1-py3.7.egg/salt/crypt.py", line 1541, in loads
data = self.decrypt(data)
File "/opt/salt/lib/python3.7/site-packages/salt-3002.1-py3.7.egg/salt/crypt.py", line 1517, in decrypt
raise AuthenticationError("message authentication failed")
salt.exceptions.AuthenticationError: message authentication failed
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/opt/salt/lib/python3.7/site-packages/salt-3002.1-py3.7.egg/salt/ext/tornado/gen.py", line 1064, in run
yielded = self.gen.throw(*exc_info)
File "/opt/salt/lib/python3.7/site-packages/salt-3002.1-py3.7.egg/salt/transport/zeromq.py", line 661, in wrap_callback
payload = yield self._decode_messages(messages)
File "/opt/salt/lib/python3.7/site-packages/salt-3002.1-py3.7.egg/salt/ext/tornado/gen.py", line 1056, in run
value = future.result()
File "/opt/salt/lib/python3.7/site-packages/salt-3002.1-py3.7.egg/salt/ext/tornado/concurrent.py", line 249, in result
raise_exc_info(self._exc_info)
File "<string>", line 4, in raise_exc_info
File "/opt/salt/lib/python3.7/site-packages/salt-3002.1-py3.7.egg/salt/ext/tornado/gen.py", line 1064, in run
yielded = self.gen.throw(*exc_info)
File "/opt/salt/lib/python3.7/site-packages/salt-3002.1-py3.7.egg/salt/transport/zeromq.py", line 636, in _decode_messages
ret = yield self._decode_payload(payload)
File "/opt/salt/lib/python3.7/site-packages/salt-3002.1-py3.7.egg/salt/ext/tornado/gen.py", line 1056, in run
value = future.result()
File "/opt/salt/lib/python3.7/site-packages/salt-3002.1-py3.7.egg/salt/ext/tornado/concurrent.py", line 249, in result
raise_exc_info(self._exc_info)
File "<string>", line 4, in raise_exc_info
File "/opt/salt/lib/python3.7/site-packages/salt-3002.1-py3.7.egg/salt/ext/tornado/gen.py", line 1070, in run
yielded = self.gen.send(value)
File "/opt/salt/lib/python3.7/site-packages/salt-3002.1-py3.7.egg/salt/transport/mixins/auth.py", line 73, in _decode_payload
payload["load"] = self.auth.crypticle.loads(payload["load"])
File "/opt/salt/lib/python3.7/site-packages/salt-3002.1-py3.7.egg/salt/crypt.py", line 1541, in loads
data = self.decrypt(data)
File "/opt/salt/lib/python3.7/site-packages/salt-3002.1-py3.7.egg/salt/crypt.py", line 1517, in decrypt
raise AuthenticationError("message authentication failed")
salt.exceptions.AuthenticationError: message authentication failed
2021-09-09 02:09:17,316 [tornado.application:356 ][ERROR ][553] Future <salt.ext.tornado.concurrent.Future object at 0x10cd825d0> exception was never retrieved: Traceback (most recent call last):
File "/opt/salt/lib/python3.7/site-packages/salt-3002.1-py3.7.egg/salt/transport/mixins/auth.py", line 70, in _decode_payload
payload["load"] = self.auth.crypticle.loads(payload["load"])
File "/opt/salt/lib/python3.7/site-packages/salt-3002.1-py3.7.egg/salt/crypt.py", line 1541, in loads
data = self.decrypt(data)
File "/opt/salt/lib/python3.7/site-packages/salt-3002.1-py3.7.egg/salt/crypt.py", line 1517, in decrypt
raise AuthenticationError("message authentication failed")
salt.exceptions.AuthenticationError: message authentication failed
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/opt/salt/lib/python3.7/site-packages/salt-3002.1-py3.7.egg/salt/ext/tornado/gen.py", line 1064, in run
yielded = self.gen.throw(*exc_info)
File "/opt/salt/lib/python3.7/site-packages/salt-3002.1-py3.7.egg/salt/transport/zeromq.py", line 661, in wrap_callback
payload = yield self._decode_messages(messages)
File "/opt/salt/lib/python3.7/site-packages/salt-3002.1-py3.7.egg/salt/ext/tornado/gen.py", line 1056, in run
value = future.result()
File "/opt/salt/lib/python3.7/site-packages/salt-3002.1-py3.7.egg/salt/ext/tornado/concurrent.py", line 249, in result
raise_exc_info(self._exc_info)
File "<string>", line 4, in raise_exc_info
File "/opt/salt/lib/python3.7/site-packages/salt-3002.1-py3.7.egg/salt/ext/tornado/gen.py", line 1064, in run
yielded = self.gen.throw(*exc_info)
File "/opt/salt/lib/python3.7/site-packages/salt-3002.1-py3.7.egg/salt/transport/zeromq.py", line 636, in _decode_messages
ret = yield self._decode_payload(payload)
File "/opt/salt/lib/python3.7/site-packages/salt-3002.1-py3.7.egg/salt/ext/tornado/gen.py", line 1056, in run
value = future.result()
File "/opt/salt/lib/python3.7/site-packages/salt-3002.1-py3.7.egg/salt/ext/tornado/concurrent.py", line 249, in result
raise_exc_info(self._exc_info)
File "<string>", line 4, in raise_exc_info
File "/opt/salt/lib/python3.7/site-packages/salt-3002.1-py3.7.egg/salt/ext/tornado/gen.py", line 1070, in run
yielded = self.gen.send(value)
File "/opt/salt/lib/python3.7/site-packages/salt-3002.1-py3.7.egg/salt/transport/mixins/auth.py", line 73, in _decode_payload
payload["load"] = self.auth.crypticle.loads(payload["load"])
File "/opt/salt/lib/python3.7/site-packages/salt-3002.1-py3.7.egg/salt/crypt.py", line 1541, in loads
data = self.decrypt(data)
File "/opt/salt/lib/python3.7/site-packages/salt-3002.1-py3.7.egg/salt/crypt.py", line 1517, in decrypt
raise AuthenticationError("message authentication failed")
salt.exceptions.AuthenticationError: message authentication failed
Added a CL to try to fix this but is not working as expected.
Landed a change to fix the error when installing the git.packages state. It turns out that Mac M1s require the brew command to execute with an architecture parameter.
Bots have been up and running for the last 2 days. Closing this bug for now, please reopen if they fail again.
This thread has been automatically locked since there has not been any recent activity after it was closed. If you are still experiencing a similar issue, please open a new bug, including the output of flutter doctor -v
and a minimal reproduction of the issue.
Both mac4 and mac5 failed to come back online after task fails and reboot. Not sure if this is related to recent python3 migration.
/cc @yusufm
From swarming_bot.log: