flutter / flutter

Flutter makes it easy and fast to build beautiful apps for mobile and beyond
https://flutter.dev
BSD 3-Clause "New" or "Revised" License
165.63k stars 27.35k forks source link

mac4 and mac5 failed to contact for handshake #88541

Closed keyonghan closed 3 years ago

keyonghan commented 3 years ago

Both mac4 and mac5 failed to come back online after task fails and reboot. Not sure if this is related to recent python3 migration.

/cc @yusufm

Screen Shot 2021-08-19 at 11 43 57 AM

From swarming_bot.log:

98 2021-08-19 18:34:07.915 E: Failed to contact for handshake, retrying in 300 sec...
98 2021-08-19 18:39:07.938 I: Calling /swarming/api/v1/bot/handshake
98 2021-08-19 18:39:07.939 D: GOOGAPPUID = sha1(2021-08-19-1:/opt/s) % 1000 = 78
98 2021-08-19 18:39:07.943 W: Unable to open url https://chromium-swarm.appspot.com/swarming/api/v1/bot/handshake on attempt 0: ('Connection aborted.', ConnectionResetError(54, 'Connection reset by peer'))
98 2021-08-19 18:39:09.357 W: Retrying request https://chromium-swarm.appspot.com/swarming/api/v1/bot/handshake, attempt 1/30...
98 2021-08-19 18:39:09.360 D: Starting new HTTPS connection (214): chromium-swarm.appspot.com:443
98 2021-08-19 18:39:09.500 D: https://chromium-swarm.appspot.com:443 "POST /swarming/api/v1/bot/handshake HTTP/1.1" 403 295
98 2021-08-19 18:39:09.501 W: Got a reply with HTTP status code 403 for https://chromium-swarm.appspot.com/swarming/api/v1/bot/handshake on attempt 1: 403 Client Error: Forbidden for url: https://chromium-swarm.appspot.com/swarming/api/v1/bot/handshake
98 2021-08-19 18:39:09.502 E: Request to https://chromium-swarm.appspot.com/swarming/api/v1/bot/handshake failed with HTTP status code 403: 403 Client Error: Forbidden for url: https://chromium-swarm.appspot.com/swarming/api/v1/bot/handshake
98 2021-08-19 18:39:09.502 E: Failed to contact for handshake, retrying in 300 sec...
~                                                                                      
yusufm commented 3 years ago

The devices were rebooted a number of times after the python3 update. So probably nothing related to being rebooted. Possibly something timedout/changed since the update?

yusufm commented 3 years ago

Rebooted mac-5, and its currently processing tasks, waiting to see result.

yusufm commented 3 years ago

mac-5 still failing:

[W2021-08-19T12:13:36.607493-07:00 672 0 client.go:306] RPC failed transiently. Will retry in 16s   {"error":"rpc error: code = Internal desc = prpc: when sending request: Post \"https://logs.chromium.org/prpc/logdog.Registration/RegisterPrefix\": local auth - RPC code 4: HTTP 400 from the token server:\nfailed to mint token for \"flutter-staging-builder@chops-service-accounts.iam.gserviceaccount.com\" - rpc error: code = NotFound desc = Requested entity was not found.\n", "host":"logs.chromium.org", "method":"RegisterPrefix", "service":"logdog.Registration", "sleepTime":"16s"}
[W2021-08-19T12:13:52.684278-07:00 672 0 auth.go:1399] Failed to mint a token: local auth - RPC code 4: HTTP 400 from the token server:
failed to mint token for "flutter-staging-builder@chops-service-accounts.iam.gserviceaccount.com" - rpc error: code = NotFound desc = Requested entity was not found.
 {"key":"luci_ctx/8d21dcf72e39204ad5de41562e12b383c5bc5cae1d39d7617ec7dc4bb6b9faca", "scopes":"https://www.googleapis.com/auth/userinfo.email"}
[W2021-08-19T12:13:52.684637-07:00 672 0 client.go:350] RPC failed permanently: rpc error: code = Internal desc = prpc: when sending request: Post "https://logs.chromium.org/prpc/logdog.Registration/RegisterPrefix": local auth - RPC code 4: HTTP 400 from the token server:
failed to mint token for "flutter-staging-builder@chops-service-accounts.iam.gserviceaccount.com" - rpc error: code = NotFound desc = Requested entity was not found.
 {"error":"rpc error: code = Internal desc = prpc: when sending request: Post \"https://logs.chromium.org/prpc/logdog.Registration/RegisterPrefix\": local auth - RPC code 4: HTTP 400 from the token server:\nfailed to mint token for \"flutter-staging-builder@chops-service-accounts.iam.gserviceaccount.com\" - rpc error: code = NotFound desc = Requested entity was not found.\n", "host":"logs.chromium.org", "method":"RegisterPrefix", "service":"logdog.Registration"}
[E2021-08-19T12:13:52.685467-07:00 672 0 output.go:129] Failed to register prefix with Coordinator service. {"error":"rpc error: code = Internal desc = prpc: when sending request: Post \"https://logs.chromium.org/prpc/logdog.Registration/RegisterPrefix\": local auth - RPC code 4: HTTP 400 from the token server:\nfailed to mint token for \"flutter-staging-builder@chops-service-accounts.iam.gserviceaccount.com\" - rpc error: code = NotFound desc = Requested entity was not found.\n"}
[E2021-08-19T12:13:52.685825-07:00 672 0 main.go:76] could not create logdog output: rpc error: code = Internal desc = prpc: when sending request: Post "https://logs.chromium.org/prpc/logdog.Registration/RegisterPrefix": local auth - RPC code 4: HTTP 400 from the token server:
failed to mint token for "flutter-staging-builder@chops-service-accounts.iam.gserviceaccount.com" - rpc error: code = NotFound desc = Requested entity was not found.
yusufm commented 3 years ago

These errors seem to be due to changes being made to service accounts: (i.e. https://chrome-internal.googlesource.com/infradata/config/+/a1e67dc1f70e47bb6cc22cc2053b7668caa0af63)

Once those steps are completed, this should work correctly.

godofredoc commented 3 years ago
[ERROR   ] The Salt Master has cached the public key for this node, this salt minion will wait for 10 seconds before attempting to re-authenticate
Minion failed to authenticate with the master, has the minion key been accepted?
godofredoc commented 3 years ago

The machine keys didn't match the keys in the salt server. Update the keys and rebooted the bot and they are now back online.

godofredoc commented 3 years ago

@yusufm any updates on the mac machines failing to run salt states?

yusufm commented 3 years ago

The salt keys are correctly accepted:

salt-dev:/home/mohsinally# salt-key list 
Accepted Keys:
...
flutter-devicelab-mac-4
flutter-devicelab-mac-5

I'm going to dig into the logs before trying to restart the bots (which usually ends up fixing this issue). Potentially its related to a certificate expiry issue, which is causing the bots to not be able to phone home.

yusufm commented 3 years ago

Looks like the handshake issue (mac-4):

98 2021-08-23 18:51:33.264 W: Unable to open url https://chromium-swarm.appspot.com/swarming/api/v1/bot/handshake on attempt 0: ('Connection aborted.', ConnectionResetError(54, 'Connection reset by peer'))
98 2021-08-23 18:51:34.915 W: Retrying request https://chromium-swarm.appspot.com/swarming/api/v1/bot/handshake, attempt 1/30...
98 2021-08-23 18:51:34.917 D: Starting new HTTPS connection (912): chromium-swarm.appspot.com:443
98 2021-08-23 18:51:35.261 D: https://chromium-swarm.appspot.com:443 "POST /swarming/api/v1/bot/handshake HTTP/1.1" 403 295
98 2021-08-23 18:51:35.262 W: Got a reply with HTTP status code 403 for https://chromium-swarm.appspot.com/swarming/api/v1/bot/handshake on attempt 1: 403 Client Error: Forbidden for url: https://chromium-swarm.appspot.com/swarming/api/v1/bot/handshake
98 2021-08-23 18:51:35.262 E: Request to https://chromium-swarm.appspot.com/swarming/api/v1/bot/handshake failed with HTTP status code 403: 403 Client Error: Forbidden for url: https://chromium-swarm.appspot.com/swarming/api/v1/bot/handshake
98 2021-08-23 18:51:35.263 E: Failed to contact for handshake, retrying in 300 sec...
godofredoc commented 3 years ago

The handshake issue is because the salt updates are failing before the all the services are created/started. What do you get when you run salt manually on the machine?

yusufm commented 3 years ago

As per offline conversation, it looks like the hostname that is being reported when sending in the auth token is incorrect. This is possibly because the swarming bot caches the hostname right at startup, before dhcp has a chance to update it. Trying to set a hostname on the mac with these commands:

flutter@flutter-devicelab-mac-4 ~ % sudo scutil --set HostName flutter-devicelab-mac-4
flutter@flutter-devicelab-mac-4 ~ % sudo scutil --set LocalHostName flutter-devicelab-mac-4
flutter@flutter-devicelab-mac-4 ~ % sudo scutil --set ComputerName "flutter-devicelab-mac-4" 
yusufm commented 3 years ago

mac-4 has been running successfully so far, since running the scutil commands above.

Making the same changes on mac-5 and rebooting it.

Will monitor the hosts for another couple days to see if this resolves the problem.

yusufm commented 3 years ago

TODO: on success, add commands to mac provision scripts.

yusufm commented 3 years ago

Updating the fqdn.sh script to not update if the fqdn is already as expected. If the problem we are seeing is due to some intermediate bad state while the hostname is being set at restart, this change should help.

yusufm commented 3 years ago

The fqdn CL to only update the fqdn if necessary did not seem to work when manually forcing a reboot of the host.

Requesting logs to see if the same error is showing up.

godofredoc commented 3 years ago

Restarting the swarming bot fixes the issue:

sudo launchctl unload /Library/LaunchDaemons/org.swarm.bot.plist && sudo launchctl load -w /Library/LaunchDaemons/org.swarm.bot.plist

keyonghan commented 3 years ago

Mac4 is missing again.

Screen Shot 2021-09-07 at 10 03 57 AM
yusufm commented 3 years ago

The reboots are temporary, they seem to go bad on next reboot. We need the logs from the server side to determine if they are still reporting hostname as "1", which would be causing the auth to fail.

godofredoc commented 3 years ago

Two salt states are failing:

2021-09-08 12:26:47,956 [salt.state       :321 ][ERROR   ][51487] An exception occurred in this state: Traceback (most recent call last):
  File "/opt/salt/lib/python3.7/site-packages/salt-3002.1-py3.7.egg/salt/state.py", line 2154, in call
    *cdata["args"], **cdata["kwargs"]
  File "/opt/salt/lib/python3.7/site-packages/salt-3002.1-py3.7.egg/salt/loader.py", line 2188, in wrapper
    return f(*args, **kwargs)
  File "/opt/salt/lib/python3.7/site-packages/salt-3002.1-py3.7.egg/salt/states/file.py", line 3218, in managed
    tmp_filename = salt.utils.files.mkstemp(suffix=tmp_ext, dir=tmp_dir)
  File "/opt/salt/lib/python3.7/site-packages/salt-3002.1-py3.7.egg/salt/utils/files.py", line 106, in mkstemp
    fd_, f_path = tempfile.mkstemp(*args, **kwargs)
  File "/opt/salt/lib/python3.7/tempfile.py", line 340, in mkstemp
    return _mkstemp_inner(dir, prefix, suffix, flags, output_type)
  File "/opt/salt/lib/python3.7/tempfile.py", line 258, in _mkstemp_inner
    fd = _os.open(file, flags, 0o600)
OSError: [Errno 30] Read-only file system: '__salt.tmp.kiayaqaa'

2021-09-08 12:26:49,434 [salt.state       :321 ][ERROR   ][51487] State 'pkg.installed' was not found in SLS 'machine_tokend'
Reason: 'pkg' __virtual__ returned False: pkg module could not be loaded

2021-09-08 12:26:49,625 [salt.loaded_4548121040.int.module.cmdmod:845 ][ERROR   ][51487] Command '['/usr/local/bin/cipd', 'puppet-check-updates', '-root', '/opt/luci/machine_tokend/', '-ensure-file', '/tmp/cipd-clr22uch']' failed with return code: 5
2021-09-08 12:26:49,626 [salt.loaded_4548121040.int.module.cmdmod:851 ][ERROR   ][51487] retcode: 5
2021-09-08 12:26:52,522 [salt.loaded_4548121040.int.module.cmdmod:845 ][ERROR   ][51487] Command '['/usr/local/bin/cipd', 'puppet-check-updates', '-root', '/opt/flutter/device_doctor', '-ensure-file', '/tmp/cipd-j4wt2ura']' failed with return code: 5
2021-09-08 12:26:52,523 [salt.loaded_4548121040.int.module.cmdmod:851 ][ERROR   ][51487] retcode: 5
2021-09-08 12:26:52,983 [salt.loaded_4548121040.int.module.cmdmod:845 ][ERROR   ][51487] Command '['/usr/local/bin/cipd', 'puppet-check-updates', '-root', '/opt/flutter/device_doctor_staging', '-ensure-file', '/tmp/cipd-pkg7_rqs']' failed with return code: 5
2021-09-08 12:26:52,984 [salt.loaded_4548121040.int.module.cmdmod:851 ][ERROR   ][51487] retcode: 5
2021-09-08 12:26:57,457 [salt.beacons     :141 ][WARNING ][553] Unable to process beacon wtmp
2021-09-08 12:26:57,777 [salt.beacons     :141 ][WARNING ][553] Unable to process beacon wtmp
2021-09-08 12:26:57,979 [salt.state       :321 ][ERROR   ][51487] State 'pkg.installed' was not found in SLS 'git/package'
Reason: 'pkg' __virtual__ returned False: pkg module could not be loaded

2021-09-08 12:26:49,434 [salt.state       :321 ][ERROR   ][51487] State 'pkg.installed' was not found in SLS 'machine_tokend'
Reason: 'pkg' __virtual__ returned False: pkg module could not be loaded

2021-09-08 12:26:49,625 [salt.loaded_4548121040.int.module.cmdmod:845 ][ERROR   ][51487] Command '['/usr/local/bin/cipd', 'puppet-check-updates', '-root', '/opt/luci/machine_tokend/', '-ensure-file', '/tmp/cipd-clr22uch']' failed with return code: 5
2021-09-08 12:26:49,626 [salt.loaded_4548121040.int.module.cmdmod:851 ][ERROR   ][51487] retcode: 5
2021-09-08 12:26:52,522 [salt.loaded_4548121040.int.module.cmdmod:845 ][ERROR   ][51487] Command '['/usr/local/bin/cipd', 'puppet-check-updates', '-root', '/opt/flutter/device_doctor', '-ensure-file', '/tmp/cipd-j4wt2ura']' failed with return code: 5
2021-09-08 12:26:52,523 [salt.loaded_4548121040.int.module.cmdmod:851 ][ERROR   ][51487] retcode: 5
2021-09-08 12:26:52,983 [salt.loaded_4548121040.int.module.cmdmod:845 ][ERROR   ][51487] Command '['/usr/local/bin/cipd', 'puppet-check-updates', '-root', '/opt/flutter/device_doctor_staging', '-ensure-file', '/tmp/cipd-pkg7_rqs']' failed with return code: 5
2021-09-08 12:26:52,984 [salt.loaded_4548121040.int.module.cmdmod:851 ][ERROR   ][51487] retcode: 5
2021-09-08 12:26:57,457 [salt.beacons     :141 ][WARNING ][553] Unable to process beacon wtmp
2021-09-08 12:26:57,777 [salt.beacons     :141 ][WARNING ][553] Unable to process beacon wtmp
2021-09-08 12:26:57,979 [salt.state       :321 ][ERROR   ][51487] State 'pkg.installed' was not found in SLS 'git/package'
Reason: 'pkg' __virtual__ returned False: pkg module could not be loaded

And

2021-09-09 02:09:17,314 [tornado.application:356 ][ERROR   ][553] Future <salt.ext.tornado.concurrent.Future object at 0x10dc3a990> exception was never retrieved: Traceback (most recent call last):
  File "/opt/salt/lib/python3.7/site-packages/salt-3002.1-py3.7.egg/salt/transport/mixins/auth.py", line 70, in _decode_payload
    payload["load"] = self.auth.crypticle.loads(payload["load"])
  File "/opt/salt/lib/python3.7/site-packages/salt-3002.1-py3.7.egg/salt/crypt.py", line 1541, in loads
    data = self.decrypt(data)
  File "/opt/salt/lib/python3.7/site-packages/salt-3002.1-py3.7.egg/salt/crypt.py", line 1517, in decrypt
    raise AuthenticationError("message authentication failed")
salt.exceptions.AuthenticationError: message authentication failed

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/salt/lib/python3.7/site-packages/salt-3002.1-py3.7.egg/salt/ext/tornado/gen.py", line 1064, in run
    yielded = self.gen.throw(*exc_info)
  File "/opt/salt/lib/python3.7/site-packages/salt-3002.1-py3.7.egg/salt/transport/zeromq.py", line 661, in wrap_callback
    payload = yield self._decode_messages(messages)
  File "/opt/salt/lib/python3.7/site-packages/salt-3002.1-py3.7.egg/salt/ext/tornado/gen.py", line 1056, in run
    value = future.result()
  File "/opt/salt/lib/python3.7/site-packages/salt-3002.1-py3.7.egg/salt/ext/tornado/concurrent.py", line 249, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 4, in raise_exc_info
  File "/opt/salt/lib/python3.7/site-packages/salt-3002.1-py3.7.egg/salt/ext/tornado/gen.py", line 1064, in run
    yielded = self.gen.throw(*exc_info)
  File "/opt/salt/lib/python3.7/site-packages/salt-3002.1-py3.7.egg/salt/transport/zeromq.py", line 636, in _decode_messages
    ret = yield self._decode_payload(payload)
  File "/opt/salt/lib/python3.7/site-packages/salt-3002.1-py3.7.egg/salt/ext/tornado/gen.py", line 1056, in run
    value = future.result()
  File "/opt/salt/lib/python3.7/site-packages/salt-3002.1-py3.7.egg/salt/ext/tornado/concurrent.py", line 249, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 4, in raise_exc_info
  File "/opt/salt/lib/python3.7/site-packages/salt-3002.1-py3.7.egg/salt/ext/tornado/gen.py", line 1070, in run
    yielded = self.gen.send(value)
  File "/opt/salt/lib/python3.7/site-packages/salt-3002.1-py3.7.egg/salt/transport/mixins/auth.py", line 73, in _decode_payload
    payload["load"] = self.auth.crypticle.loads(payload["load"])
  File "/opt/salt/lib/python3.7/site-packages/salt-3002.1-py3.7.egg/salt/crypt.py", line 1541, in loads
    data = self.decrypt(data)
  File "/opt/salt/lib/python3.7/site-packages/salt-3002.1-py3.7.egg/salt/crypt.py", line 1517, in decrypt
    raise AuthenticationError("message authentication failed")
salt.exceptions.AuthenticationError: message authentication failed
2021-09-09 02:09:17,316 [tornado.application:356 ][ERROR   ][553] Future <salt.ext.tornado.concurrent.Future object at 0x10cd825d0> exception was never retrieved: Traceback (most recent call last):
  File "/opt/salt/lib/python3.7/site-packages/salt-3002.1-py3.7.egg/salt/transport/mixins/auth.py", line 70, in _decode_payload
    payload["load"] = self.auth.crypticle.loads(payload["load"])
  File "/opt/salt/lib/python3.7/site-packages/salt-3002.1-py3.7.egg/salt/crypt.py", line 1541, in loads
    data = self.decrypt(data)
  File "/opt/salt/lib/python3.7/site-packages/salt-3002.1-py3.7.egg/salt/crypt.py", line 1517, in decrypt
    raise AuthenticationError("message authentication failed")
salt.exceptions.AuthenticationError: message authentication failed

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/salt/lib/python3.7/site-packages/salt-3002.1-py3.7.egg/salt/ext/tornado/gen.py", line 1064, in run
    yielded = self.gen.throw(*exc_info)
  File "/opt/salt/lib/python3.7/site-packages/salt-3002.1-py3.7.egg/salt/transport/zeromq.py", line 661, in wrap_callback
    payload = yield self._decode_messages(messages)
  File "/opt/salt/lib/python3.7/site-packages/salt-3002.1-py3.7.egg/salt/ext/tornado/gen.py", line 1056, in run
    value = future.result()
  File "/opt/salt/lib/python3.7/site-packages/salt-3002.1-py3.7.egg/salt/ext/tornado/concurrent.py", line 249, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 4, in raise_exc_info
  File "/opt/salt/lib/python3.7/site-packages/salt-3002.1-py3.7.egg/salt/ext/tornado/gen.py", line 1064, in run
    yielded = self.gen.throw(*exc_info)
  File "/opt/salt/lib/python3.7/site-packages/salt-3002.1-py3.7.egg/salt/transport/zeromq.py", line 636, in _decode_messages
    ret = yield self._decode_payload(payload)
  File "/opt/salt/lib/python3.7/site-packages/salt-3002.1-py3.7.egg/salt/ext/tornado/gen.py", line 1056, in run
    value = future.result()
  File "/opt/salt/lib/python3.7/site-packages/salt-3002.1-py3.7.egg/salt/ext/tornado/concurrent.py", line 249, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 4, in raise_exc_info
  File "/opt/salt/lib/python3.7/site-packages/salt-3002.1-py3.7.egg/salt/ext/tornado/gen.py", line 1070, in run
    yielded = self.gen.send(value)
  File "/opt/salt/lib/python3.7/site-packages/salt-3002.1-py3.7.egg/salt/transport/mixins/auth.py", line 73, in _decode_payload
    payload["load"] = self.auth.crypticle.loads(payload["load"])
  File "/opt/salt/lib/python3.7/site-packages/salt-3002.1-py3.7.egg/salt/crypt.py", line 1541, in loads
    data = self.decrypt(data)
  File "/opt/salt/lib/python3.7/site-packages/salt-3002.1-py3.7.egg/salt/crypt.py", line 1517, in decrypt
    raise AuthenticationError("message authentication failed")
salt.exceptions.AuthenticationError: message authentication failed
godofredoc commented 3 years ago

Added a CL to try to fix this but is not working as expected.

godofredoc commented 3 years ago

Landed a change to fix the error when installing the git.packages state. It turns out that Mac M1s require the brew command to execute with an architecture parameter.

godofredoc commented 3 years ago

Bots have been up and running for the last 2 days. Closing this bug for now, please reopen if they fail again.

github-actions[bot] commented 3 years ago

This thread has been automatically locked since there has not been any recent activity after it was closed. If you are still experiencing a similar issue, please open a new bug, including the output of flutter doctor -v and a minimal reproduction of the issue.