balena-os / leviathan

A distributed hardware testing framework
https://balena-os.github.io/leviathan
Apache License 2.0

If the QEMU worker starts too slowly, the client will fail after one connection attempt #736

Open klutchell opened 2 years ago

klutchell commented 2 years ago

See: https://github.com/balena-os/leviathan/pull/734#issuecomment-1108864465

I think the worker started up too slowly and the client didn't retry after failure? https://jenkins.product-os.io/view/tests/job/leviathan-v2-template/476/console

leviathan-v2-template2-client-1  | [2022-04-25T17:39:26.795Z] INFO: Computing Run Queue
leviathan-v2-template2-client-1  | [2022-04-25T17:39:26.803Z] INFO: [Running Queue] Suites currently in queue: os,hup,cloud
leviathan-v2-template2-client-1  | [2022-04-25T17:39:26.834Z] INFO: Couldn't retrieve http://worker/ worker's state. Querying http://worker/ and received RequestError: undefined
leviathan-v2-template2-client-1  | [2022-04-25T17:39:26.863Z] INFO: ERROR ENCOUNTERED: Error: connect ECONNREFUSED 172.18.0.2:80. 
leviathan-v2-template2-client-1  |  Killing process in 10 seconds...
leviathan-v2-template2-worker-1  | Network interface eth1 is not available
leviathan-v2-template2-worker-1  | Network interface wlan0 is not available
leviathan-v2-template2-worker-1  | QEMU options:
leviathan-v2-template2-worker-1  | {
leviathan-v2-template2-worker-1  |   architecture: 'aarch64',
leviathan-v2-template2-worker-1  |   cpus: '4',
leviathan-v2-template2-worker-1  |   memory: '2G',
leviathan-v2-template2-worker-1  |   debug: false,
leviathan-v2-template2-worker-1  |   network: {
leviathan-v2-template2-worker-1  |     autoconfigure: true,
leviathan-v2-template2-worker-1  |     bridgeName: null,
leviathan-v2-template2-worker-1  |     bridgeAddress: null,
leviathan-v2-template2-worker-1  |     dhcpRange: null,
leviathan-v2-template2-worker-1  |     vncPort: null,
leviathan-v2-template2-worker-1  |     qmpPort: null,
leviathan-v2-template2-worker-1  |     vncMinPort: 5900,
leviathan-v2-template2-worker-1  |     vncMaxPort: 5999,
leviathan-v2-template2-worker-1  |     qmpMinPort: 5000,
leviathan-v2-template2-worker-1  |     qmpMaxPort: 5899
leviathan-v2-template2-worker-1  |   }
leviathan-v2-template2-worker-1  | }
leviathan-v2-template2-worker-1  | API key not available...
leviathan-v2-template2-worker-1  | Found UEFI firmware: {
leviathan-v2-template2-worker-1  |   "code": "/usr/share/OVMF/QEMU_EFI.fd",
leviathan-v2-template2-worker-1  |   "vars": "/usr/share/OVMF/QEMU_VARS.fd"
leviathan-v2-template2-worker-1  | }
leviathan-v2-template2-worker-1  | Worker http listening on port 80
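The log above shows the worker only began listening on port 80 after the client had already given up on its single `ECONNREFUSED` attempt. One way to tolerate a slow-starting worker is a retry wrapper around the state query; a minimal sketch (the names here are illustrative, not Leviathan's actual API):

```typescript
// Illustrative sketch only -- retryWithDelay is not a Leviathan function.
// Retries an async operation a fixed number of times, pausing between
// attempts so a slow-starting worker has time to come up.
async function retryWithDelay<T>(
  fn: () => Promise<T>,
  attempts: number,
  delayMs: number,
): Promise<T> {
  let lastError: unknown;
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err; // e.g. connect ECONNREFUSED while the worker boots
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
  throw lastError;
}
```

With something like this, the client's `http://worker/` query would survive a worker that takes several seconds to start listening instead of killing the run after one failed attempt.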
rcooke-warwick commented 2 years ago

@klutchell I'm not sure if this is the reason it's failing, but I just noticed that the container names when running on the ARM spot instances are leviathan-v2-template2-worker-1, whereas we are querying http://worker. Could it be that the client can't contact the worker in the first place, regardless of timing?

klutchell commented 2 years ago

> @klutchell I'm not sure if this is the reason it's failing, but I just noticed that the container names when running on the ARM spot instances are leviathan-v2-template2-worker-1, whereas we are querying http://worker. Could it be that the client can't contact the worker in the first place, regardless of timing?

Since the client and the worker are on the same Docker network, they can resolve each other's service names: whatever a service is called in the compose file becomes its DNS name on the internal bridge network.

The external container names are made unique by docker-compose by setting COMPOSE_PROJECT to BUILD_TAG: https://github.com/balena-os/leviathan/blob/master/Makefile#L38

rcooke-warwick commented 2 years ago

@klutchell thanks for the knowledge :) I've just realised that this PR introduced some jank to the multi-client: https://github.com/balena-os/leviathan/pull/716, and there's a chance that is what is causing the client to fail quickly instead of retrying. My PR to address it is here: https://github.com/balena-os/leviathan/pull/739, so I'll try to kick off an ARM build with that branch of leviathan.

rcooke-warwick commented 2 years ago

https://jenkins.product-os.io/job/leviathan-v2-template/506/consoleFull

Runs with https://github.com/balena-os/leviathan/pull/739

Cloud suite: failed at preload:

leviathan-v2-template-core-1    | time="2022-04-26T13:54:54.392833514Z" level=info msg="starting signal loop" namespace=moby path=/run/docker/containerd/daemon/io.containerd.runtime.v2.task/moby/94020cc1f87ea1b3bf28bfd49550d3862baad2cda268784e1f3b389d9a54a55f pid=618
leviathan-v2-template-client-1  | [2022-04-26T13:54:54.442Z][worker-cloud] 
leviathan-v2-template-client-1  | 
leviathan-v2-template-client-1  | [2022-04-26T13:54:54.636Z][worker-cloud] 
leviathan-v2-template-client-1  | [2022-04-26T13:54:54.638Z][worker-cloud] 
leviathan-v2-template-client-1  | [2022-04-26T13:54:54.639Z][worker-cloud] 
leviathan-v2-template-client-1  | [2022-04-26T13:54:58.101Z][worker-cloud] 
leviathan-v2-template-client-1  | Retrying (count=2) /sbin/losetup -f --show --offset 4194304 --sizelimit 41943040 /img/balena.img False
leviathan-v2-template-client-1  | Hint: If using a Virtual Machine, consider increasing the number of processors.
leviathan-v2-template-client-1  | If using Docker Desktop for Windows or macOS, it may require restarting.
leviathan-v2-template-client-1  | [2022-04-26T13:55:04.546Z][worker-cloud] 
leviathan-v2-template-client-1  | Retrying (count=3) /sbin/losetup -f --show --offset 4194304 --sizelimit 41943040 /img/balena.img False
leviathan-v2-template-client-1  | Hint: If using a Virtual Machine, consider increasing the number of processors.
leviathan-v2-template-client-1  | If using Docker Desktop for Windows or macOS, it may require restarting.
leviathan-v2-template-client-1  | [2022-04-26T13:55:16.997Z][worker-cloud] 
leviathan-v2-template-client-1  | Retrying (count=4) /sbin/losetup -f --show --offset 4194304 --sizelimit 41943040 /img/balena.img False
leviathan-v2-template-client-1  | Hint: If using a Virtual Machine, consider increasing the number of processors.
leviathan-v2-template-client-1  | If using Docker Desktop for Windows or macOS, it may require restarting.
leviathan-v2-template-client-1  | [2022-04-26T13:55:41.446Z][worker-cloud] 
leviathan-v2-template-client-1  | Retrying (count=5) /sbin/losetup -f --show --offset 4194304 --sizelimit 41943040 /img/balena.img False
leviathan-v2-template-client-1  | Hint: If using a Virtual Machine, consider increasing the number of processors.
leviathan-v2-template-client-1  | If using Docker Desktop for Windows or macOS, it may require restarting.
leviathan-v2-template-client-1  | [2022-04-26T13:56:11.901Z][worker-cloud] 
leviathan-v2-template-client-1  | Retrying (count=6) /sbin/losetup -f --show --offset 4194304 --sizelimit 41943040 /img/balena.img False
leviathan-v2-template-client-1  | Hint: If using a Virtual Machine, consider increasing the number of processors.
leviathan-v2-template-client-1  | If using Docker Desktop for Windows or macOS, it may require restarting.
leviathan-v2-template-client-1  | [2022-04-26T13:56:42.358Z][worker-cloud] 
leviathan-v2-template-client-1  | Retrying (count=7) /sbin/losetup -f --show --offset 4194304 --sizelimit 41943040 /img/balena.img False
leviathan-v2-template-client-1  | Hint: If using a Virtual Machine, consider increasing the number of processors.
leviathan-v2-template-client-1  | If using Docker Desktop for Windows or macOS, it may require restarting.
leviathan-v2-template-client-1  | [2022-04-26T13:57:12.827Z][worker-cloud] 
leviathan-v2-template-client-1  | Retrying (count=8) /sbin/losetup -f --show --offset 4194304 --sizelimit 41943040 /img/balena.img False
leviathan-v2-template-client-1  | Hint: If using a Virtual Machine, consider increasing the number of processors.
leviathan-v2-template-client-1  | If using Docker Desktop for Windows or macOS, it may require restarting.
leviathan-v2-template-client-1  | [2022-04-26T13:57:43.279Z][worker-cloud] 
leviathan-v2-template-client-1  | Retrying (count=9) /sbin/losetup -f --show --offset 4194304 --sizelimit 41943040 /img/balena.img False
leviathan-v2-template-client-1  | Hint: If using a Virtual Machine, consider increasing the number of processors.
leviathan-v2-template-client-1  | If using Docker Desktop for Windows or macOS, it may require restarting.
leviathan-v2-template-client-1  | [2022-04-26T13:58:13.735Z][worker-cloud] 
leviathan-v2-template-client-1  | Retrying (count=10) /sbin/losetup -f --show --offset 4194304 --sizelimit 41943040 /img/balena.img False
leviathan-v2-template-client-1  | Hint: If using a Virtual Machine, consider increasing the number of processors.
leviathan-v2-template-client-1  | If using Docker Desktop for Windows or macOS, it may require restarting.
leviathan-v2-template-client-1  | [2022-04-26T13:58:14.172Z][worker-cloud] 
leviathan-v2-template-client-1  | [2022-04-26T13:58:14.174Z][worker-cloud] 
leviathan-v2-template-client-1  | [2022-04-26T13:58:14.175Z][worker-cloud] 
leviathan-v2-template-core-1    | time="2022-04-26T13:58:14.227946319Z" level=info msg="ignoring event" container=94020cc1f87ea1b3bf28bfd49550d3862baad2cda268784e1f3b389d9a54a55f module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
leviathan-v2-template-core-1    | time="2022-04-26T13:58:14.228872058Z" level=info msg="shim disconnected" id=94020cc1f87ea1b3bf28bfd49550d3862baad2cda268784e1f3b389d9a54a55f
leviathan-v2-template-core-1    | time="2022-04-26T13:58:14.231784273Z" level=error msg="copy shim log" error="read /proc/self/fd/12: file already closed"
leviathan-v2-template-client-1  | [2022-04-26T13:58:14.242Z][worker-cloud] 
leviathan-v2-template-client-1  | [2022-04-26T13:58:14.243Z][worker-cloud] 
leviathan-v2-template-client-1  | [2022-04-26T13:58:14.244Z][worker-cloud] 
leviathan-v2-template-client-1  | [2022-04-26T13:58:14.379Z][worker-cloud] An error has occurred executing internal preload command 'get_image_info':
leviathan-v2-template-client-1  | [2022-04-26T13:58:14.380Z][worker-cloud] {"command":"get_image_info","parameters":{}}
leviathan-v2-template-client-1  | [2022-04-26T13:58:14.382Z][worker-cloud] 
leviathan-v2-template-client-1  | [2022-04-26T13:58:14.382Z][worker-cloud] Status code: 1
leviathan-v2-template-client-1  | [2022-04-26T13:58:14.383Z][worker-cloud] Error: Traceback (most recent call last):
leviathan-v2-template-client-1  | [2022-04-26T13:58:14.384Z][worker-cloud]   File "/usr/src/app/preload.py", line 973, in <module>
leviathan-v2-template-client-1  | [2022-04-26T13:58:14.385Z][worker-cloud]     result = method(**data.get("parameters", {}))
leviathan-v2-template-client-1  | [2022-04-26T13:58:14.386Z][worker-cloud]   File "/usr/src/app/preload.py", line 922, in get_image_info
leviathan-v2-template-client-1  | [2022-04-26T13:58:14.387Z][worker-cloud]     init()
leviathan-v2-template-client-1  | [2022-04-26T13:58:14.388Z][worker-cloud]   File "/usr/src/app/preload.py", line 940, in init
leviathan-v2-template-client-1  | [2022-04-26T13:58:14.393Z][worker-cloud]     PARTITIONS_CACHE[None] = prepare_global_partitions()
leviathan-v2-template-client-1  | [2022-04-26T13:58:14.394Z][worker-cloud]   File "/usr/src/app/preload.py", line 132, in prepare_global_partitions
leviathan-v2-template-client-1  | [2022-04-26T13:58:14.395Z][worker-cloud]     return get_partitions(IMAGE)
leviathan-v2-template-client-1  | [2022-04-26T13:58:14.397Z][worker-cloud]   File "/usr/src/app/preload.py", line 121, in get_partitions
leviathan-v2-template-client-1  |     return {p.label: p for p in PartitionTable(image).partitions if p.label}
leviathan-v2-template-client-1  |   File "/usr/src/app/preload.py", line 451, in __init__
leviathan-v2-template-client-1  |     part = Partition(self, number, **partition_data)
leviathan-v2-template-client-1  |   File "/usr/src/app/preload.py", line 236, in __init__
leviathan-v2-template-client-1  |     self.label = self._get_label()
leviathan-v2-template-client-1  |   File "/usr/src/app/preload.py", line 239, in _get_label
leviathan-v2-template-client-1  |     with self.losetup_context_manager() as device:
leviathan-v2-template-client-1  |   File "/usr/lib/python3.8/contextlib.py", line 113, in __enter__
leviathan-v2-template-client-1  |     return next(self.gen)
leviathan-v2-template-client-1  |   File "/usr/src/app/preload.py", line 149, in losetup_context_manager
leviathan-v2-template-client-1  |     device = retry_call(
leviathan-v2-template-client-1  |   File "/usr/lib/python3.8/site-packages/retry/api.py", line 101, in retry_call
leviathan-v2-template-client-1  |     return __retry_internal(partial(f, *args, **kwargs), exceptions, tries, delay, max_delay, backoff, jitter, logger)
leviathan-v2-template-client-1  |   File "/usr/lib/python3.8/site-packages/retry/api.py", line 33, in __retry_internal
leviathan-v2-template-client-1  |     return f()
leviathan-v2-template-client-1  |   File "/usr/src/app/preload.py", line 112, in wrapped
leviathan-v2-template-client-1  |     return func(*args, **kwargs)
leviathan-v2-template-client-1  |   File "/usr/lib/python3.8/site-packages/sh.py", line 1427, in __call__
leviathan-v2-template-client-1  |     return RunningCommand(cmd, call_args, stdin, stdout, stderr)
leviathan-v2-template-client-1  |   File "/usr/lib/python3.8/site-packages/sh.py", line 774, in __init__
leviathan-v2-template-client-1  |     self.wait()
leviathan-v2-template-client-1  |   File "/usr/lib/python3.8/site-packages/sh.py", line 792, in wait
leviathan-v2-template-client-1  |     self.handle_command_exit_code(exit_code)
leviathan-v2-template-client-1  |   File "/usr/lib/python3.8/site-packages/sh.py", line 815, in handle_command_exit_code
leviathan-v2-template-client-1  |     raise exc
leviathan-v2-template-client-1  | sh.ErrorReturnCode_1:
leviathan-v2-template-client-1  | [2022-04-26T13:58:14.398Z][worker-cloud]   RAN: /sbin/losetup -f --show --offset 4194304 --sizelimit 41943040 /img/balena.img
leviathan-v2-template-client-1  | [2022-04-26T13:58:14.399Z][worker-cloud] 
leviathan-v2-template-client-1  | [2022-04-26T13:58:14.400Z][worker-cloud]   STDOUT:
leviathan-v2-template-client-1  | [2022-04-26T13:58:14.401Z][worker-cloud] 
leviathan-v2-template-client-1  | [2022-04-26T13:58:14.403Z][worker-cloud] 
leviathan-v2-template-client-1  | [2022-04-26T13:58:14.404Z][worker-cloud]   STDERR:
leviathan-v2-template-client-1  | [2022-04-26T13:58:14.404Z][worker-cloud] losetup: /img/balena.img: failed to set up loop device: No such file or directory

Hup suite: fails at the beginning:

leviathan-v2-template-client-1  | [2022-04-26T13:50:58.982Z][worker-hup] Fetching balenaOS version 2.12.4+rev1.dev, attempt 1...
leviathan-v2-template-client-1  | [2022-04-26T13:50:59.317Z][worker-hup] TypeError: Cannot read property 'on' of undefined
klutchell commented 2 years ago

> Hup suite: fails at the beginning

This is because we don't have a balenaOS image in the cloud for generic-aarch64 to HUP from. I think @jakogut will have one published when he is done with the image type deployment improvement.

rcooke-warwick commented 2 years ago

Looks like the OS suite fails because the device never comes online:

leviathan-v2-template-client-1  | [2022-04-26T16:49:42.136Z][worker-os] Getting ip of dut

leviathan-v2-template-client-1  | [2022-04-26T16:59:36.358Z][worker-os] # Subtest: Unmanaged BalenaOS release suite
leviathan-v2-template-client-1  | [2022-04-26T16:59:36.360Z][worker-os]     not ok 1 - 500 - "Could not resolve 55fa10b.local"

This could be because the device isn't booting, or because it's very slow. I noticed in the logs: `KVM is unavailable, falling back on software emulation`

klutchell commented 2 years ago

> or potentially because it's very slow

From the logs above, the device is starting, but very slowly, because we don't have KVM on AWS. We understood when going this route that it would be very slow, comparable to a Pi.

rcooke-warwick commented 2 years ago

Did we ever resolve this, @klutchell?

klutchell commented 2 years ago

No, but it hasn't been an issue recently. We could add a snippet to multi-client to wait for the core/status endpoint to be reachable.
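A sketch of that suggestion, assuming Node 18+ with the global `fetch`; the endpoint path and function name are assumptions for illustration, not the actual multi-client code:

```typescript
// Hypothetical readiness gate: poll an HTTP endpoint (e.g. the worker's
// status route) until it answers, or fail after a deadline.
async function waitForEndpoint(
  url: string,
  timeoutMs: number,
  intervalMs = 1000,
): Promise<void> {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    try {
      const res = await fetch(url);
      if (res.ok) return; // endpoint reachable, safe to start the run
    } catch {
      // ECONNREFUSED etc.: the service is not listening yet, keep polling
    }
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  throw new Error(`Timed out after ${timeoutMs}ms waiting for ${url}`);
}
```

Gating the run queue on something like `await waitForEndpoint(...)` would turn a slow worker start into a bounded wait rather than an immediate failure.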