klutchell opened this issue 2 years ago
@klutchell I'm not sure if this is the reason it's failing, but I just noticed that the container names when running on the ARM spot instances are leviathan-v2-template2-worker-1, however we are looking for http://worker. Could it be that this results in the client not being able to contact the worker in the first place, regardless?
Since the client and the worker are both on the same docker network, they can resolve the internal service names. So whatever the service is called in the compose file will be the DNS name for that service on the docker internal network bridge.
The external container names are made unique by docker-compose and by setting COMPOSE_PROJECT to BUILD_TAG.
https://github.com/balena-os/leviathan/blob/master/Makefile#L38
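As a quick illustration (hypothetical, not part of leviathan), a check like this run from inside the client container would confirm that the worker service name resolves on the shared compose network, independent of the unique external container name:

```python
# Hypothetical check, run from inside the client container: the worker should
# resolve by its compose service name via Docker's embedded DNS, even though the
# external container name is something like leviathan-v2-template2-worker-1.
import socket

try:
    print("worker resolves to", socket.gethostbyname("worker"))
except socket.gaierror as err:
    print("could not resolve 'worker':", err)
```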
@klutchell thanks for the knowledge :) I've just realised that this PR introduced some jank to the multi-client: https://github.com/balena-os/leviathan/pull/716, and there's a chance that is what is causing the quick failure instead of retrying. My PR to address it is here: https://github.com/balena-os/leviathan/pull/739, so I'll try to kick off an ARM build with that branch of leviathan.
https://jenkins.product-os.io/job/leviathan-v2-template/506/consoleFull
Runs with https://github.com/balena-os/leviathan/pull/739
Cloud suite: failed at preload:
leviathan-v2-template-core-1 | time="2022-04-26T13:54:54.392833514Z" level=info msg="starting signal loop" namespace=moby path=/run/docker/containerd/daemon/io.containerd.runtime.v2.task/moby/94020cc1f87ea1b3bf28bfd49550d3862baad2cda268784e1f3b389d9a54a55f pid=618
leviathan-v2-template-client-1 | [2022-04-26T13:54:54.442Z][worker-cloud]
leviathan-v2-template-client-1 |
leviathan-v2-template-client-1 | [2022-04-26T13:54:54.636Z][worker-cloud]
leviathan-v2-template-client-1 | [2022-04-26T13:54:54.638Z][worker-cloud]
leviathan-v2-template-client-1 | [2022-04-26T13:54:54.639Z][worker-cloud]
leviathan-v2-template-client-1 | [2022-04-26T13:54:58.101Z][worker-cloud]
leviathan-v2-template-client-1 | Retrying (count=2) /sbin/losetup -f --show --offset 4194304 --sizelimit 41943040 /img/balena.img False
leviathan-v2-template-client-1 | Hint: If using a Virtual Machine, consider increasing the number of processors.
leviathan-v2-template-client-1 | If using Docker Desktop for Windows or macOS, it may require restarting.
leviathan-v2-template-client-1 | [2022-04-26T13:55:04.546Z][worker-cloud]
leviathan-v2-template-client-1 | Retrying (count=3) /sbin/losetup -f --show --offset 4194304 --sizelimit 41943040 /img/balena.img False
leviathan-v2-template-client-1 | Hint: If using a Virtual Machine, consider increasing the number of processors.
leviathan-v2-template-client-1 | If using Docker Desktop for Windows or macOS, it may require restarting.
leviathan-v2-template-client-1 | [2022-04-26T13:55:16.997Z][worker-cloud]
leviathan-v2-template-client-1 | Retrying (count=4) /sbin/losetup -f --show --offset 4194304 --sizelimit 41943040 /img/balena.img False
leviathan-v2-template-client-1 | Hint: If using a Virtual Machine, consider increasing the number of processors.
leviathan-v2-template-client-1 | If using Docker Desktop for Windows or macOS, it may require restarting.
leviathan-v2-template-client-1 | [2022-04-26T13:55:41.446Z][worker-cloud]
leviathan-v2-template-client-1 | Retrying (count=5) /sbin/losetup -f --show --offset 4194304 --sizelimit 41943040 /img/balena.img False
leviathan-v2-template-client-1 | Hint: If using a Virtual Machine, consider increasing the number of processors.
leviathan-v2-template-client-1 | If using Docker Desktop for Windows or macOS, it may require restarting.
leviathan-v2-template-client-1 | [2022-04-26T13:56:11.901Z][worker-cloud]
leviathan-v2-template-client-1 | Retrying (count=6) /sbin/losetup -f --show --offset 4194304 --sizelimit 41943040 /img/balena.img False
leviathan-v2-template-client-1 | Hint: If using a Virtual Machine, consider increasing the number of processors.
leviathan-v2-template-client-1 | If using Docker Desktop for Windows or macOS, it may require restarting.
leviathan-v2-template-client-1 | [2022-04-26T13:56:42.358Z][worker-cloud]
leviathan-v2-template-client-1 | Retrying (count=7) /sbin/losetup -f --show --offset 4194304 --sizelimit 41943040 /img/balena.img False
leviathan-v2-template-client-1 | Hint: If using a Virtual Machine, consider increasing the number of processors.
leviathan-v2-template-client-1 | If using Docker Desktop for Windows or macOS, it may require restarting.
leviathan-v2-template-client-1 | [2022-04-26T13:57:12.827Z][worker-cloud]
leviathan-v2-template-client-1 | Retrying (count=8) /sbin/losetup -f --show --offset 4194304 --sizelimit 41943040 /img/balena.img False
leviathan-v2-template-client-1 | Hint: If using a Virtual Machine, consider increasing the number of processors.
leviathan-v2-template-client-1 | If using Docker Desktop for Windows or macOS, it may require restarting.
leviathan-v2-template-client-1 | [2022-04-26T13:57:43.279Z][worker-cloud]
leviathan-v2-template-client-1 | Retrying (count=9) /sbin/losetup -f --show --offset 4194304 --sizelimit 41943040 /img/balena.img False
leviathan-v2-template-client-1 | Hint: If using a Virtual Machine, consider increasing the number of processors.
leviathan-v2-template-client-1 | If using Docker Desktop for Windows or macOS, it may require restarting.
leviathan-v2-template-client-1 | [2022-04-26T13:58:13.735Z][worker-cloud]
leviathan-v2-template-client-1 | Retrying (count=10) /sbin/losetup -f --show --offset 4194304 --sizelimit 41943040 /img/balena.img False
leviathan-v2-template-client-1 | Hint: If using a Virtual Machine, consider increasing the number of processors.
leviathan-v2-template-client-1 | If using Docker Desktop for Windows or macOS, it may require restarting.
leviathan-v2-template-client-1 | [2022-04-26T13:58:14.172Z][worker-cloud]
leviathan-v2-template-client-1 | [2022-04-26T13:58:14.174Z][worker-cloud]
leviathan-v2-template-client-1 | [2022-04-26T13:58:14.175Z][worker-cloud]
leviathan-v2-template-core-1 | time="2022-04-26T13:58:14.227946319Z" level=info msg="ignoring event" container=94020cc1f87ea1b3bf28bfd49550d3862baad2cda268784e1f3b389d9a54a55f module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
leviathan-v2-template-core-1 | time="2022-04-26T13:58:14.228872058Z" level=info msg="shim disconnected" id=94020cc1f87ea1b3bf28bfd49550d3862baad2cda268784e1f3b389d9a54a55f
leviathan-v2-template-core-1 | time="2022-04-26T13:58:14.231784273Z" level=error msg="copy shim log" error="read /proc/self/fd/12: file already closed"
leviathan-v2-template-client-1 | [2022-04-26T13:58:14.242Z][worker-cloud]
leviathan-v2-template-client-1 | [2022-04-26T13:58:14.243Z][worker-cloud]
leviathan-v2-template-client-1 | [2022-04-26T13:58:14.244Z][worker-cloud]
leviathan-v2-template-client-1 | [2022-04-26T13:58:14.379Z][worker-cloud] An error has occurred executing internal preload command 'get_image_info':
leviathan-v2-template-client-1 | [2022-04-26T13:58:14.380Z][worker-cloud] {"command":"get_image_info","parameters":{}}
leviathan-v2-template-client-1 | [2022-04-26T13:58:14.382Z][worker-cloud]
leviathan-v2-template-client-1 | [2022-04-26T13:58:14.382Z][worker-cloud] Status code: 1
leviathan-v2-template-client-1 | [2022-04-26T13:58:14.383Z][worker-cloud] Error: Traceback (most recent call last):
leviathan-v2-template-client-1 | [2022-04-26T13:58:14.384Z][worker-cloud] File "/usr/src/app/preload.py", line 973, in <module>
leviathan-v2-template-client-1 | [2022-04-26T13:58:14.385Z][worker-cloud] result = method(**data.get("parameters", {}))
leviathan-v2-template-client-1 | [2022-04-26T13:58:14.386Z][worker-cloud] File "/usr/src/app/preload.py", line 922, in get_image_info
leviathan-v2-template-client-1 | [2022-04-26T13:58:14.387Z][worker-cloud] init()
leviathan-v2-template-client-1 | [2022-04-26T13:58:14.388Z][worker-cloud] File "/usr/src/app/preload.py", line 940, in init
leviathan-v2-template-client-1 | [2022-04-26T13:58:14.393Z][worker-cloud] PARTITIONS_CACHE[None] = prepare_global_partitions()
leviathan-v2-template-client-1 | [2022-04-26T13:58:14.394Z][worker-cloud] File "/usr/src/app/preload.py", line 132, in prepare_global_partitions
leviathan-v2-template-client-1 | [2022-04-26T13:58:14.395Z][worker-cloud] return get_partitions(IMAGE)
leviathan-v2-template-client-1 | [2022-04-26T13:58:14.397Z][worker-cloud] File "/usr/src/app/preload.py", line 121, in get_partitions
leviathan-v2-template-client-1 | return {p.label: p for p in PartitionTable(image).partitions if p.label}
leviathan-v2-template-client-1 | File "/usr/src/app/preload.py", line 451, in __init__
leviathan-v2-template-client-1 | part = Partition(self, number, **partition_data)
leviathan-v2-template-client-1 | File "/usr/src/app/preload.py", line 236, in __init__
leviathan-v2-template-client-1 | self.label = self._get_label()
leviathan-v2-template-client-1 | File "/usr/src/app/preload.py", line 239, in _get_label
leviathan-v2-template-client-1 | with self.losetup_context_manager() as device:
leviathan-v2-template-client-1 | File "/usr/lib/python3.8/contextlib.py", line 113, in __enter__
leviathan-v2-template-client-1 | return next(self.gen)
leviathan-v2-template-client-1 | File "/usr/src/app/preload.py", line 149, in losetup_context_manager
leviathan-v2-template-client-1 | device = retry_call(
leviathan-v2-template-client-1 | File "/usr/lib/python3.8/site-packages/retry/api.py", line 101, in retry_call
leviathan-v2-template-client-1 | return __retry_internal(partial(f, *args, **kwargs), exceptions, tries, delay, max_delay, backoff, jitter, logger)
leviathan-v2-template-client-1 | File "/usr/lib/python3.8/site-packages/retry/api.py", line 33, in __retry_internal
leviathan-v2-template-client-1 | return f()
leviathan-v2-template-client-1 | File "/usr/src/app/preload.py", line 112, in wrapped
leviathan-v2-template-client-1 | return func(*args, **kwargs)
leviathan-v2-template-client-1 | File "/usr/lib/python3.8/site-packages/sh.py", line 1427, in __call__
leviathan-v2-template-client-1 | return RunningCommand(cmd, call_args, stdin, stdout, stderr)
leviathan-v2-template-client-1 | File "/usr/lib/python3.8/site-packages/sh.py", line 774, in __init__
leviathan-v2-template-client-1 | self.wait()
leviathan-v2-template-client-1 | File "/usr/lib/python3.8/site-packages/sh.py", line 792, in wait
leviathan-v2-template-client-1 | self.handle_command_exit_code(exit_code)
leviathan-v2-template-client-1 | File "/usr/lib/python3.8/site-packages/sh.py", line 815, in handle_command_exit_code
leviathan-v2-template-client-1 | raise exc
leviathan-v2-template-client-1 | sh.ErrorReturnCode_1:
leviathan-v2-template-client-1 | [2022-04-26T13:58:14.398Z][worker-cloud] RAN: /sbin/losetup -f --show --offset 4194304 --sizelimit 41943040 /img/balena.img
leviathan-v2-template-client-1 | [2022-04-26T13:58:14.399Z][worker-cloud]
leviathan-v2-template-client-1 | [2022-04-26T13:58:14.400Z][worker-cloud] STDOUT:
leviathan-v2-template-client-1 | [2022-04-26T13:58:14.401Z][worker-cloud]
leviathan-v2-template-client-1 | [2022-04-26T13:58:14.403Z][worker-cloud]
leviathan-v2-template-client-1 | [2022-04-26T13:58:14.404Z][worker-cloud] STDERR:
leviathan-v2-template-client-1 | [2022-04-26T13:58:14.404Z][worker-cloud] losetup: /img/balena.img: failed to set up loop device: No such file or directory
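For context, the retry pattern the traceback above walks through (retry.retry_call wrapping an sh-invoked losetup) looks roughly like the sketch below. This is a simplified illustration only, not the actual preload.py code:

```python
# Simplified sketch of the retry-around-losetup pattern seen in the traceback;
# not the real preload.py, which also logs the "Retrying (count=N)" hints seen
# above. The final failed attempt surfaces as sh.ErrorReturnCode_1.
from contextlib import contextmanager

import sh                       # runs /sbin/losetup as a Python call
from retry.api import retry_call


@contextmanager
def losetup_context_manager(image, offset, sizelimit, tries=10, delay=1, backoff=2):
    # Attach the image to a free loop device, retrying with backoff on failure.
    device = str(retry_call(
        sh.losetup,
        fargs=["-f", "--show", "--offset", offset, "--sizelimit", sizelimit, image],
        tries=tries,
        delay=delay,
        backoff=backoff,
    )).strip()
    try:
        yield device
    finally:
        sh.losetup("-d", device)  # detach the loop device when done
```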
Hup suite: fails at the beginning:
leviathan-v2-template-client-1 | [2022-04-26T13:50:58.982Z][worker-hup] Fetching balenaOS version 2.12.4+rev1.dev, attempt 1...
leviathan-v2-template-client-1 | [2022-04-26T13:50:59.317Z][worker-hup] TypeError: Cannot read property 'on' of undefined
This is because we don't have a balenaOS image in the cloud for generic-aarch64 to HUP from. I think @jakogut will have one published when he is done with the image type deployment improvement.
Looks like the OS suite fails because the device never comes online:
leviathan-v2-template-client-1 | [2022-04-26T16:49:42.136Z][worker-os] Getting ip of dut
leviathan-v2-template-client-1 | [2022-04-26T16:59:36.358Z][worker-os] # Subtest: Unmanaged BalenaOS release suite
leviathan-v2-template-client-1 | [2022-04-26T16:59:36.360Z][worker-os] not ok 1 - 500 - "Could not resolve 55fa10b.local"
This could be because the device isn't booting, or potentially because it's very slow - I noticed:
KVM is unavailable, falling back on software emulation
From the logs I have above, the device is starting, but very slowly, because we don't have KVM on AWS. We understood when going this route that it would be very slow, comparable to a Pi.
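As a hypothetical illustration (not leviathan code) of why emulation is so slow here: QEMU-based workers generally use /dev/kvm for hardware acceleration and fall back to software emulation (TCG) when it is absent, which is what the "KVM is unavailable" message above indicates:

```python
# Hypothetical check, not leviathan code: pick KVM acceleration only when
# /dev/kvm exists and is writable; otherwise fall back to software emulation,
# which is roughly what produces the "KVM is unavailable" message above.
import os

accel = "kvm" if os.path.exists("/dev/kvm") and os.access("/dev/kvm", os.W_OK) else "tcg"
print(f"qemu acceleration: {accel}")
```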
Did we ever resolve this @klutchell?
No, but it hasn't been an issue recently. We could add a snippet to multi-client to wait for the core/status endpoint to be reachable.
See: https://github.com/balena-os/leviathan/pull/734#issuecomment-1108864465
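As a rough sketch of that suggestion (the real multi-client is Node.js, and the /status path and timings here are assumptions), the client could poll the core before starting suites:

```python
# Rough illustration of the "wait for the core status endpoint" idea; the URL,
# path, and timings are assumptions, not the actual multi-client behaviour.
import time

import requests


def wait_for_core(url="http://core/status", timeout=300, interval=5):
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            if requests.get(url, timeout=5).ok:
                return
        except requests.RequestException:
            pass
        time.sleep(interval)
    raise TimeoutError(f"core was not reachable at {url} within {timeout}s")
```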
I think the worker started up too slowly and the client didn't retry after failure? https://jenkins.product-os.io/view/tests/job/leviathan-v2-template/476/console