Closed wael-sadek closed 1 year ago
Thanks for raising this issue - I'm not super familiar with this CSI driver, but I'm wondering if it's related to this issue in the upstream kubernetes SIG?
https://github.com/kubernetes-sigs/aws-fsx-csi-driver/issues/243
The error `Are the lustre modules loaded?` makes me think that the mount is attempted before the daemonset/modules are ready. Are you able to apply a taint that will wait for the daemonset to be ready?
I can also spin up a quick cluster and give this a try.
> Makes me think that the mount is attempted before the daemonset/modules are ready. Are you able to apply a taint that will wait for the daemonset to be ready?
The daemonset was running, that was checked.
I'm getting similar results on my test cluster running through the basic tutorial. I'm not sure if this is related to bottlerocket specifically, but we can get more insight from that team in a new issue I've created: https://github.com/kubernetes-sigs/aws-fsx-csi-driver/issues/289
Hello @wael-sadek, for your v1.23 testing, did you use the same Bottlerocket version as with v1.24? You can check the Bottlerocket version by running `apiclient get os` from the control container. I'm asking because I want to pin down whether the problem is with a specific Bottlerocket version or with the v1.24 kubelet sources.
@arnaldo2792
[ssm-user@control]$ apiclient get os
{
  "os": {
    "arch": "x86_64",
    "build_id": "104f8e0f",
    "pretty_name": "Bottlerocket OS 1.11.1 (aws-k8s-1.23)",
    "variant_id": "aws-k8s-1.23",
    "version_id": "1.11.1"
  }
}
[ssm-user@control]$ apiclient get os
{
  "os": {
    "arch": "x86_64",
    "build_id": "104f8e0f",
    "pretty_name": "Bottlerocket OS 1.11.1 (aws-k8s-1.24)",
    "variant_id": "aws-k8s-1.24",
    "version_id": "1.11.1"
  }
}
So both nodes are on the same Bottlerocket version. I wonder why it works with v1.23 but not with v1.24? Could it be some customization done along with the update?
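For anyone repeating this comparison, here is a minimal sketch that diffs two `apiclient get os` outputs field by field. The helper name `diff_os_info` is hypothetical, and the two JSON strings are simply the outputs pasted above.

```python
import json

# The two `apiclient get os` outputs from this thread, pasted as strings.
NODE_123 = """
{"os": {"arch": "x86_64", "build_id": "104f8e0f",
        "pretty_name": "Bottlerocket OS 1.11.1 (aws-k8s-1.23)",
        "variant_id": "aws-k8s-1.23", "version_id": "1.11.1"}}
"""
NODE_124 = """
{"os": {"arch": "x86_64", "build_id": "104f8e0f",
        "pretty_name": "Bottlerocket OS 1.11.1 (aws-k8s-1.24)",
        "variant_id": "aws-k8s-1.24", "version_id": "1.11.1"}}
"""


def diff_os_info(a: str, b: str) -> dict:
    """Return {field: (value_a, value_b)} for every "os" field that differs."""
    os_a = json.loads(a)["os"]
    os_b = json.loads(b)["os"]
    keys = sorted(set(os_a) | set(os_b))
    return {k: (os_a.get(k), os_b.get(k)) for k in keys if os_a.get(k) != os_b.get(k)}


if __name__ == "__main__":
    for field, (old, new) in diff_os_info(NODE_123, NODE_124).items():
        print(f"{field}: {old} -> {new}")
```

On these two outputs only the variant-related fields differ, while `version_id` and `build_id` match, which is what points away from the Bottlerocket release itself and toward something variant-specific.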
I built new `1.24` variants locally and deployed those into a cluster:
❯ kubectl get nodes
NAME STATUS ROLES AGE VERSION
ip-192-168-146-131.us-west-2.compute.internal Ready <none> 76m v1.24.9-eks-4f83af2
ip-192-168-5-150.us-west-2.compute.internal Ready <none> 76m v1.24.9-eks-4f83af2
Then, I went through the sample application in the tutorial and was able to successfully deploy the filesystem:
❯ kubectl exec -ti fsx-app -- df -h
Filesystem Size Used Avail Use% Mounted on
overlay 20G 3.1G 16G 17% /
tmpfs 64M 0 64M 0% /dev
tmpfs 3.8G 0 3.8G 0% /sys/fs/cgroup
192.168.93.193@tcp:/i3qsxbev 1.1T 7.8M 1.1T 1% /data
/dev/nvme1n1p1 20G 3.1G 16G 17% /etc/hosts
shm 64M 0 64M 0% /dev/shm
tmpfs 6.9G 12K 6.9G 1% /run/secrets/kubernetes.io/serviceaccount
tmpfs 3.8G 0 3.8G 0% /proc/acpi
tmpfs 3.8G 0 3.8G 0% /proc/scsi
tmpfs 3.8G 0 3.8G 0% /sys/firmware
❯ kubectl exec -it fsx-app -- ls /data
out.txt
We uncovered that this was related to kernel config options that were not set for the 5.15 kernel; they were corrected in a recent patch: https://github.com/bottlerocket-os/bottlerocket/pull/2569
You'll notice in that diff:
# update lustrefs client,aarch64_aws,aarch64_metal,x86_64_aws,x86_64_metal
+LUSTREFSX_FS m,x,x,x,x
 LUSTREFSX_LIBCFS n -> m,x,x,x,x
 LUSTREFSX_LNET n -> m,x,x,x,x
+LUSTREFSX_LNET_SELFTEST m,x,x,x,x
+LUSTREFSX_LNET_XPRT_IB n,x,x,x,x
+LUSTRE_DEBUG_EXPENSIVE_CHECK n,x,x,x,x
So, this should be fixed in the next version of Bottlerocket thanks to changes in the kernel above! Cc @markusboehme Feel free to let us know if you need anything else on this!!
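To see at a glance which options that diff turns into modules, here is a rough sketch that parses the excerpt. The line format is inferred only from the lines quoted here, the function name `parse_config_diff` is hypothetical, and only the first comma-separated value is read (the remaining per-platform columns are ignored rather than interpreted).

```python
DIFF_EXCERPT = """\
# update lustrefs client,aarch64_aws,aarch64_metal,x86_64_aws,x86_64_metal
+LUSTREFSX_FS m,x,x,x,x
 LUSTREFSX_LIBCFS n -> m,x,x,x,x
 LUSTREFSX_LNET n -> m,x,x,x,x
+LUSTREFSX_LNET_SELFTEST m,x,x,x,x
+LUSTREFSX_LNET_XPRT_IB n,x,x,x,x
+LUSTRE_DEBUG_EXPENSIVE_CHECK n,x,x,x,x
"""


def parse_config_diff(text: str) -> dict:
    """Map each option name to (old_value_or_None, new_value).

    Assumed format: "+NAME new,..." for added options and
    " NAME old -> new,..." for changed ones. Comment lines are skipped.
    """
    changes = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        line = line.lstrip("+").strip()
        name, _, rest = line.partition(" ")
        if "->" in rest:
            old, _, new = rest.partition("->")
            old = old.strip()
        else:
            old, new = None, rest
        # Keep only the first column; the others are not interpreted here.
        changes[name] = (old, new.strip().split(",")[0])
    return changes


if __name__ == "__main__":
    changes = parse_config_diff(DIFF_EXCERPT)
    as_modules = [name for name, (_, new) in changes.items() if new == "m"]
    print("built as modules after the patch:", ", ".join(as_modules))
```

The two `n -> m` entries are the interesting ones: they were explicitly disabled before the patch and are built as modules after it.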
Re-opening just so we can track this going into our v1.12.0 🚀
Thanks @jpmcb, do you know when the fixed AMI will be available on AWS?
An intermediate bugfix release would be appreciated if 1.12.0 still takes some time. We're currently seeing some autoscaling issues that could potentially originate from the node versions being out of sync with the control plane (due to the FSx issue reported here).
Is there any rough date for a new release, be it a feature or patch release?
Thanks for checking in on this - We are targeting the end of January (very soon!) for the 1.12.0 release and we already have the `1.12.x` branch cut.
If you're interested in building an AMI today for testing, you can check out that branch and use `cargo make -e BUILDSYS_VARIANT=aws-k8s-1.24` to build the image locally and `cargo make -e BUILDSYS_VARIANT=aws-k8s-1.24 ami` to register it as an AMI. More details in our BUILDING.md.
Thanks for the update @jpmcb!
I tried building locally over the last few days, but the build failed at the end when compiling `os`.
The build environment was a Mac M1 Pro with Docker 20.10.22 and Rust 1.66.1, via `cargo make -e BUILDSYS_ARCH=x86_64`.
Probably not worth troubleshooting if 1.12.0 is already on the way. Looking forward to it!
Hi @pat-s , could you please share the error you are getting while building Bottlerocket? (I don't have a Mac, so I'm not able to reproduce :sweat_smile: )
Sure, here's the log. Not sure where the error comes from, AFAICS everything should be built in docker and hence machine-agnostic? Not sure though where the local rust part comes in...
#21 236.6 error: could not compile `sundog`
#21 236.6
#21 236.6 Caused by:
#21 236.6 process didn't exit successfully: `/usr/bin/rustc --crate-name sundog --edition=2018 api/sundog/src/main.rs --error-format=json --json=diagnostic-rendered-ansi,artifacts,future-incompat --crate-type bin --emit=dep-info,link -C opt-level=3 -C embed-bitcode=no -C debuginfo=2 -C metadata=f1dda1c0b15e0c90 -C extra-filename=-f1dda1c0b15e0c90 --out-dir /home/builder/.cache/x86_64-bottlerocket-linux-gnu/release/deps --target x86_64-bottlerocket-linux-gnu -C linker=/usr/bin/x86_64-bottlerocket-linux-gnu-gcc -L dependency=/home/builder/.cache/x86_64-bottlerocket-linux-gnu/release/deps -L dependency=/home/builder/.cache/release/deps --extern apiclient=/home/builder/.cache/x86_64-bottlerocket-linux-gnu/release/deps/libapiclient-cee067d451d7b166.rlib --extern constants=/home/builder/.cache/x86_64-bottlerocket-linux-gnu/release/deps/libconstants-37ede6ace73bb42f.rlib --extern datastore=/home/builder/.cache/x86_64-bottlerocket-linux-gnu/release/deps/libdatastore-dbb12debb9ff125d.rlib --extern http=/home/builder/.cache/x86_64-bottlerocket-linux-gnu/release/deps/libhttp-68066c2bc6086886.rlib --extern log=/home/builder/.cache/x86_64-bottlerocket-linux-gnu/release/deps/liblog-56fb0fccdab81be0.rlib --extern model=/home/builder/.cache/x86_64-bottlerocket-linux-gnu/release/deps/libmodel-0c570bc5f68d5a14.rlib --extern serde=/home/builder/.cache/x86_64-bottlerocket-linux-gnu/release/deps/libserde-0fe5639f2e2024f0.rlib --extern serde_json=/home/builder/.cache/x86_64-bottlerocket-linux-gnu/release/deps/libserde_json-43d9398d57cbcc08.rlib --extern simplelog=/home/builder/.cache/x86_64-bottlerocket-linux-gnu/release/deps/libsimplelog-192c7c07a4f82e7a.rlib --extern snafu=/home/builder/.cache/x86_64-bottlerocket-linux-gnu/release/deps/libsnafu-8194cd9e9fb48158.rlib --extern tokio=/home/builder/.cache/x86_64-bottlerocket-linux-gnu/release/deps/libtokio-0c8ac197f1b09636.rlib -Cprefer-dynamic -Copt-level=3 -Cdebuginfo=2 -Ccodegen-units=1 -Clink-arg=-Wl,-z,relro,-z,now -L 
native=/home/builder/.cache/x86_64-bottlerocket-linux-gnu/release/build/ring-e73751118ec72435/out` (signal: 9, SIGKILL: kill)
#21 236.6 warning: build failed, waiting for other jobs to finish...
#21 372.4 error: Bad exit status from /var/tmp/rpm-tmp.t16z5j (%build)
#21 372.4
#21 372.4 RPM build errors:
#21 372.4 Bad exit status from /var/tmp/rpm-tmp.t16z5j (%build)
#21 ERROR: executor failed running [/bin/sh -c rpmbuild -ba --clean --undefine _auto_set_build_flags rpmbuild/SPECS/${PACKAGE}.spec]: exit code: 1
------
> [rpmbuild 7/7] RUN --mount=source=.cargo,target=/home/builder/.cargo --mount=type=cache,target=/home/builder/.cache,from=cache,source=/cache --mount=type=cache,target=/home/builder/rpmbuild/BUILD/sources/models/src/variant,from=variantcache,source=/variantcache --mount=type=cache,target=/home/builder/rpmbuild/BUILD/sources/logdog/conf/current,from=variantcache,source=/variantcache --mount=source=sources,target=/home/builder/rpmbuild/BUILD/sources rpmbuild -ba --clean --undefine _auto_set_build_flags rpmbuild/SPECS/os.spec:
------
executor failed running [/bin/sh -c rpmbuild -ba --clean --undefine _auto_set_build_flags rpmbuild/SPECS/${PACKAGE}.spec]: exit code: 1
--- stderr
BuildAttempt: Failed to execute command: 'docker build . --network none --target package --tag buildsys-pkg-os-x86_64-d4ae984f528b --build-arg PACKAGE=os --build-arg ARCH=x86_64 --build-arg GOARCH=amd64 --build-arg VARIANT=aws-k8s-1.24 --build-arg VARIANT_PLATFORM=aws --build-arg VARIANT_RUNTIME=k8s --build-arg VARIANT_FAMILY=aws-k8s --build-arg VARIANT_FLAVOR= --build-arg REPO=default --build-arg SDK=public.ecr.aws/bottlerocket/bottlerocket-sdk-x86_64:v0.29.0 --build-arg TOOLCHAIN=public.ecr.aws/bottlerocket/bottlerocket-toolchain-x86_64:v0.29.0 --build-arg NOCACHE=3534030891 --build-arg TOKEN=d4ae984f528b'
[cargo-make] ERROR - Error while executing command, exit code: 101
[cargo-make] WARN - Build Failed.
1.12.0 deployed and working, thanks again all for the fix!
Hey @pat-s, I read the logs and this is interesting (I formatted the output a little to make it more readable):
`/usr/bin/rustc --crate-name sundog --edition=2018 \
api/sundog/src/main.rs --error-format=json --json=diagnostic-rendered-ansi,artifacts,future-incompat \
--crate-type bin --emit=dep-info,link -C opt-level=3 -C embed-bitcode=no -C debuginfo=2 \
-C metadata=f1dda1c0b15e0c90 -C extra-filename=-f1dda1c0b15e0c90 \
--out-dir /home/builder/.cache/x86_64-bottlerocket-linux-gnu/release/deps --target x86_64-bottlerocket-linux-gnu \
-C linker=/usr/bin/x86_64-bottlerocket-linux-gnu-gcc \
-L dependency=/home/builder/.cache/x86_64-bottlerocket-linux-gnu/release/deps \
-L dependency=/home/builder/.cache/release/deps \
--extern apiclient=/home/builder/.cache/x86_64-bottlerocket-linux-gnu/release/deps/libapiclient-cee067d451d7b166.rlib \
--extern constants=/home/builder/.cache/x86_64-bottlerocket-linux-gnu/release/deps/libconstants-37ede6ace73bb42f.rlib \
--extern datastore=/home/builder/.cache/x86_64-bottlerocket-linux-gnu/release/deps/libdatastore-dbb12debb9ff125d.rlib \
--extern http=/home/builder/.cache/x86_64-bottlerocket-linux-gnu/release/deps/libhttp-68066c2bc6086886.rlib \
--extern log=/home/builder/.cache/x86_64-bottlerocket-linux-gnu/release/deps/liblog-56fb0fccdab81be0.rlib \
--extern model=/home/builder/.cache/x86_64-bottlerocket-linux-gnu/release/deps/libmodel-0c570bc5f68d5a14.rlib \
--extern serde=/home/builder/.cache/x86_64-bottlerocket-linux-gnu/release/deps/libserde-0fe5639f2e2024f0.rlib \
--extern serde_json=/home/builder/.cache/x86_64-bottlerocket-linux-gnu/release/deps/libserde_json-43d9398d57cbcc08.rlib \
--extern simplelog=/home/builder/.cache/x86_64-bottlerocket-linux-gnu/release/deps/libsimplelog-192c7c07a4f82e7a.rlib \
--extern snafu=/home/builder/.cache/x86_64-bottlerocket-linux-gnu/release/deps/libsnafu-8194cd9e9fb48158.rlib \
--extern tokio=/home/builder/.cache/x86_64-bottlerocket-linux-gnu/release/deps/libtokio-0c8ac197f1b09636.rlib \
-Cprefer-dynamic -Copt-level=3 -Cdebuginfo=2 -Ccodegen-units=1 -Clink-arg=-Wl,-z,relro,-z,now \
-L native=/home/builder/.cache/x86_64-bottlerocket-linux-gnu/release/build/ring-e73751118ec72435/out` (signal: 9, SIGKILL: kill)
The process was killed, probably because of limits on the VM that Docker starts on macOS. @webern, I remember you had problems with Docker on macOS, did you ever experience something like this?
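As an aside, here is a small sketch for spotting this kind of failure in a long build log. It only matches the `(signal: N, NAME: ...)` suffix that appears in the log above; the regex and the function name `find_signal_kills` are assumptions based on that one line, not a documented cargo format. A SIGKILL in a container build is often the out-of-memory killer, but the log alone cannot confirm that.

```python
import re

# Matches suffixes like "(signal: 9, SIGKILL: kill)" as seen in the log above.
SIGNAL_RE = re.compile(r"\(signal: (\d+), (SIG[A-Z0-9]+): [^)]*\)")


def find_signal_kills(log_text: str) -> list:
    """Return (signal_number, signal_name) for each signal-terminated process."""
    return [(int(num), name) for num, name in SIGNAL_RE.findall(log_text)]


if __name__ == "__main__":
    sample = (
        "#21 236.6 error: could not compile `sundog`\n"
        "#21 236.6   process didn't exit successfully: `/usr/bin/rustc ...` "
        "(signal: 9, SIGKILL: kill)\n"
        "#21 236.6 warning: build failed, waiting for other jobs to finish...\n"
    )
    for num, name in find_signal_kills(sample):
        print(f"process terminated by {name} ({num})")
```

If this reports SIGKILL, raising the memory available to the Docker VM is a reasonable first thing to try.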
> The process was killed, probably because of limits on the VM that Docker starts on macOS. @webern, I remember you had problems with Docker on macOS, did you ever experience something like this?
Let's move this to an issue dedicated to building Bottlerocket on macOS. I'll create it.
> 1.12.0 deployed and working, thanks again all for the fix!
Glad to hear! Closing this as done
Hmm, in EKS 1.28 with Bottlerocket 1.15.1 it seems to be broken again: `FATAL: Module lustre not found in directory /lib/modules/6.1.49`
It's currently a known limitation of variants utilizing the 6.1 kernel series. Please see https://github.com/bottlerocket-os/bottlerocket/issues/3459 for the issue tracking Lustre support for those.
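To check a given kernel's module tree for the Lustre client before deploying, something like the sketch below could help. It is only an illustration: `find_module` is a hypothetical helper, the file-name matching (`lustre.ko`, optionally with a compression suffix) is an assumption about how modules are laid out, and it does not replace `modprobe` on the node.

```python
from pathlib import Path


def find_module(modules_root: str, name: str) -> list:
    """Return paths under modules_root whose file name is <name>.ko[.suffix].

    Dashes and underscores are treated as equivalent in module names.
    """
    wanted = name.replace("-", "_")
    hits = []
    for path in Path(modules_root).rglob("*.ko*"):
        stem = path.name.split(".ko")[0].replace("-", "_")
        if stem == wanted:
            hits.append(path)
    return sorted(hits)


if __name__ == "__main__":
    # Directory taken from the error message above; adjust for your kernel.
    root = "/lib/modules/6.1.49"
    found = find_module(root, "lustre")
    if found:
        print("lustre module found:", found[0])
    else:
        print(f"no lustre module under {root}")
```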
Image I'm using: v1.24.6-eks-4360b32
Arch: x86_64
Kubernetes: v1.24
aws-fsx-csi-driver: v0.9.0
What I expected to happen: To be able to mount an FSx for Lustre filesystem.
What actually happened: Both the pod and the FSx CSI driver reported the errors below when the pod tried to mount the filesystem.
How to reproduce the problem: Use Bottlerocket v1.24 with the FSx CSI driver and mount an FSx for Lustre filesystem.
However, Bottlerocket v1.23 works fine.