bottlerocket-os / bottlerocket

An operating system designed for hosting containers
https://bottlerocket.dev

Bottlerocket v1.24 is not able to mount fsx for lustre filesystem #2685

Closed wael-sadek closed 1 year ago

wael-sadek commented 1 year ago

Image I'm using: v1.24.6-eks-4360b32

Arch: x86_64 Kubernetes: v1.24 aws-fsx-csi-driver:v0.9.0

What I expected to happen: Be able to mount fsx for lustre filesystem.

What actually happened: Both the pod and the fsx csi driver reported the below errors when the pod was trying to mount the filesystem.

E1223 21:41:09.905407       1 driver.go:86] GRPC error: rpc error: code = Internal desc = Could not mount "fs-070a48fa26621de87.fsx.eu-central-1.amazonaws.com@tcp:/knkipbmv" at "/var/lib/kubelet/pods/ede9090b-0fc4-4cfc-9398-711fc923601f/volumes/kubernetes.io~csi/pvc-workbench-home-fsx/mount": mount failed: exit status 19
Mounting command: mount
Mounting arguments: -t lustre -o flock fs-070a48fa26621de87.fsx.eu-central-1.amazonaws.com@tcp:/knkipbmv /var/lib/kubelet/pods/ede9090b-0fc4-4cfc-9398-711fc923601f/volumes/kubernetes.io~csi/pvc-workbench-home-fsx/mount
Output: mount.lustre: mount fs-070a48fa26621de87.fsx.eu-central-1.amazonaws.com@tcp:/knkipbmv at /var/lib/kubelet/pods/ede9090b-0fc4-4cfc-9398-711fc923601f/volumes/kubernetes.io~csi/pvc-workbench-home-fsx/mount failed: No such device
Are the lustre modules loaded?
Check /etc/modprobe.conf and /proc/filesystems

How to reproduce the problem: Use bottlerocket v1.24 with fsx csi driver and mount an fsx for lustre filesystem.

However bottlerocket v1.23 works fine.
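
The mount.lustre hint above points at /proc/filesystems. As a rough sketch of that check (the function name `fs_registered` is made up for illustration, and it has to run somewhere that sees the host's /proc), something like this tells you whether the kernel currently has a given filesystem type registered:

```shell
# fs_registered NAME [FILE]
# Succeeds if filesystem type NAME is listed in FILE (default: /proc/filesystems).
# Each line there is "[nodev]<TAB><name>", so the name is always the last field.
fs_registered() {
    awk -v name="$1" '$NF == name { found = 1 } END { exit !found }' \
        "${2:-/proc/filesystems}"
}

# Example, run on the affected node:
#   fs_registered lustre && echo "lustre registered" || echo "lustre NOT registered"
```

A negative result only says the filesystem is not registered right now. It does not distinguish between a module that was never loaded and a module that does not exist for this kernel.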

jpmcb commented 1 year ago

Thanks for raising this issue - I'm not super familiar with this CSI driver, but I'm wondering if it's related to this issue in the upstream kubernetes SIG?

https://github.com/kubernetes-sigs/aws-fsx-csi-driver/issues/243

The error:

Are the lustre modules loaded?

Makes me think that the mount is attempted before the daemonset/modules are ready. Are you able to apply a taint that will wait for the daemonset to be ready?

I can also spin up a quick cluster and give this a try.

wael-sadek commented 1 year ago

Makes me think that the mount is attempted before the daemonset/modules are ready. Are you able to apply a taint that will wait for the daemonset to be ready?

The daemonset was running, that was checked.

jpmcb commented 1 year ago

I'm getting similar results on my test cluster running through the basic tutorial. I'm not sure if this is related to bottlerocket specifically, but we can get more insight from that team in a new issue I've created: https://github.com/kubernetes-sigs/aws-fsx-csi-driver/issues/289

arnaldo2792 commented 1 year ago

Hello @wael-sadek, for your v1.23 testing, did you use the same Bottlerocket version as with v1.24? You can check the Bottlerocket version by running apiclient get os from the control container. I'm asking because I want to pin down whether the problem is with a specific Bottlerocket version or with the v1.24 kubelet sources.

wael-sadek commented 1 year ago

@arnaldo2792

[ssm-user@control]$ apiclient get os
{
  "os": {
    "arch": "x86_64",
    "build_id": "104f8e0f",
    "pretty_name": "Bottlerocket OS 1.11.1 (aws-k8s-1.23)",
    "variant_id": "aws-k8s-1.23",
    "version_id": "1.11.1"
  }
}
[ssm-user@control]$ apiclient get os
{
  "os": {
    "arch": "x86_64",
    "build_id": "104f8e0f",
    "pretty_name": "Bottlerocket OS 1.11.1 (aws-k8s-1.24)",
    "variant_id": "aws-k8s-1.24",
    "version_id": "1.11.1"
  }
}

So it's using the same Bottlerocket version. I wonder why it works with v1.23 but not with v1.24? Could it be some customization done along with the update?

jpmcb commented 1 year ago

I built new 1.24 variants locally and deployed those into a cluster:

❯ kubectl get nodes
NAME                                            STATUS   ROLES    AGE   VERSION
ip-192-168-146-131.us-west-2.compute.internal   Ready    <none>   76m   v1.24.9-eks-4f83af2
ip-192-168-5-150.us-west-2.compute.internal     Ready    <none>   76m   v1.24.9-eks-4f83af2

Then, I went through the sample application in the tutorial and was able to successfully deploy the filesystem:

❯ kubectl exec -ti fsx-app -- df -h
Filesystem                    Size  Used Avail Use% Mounted on
overlay                        20G  3.1G   16G  17% /
tmpfs                          64M     0   64M   0% /dev
tmpfs                         3.8G     0  3.8G   0% /sys/fs/cgroup
192.168.93.193@tcp:/i3qsxbev  1.1T  7.8M  1.1T   1% /data
/dev/nvme1n1p1                 20G  3.1G   16G  17% /etc/hosts
shm                            64M     0   64M   0% /dev/shm
tmpfs                         6.9G   12K  6.9G   1% /run/secrets/kubernetes.io/serviceaccount
tmpfs                         3.8G     0  3.8G   0% /proc/acpi
tmpfs                         3.8G     0  3.8G   0% /proc/scsi
tmpfs                         3.8G     0  3.8G   0% /sys/firmware

❯ kubectl exec -it fsx-app -- ls /data
out.txt

We uncovered that this was related to configs that were not set in the 5.15 kernel which were corrected in a recent patch: https://github.com/bottlerocket-os/bottlerocket/pull/2569

You'll notice in that diff:

# update lustrefs client,aarch64_aws,aarch64_metal,x86_64_aws,x86_64_metal
+LUSTREFSX_FS m,x,x,x,x
 LUSTREFSX_LIBCFS n -> m,x,x,x,x
 LUSTREFSX_LNET n -> m,x,x,x,x
+LUSTREFSX_LNET_SELFTEST m,x,x,x,x
+LUSTREFSX_LNET_XPRT_IB n,x,x,x,x
+LUSTRE_DEBUG_EXPENSIVE_CHECK n,x,x,x,x

So, this should be fixed in the next version of Bottlerocket thanks to changes in the kernel above! Cc @markusboehme Feel free to let us know if you need anything else on this!!
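
For anyone who wants to check a kernel config against the options in the diff above, here is a minimal sketch. The function name `lustre_modules_enabled` is hypothetical, and it assumes the usual kconfig file format, where an enabled module shows up as `CONFIG_<NAME>=m` and a disabled option as `# CONFIG_<NAME> is not set`:

```shell
# lustre_modules_enabled CONFIG_FILE
# Succeeds only if the options that the diff sets to "m" are built as modules
# in the given kernel config file. Reports the first missing one on stderr.
lustre_modules_enabled() {
    for opt in LUSTREFSX_FS LUSTREFSX_LIBCFS LUSTREFSX_LNET; do
        grep -q "^CONFIG_${opt}=m\$" "$1" || {
            echo "CONFIG_${opt} is not built as a module" >&2
            return 1
        }
    done
    return 0
}

# Example (assumes the running kernel exposes its config at /proc/config.gz,
# which is not guaranteed):
#   zcat /proc/config.gz > /tmp/kconfig && lustre_modules_enabled /tmp/kconfig
```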

jpmcb commented 1 year ago

Re-opening just so we can track this going into our v1.12.0 🚀

wael-sadek commented 1 year ago

Thanks @jpmcb, do you know when the fixed AMI will be available on AWS?

pat-s commented 1 year ago

An intermediate bugfix release would be appreciated if 1.12.0 still takes some time. We're currently seeing some issues with autoscaling that could potentially originate from the node versions being out of sync with the control plane (due to the FSx issue reported here).

Is there any rough date for a new release, be it a feature or a patch release?

jpmcb commented 1 year ago

Thanks for checking in on this - We are targeting the end of January (very soon!) for the 1.12.0 release and we already have the 1.12.x branch cut.

If you're interested in building an AMI today for testing, you can check out that branch and use cargo make -e BUILDSYS_VARIANT=aws-k8s-1.24 to build the image locally, then cargo make -e BUILDSYS_VARIANT=aws-k8s-1.24 ami to register it as an AMI. More details are in our BUILDING.md.
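
Spelled out as a script, the two steps above might look roughly like this. This is only a sketch: the `run` and `build_and_register_ami` helpers and the `DRY_RUN` switch are made up for illustration, it assumes you are inside a checkout of the bottlerocket repo, and the ami step needs AWS credentials and region settings as described in BUILDING.md. With `DRY_RUN=1` (the default here) it only prints what it would execute:

```shell
VARIANT="${VARIANT:-aws-k8s-1.24}"
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '%s\n' "+ $*"
    else
        "$@"
    fi
}

build_and_register_ami() {
    run git checkout 1.12.x                                  # release branch mentioned above
    run cargo make -e "BUILDSYS_VARIANT=${VARIANT}"          # build the image locally
    run cargo make -e "BUILDSYS_VARIANT=${VARIANT}" ami      # register it as an AMI
}

# Example:
#   build_and_register_ami             # show the commands
#   DRY_RUN=0 build_and_register_ami   # actually run them
```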

pat-s commented 1 year ago

Thanks for the update @jpmcb!

I tried building locally over the last few days, but the build failed at the end when compiling os. The build environment was an M1 Pro Mac with Docker 20.10.22 and Rust 1.66.1, invoked via cargo make -e BUILDSYS_ARCH=x86_64.

Probably not worth troubleshooting if 1.12.0 is already on the way. Looking forward to it!

arnaldo2792 commented 1 year ago

Hi @pat-s , could you please share the error you are getting while building Bottlerocket? (I don't have a Mac, so I'm not able to reproduce :sweat_smile: )

pat-s commented 1 year ago

Sure, here's the log. Not sure where the error comes from, AFAICS everything should be built in docker and hence machine-agnostic? Not sure though where the local rust part comes in...

  #21 236.6 error: could not compile `sundog`
  #21 236.6 
  #21 236.6 Caused by:
  #21 236.6   process didn't exit successfully: `/usr/bin/rustc --crate-name sundog --edition=2018 api/sundog/src/main.rs --error-format=json --json=diagnostic-rendered-ansi,artifacts,future-incompat --crate-type bin --emit=dep-info,link -C opt-level=3 -C embed-bitcode=no -C debuginfo=2 -C metadata=f1dda1c0b15e0c90 -C extra-filename=-f1dda1c0b15e0c90 --out-dir /home/builder/.cache/x86_64-bottlerocket-linux-gnu/release/deps --target x86_64-bottlerocket-linux-gnu -C linker=/usr/bin/x86_64-bottlerocket-linux-gnu-gcc -L dependency=/home/builder/.cache/x86_64-bottlerocket-linux-gnu/release/deps -L dependency=/home/builder/.cache/release/deps --extern apiclient=/home/builder/.cache/x86_64-bottlerocket-linux-gnu/release/deps/libapiclient-cee067d451d7b166.rlib --extern constants=/home/builder/.cache/x86_64-bottlerocket-linux-gnu/release/deps/libconstants-37ede6ace73bb42f.rlib --extern datastore=/home/builder/.cache/x86_64-bottlerocket-linux-gnu/release/deps/libdatastore-dbb12debb9ff125d.rlib --extern http=/home/builder/.cache/x86_64-bottlerocket-linux-gnu/release/deps/libhttp-68066c2bc6086886.rlib --extern log=/home/builder/.cache/x86_64-bottlerocket-linux-gnu/release/deps/liblog-56fb0fccdab81be0.rlib --extern model=/home/builder/.cache/x86_64-bottlerocket-linux-gnu/release/deps/libmodel-0c570bc5f68d5a14.rlib --extern serde=/home/builder/.cache/x86_64-bottlerocket-linux-gnu/release/deps/libserde-0fe5639f2e2024f0.rlib --extern serde_json=/home/builder/.cache/x86_64-bottlerocket-linux-gnu/release/deps/libserde_json-43d9398d57cbcc08.rlib --extern simplelog=/home/builder/.cache/x86_64-bottlerocket-linux-gnu/release/deps/libsimplelog-192c7c07a4f82e7a.rlib --extern snafu=/home/builder/.cache/x86_64-bottlerocket-linux-gnu/release/deps/libsnafu-8194cd9e9fb48158.rlib --extern tokio=/home/builder/.cache/x86_64-bottlerocket-linux-gnu/release/deps/libtokio-0c8ac197f1b09636.rlib -Cprefer-dynamic -Copt-level=3 -Cdebuginfo=2 -Ccodegen-units=1 -Clink-arg=-Wl,-z,relro,-z,now -L 
native=/home/builder/.cache/x86_64-bottlerocket-linux-gnu/release/build/ring-e73751118ec72435/out` (signal: 9, SIGKILL: kill)
  #21 236.6 warning: build failed, waiting for other jobs to finish...
  #21 372.4 error: Bad exit status from /var/tmp/rpm-tmp.t16z5j (%build)
  #21 372.4 
  #21 372.4 RPM build errors:
  #21 372.4     Bad exit status from /var/tmp/rpm-tmp.t16z5j (%build)
  #21 ERROR: executor failed running [/bin/sh -c rpmbuild -ba --clean       --undefine _auto_set_build_flags       rpmbuild/SPECS/${PACKAGE}.spec]: exit code: 1
  ------
   > [rpmbuild 7/7] RUN --mount=source=.cargo,target=/home/builder/.cargo     --mount=type=cache,target=/home/builder/.cache,from=cache,source=/cache     --mount=type=cache,target=/home/builder/rpmbuild/BUILD/sources/models/src/variant,from=variantcache,source=/variantcache     --mount=type=cache,target=/home/builder/rpmbuild/BUILD/sources/logdog/conf/current,from=variantcache,source=/variantcache     --mount=source=sources,target=/home/builder/rpmbuild/BUILD/sources     rpmbuild -ba --clean       --undefine _auto_set_build_flags       rpmbuild/SPECS/os.spec:
  ------
  executor failed running [/bin/sh -c rpmbuild -ba --clean       --undefine _auto_set_build_flags       rpmbuild/SPECS/${PACKAGE}.spec]: exit code: 1

  --- stderr
  BuildAttempt: Failed to execute command: 'docker build . --network none --target package --tag buildsys-pkg-os-x86_64-d4ae984f528b --build-arg PACKAGE=os --build-arg ARCH=x86_64 --build-arg GOARCH=amd64 --build-arg VARIANT=aws-k8s-1.24 --build-arg VARIANT_PLATFORM=aws --build-arg VARIANT_RUNTIME=k8s --build-arg VARIANT_FAMILY=aws-k8s --build-arg VARIANT_FLAVOR= --build-arg REPO=default --build-arg SDK=public.ecr.aws/bottlerocket/bottlerocket-sdk-x86_64:v0.29.0 --build-arg TOOLCHAIN=public.ecr.aws/bottlerocket/bottlerocket-toolchain-x86_64:v0.29.0 --build-arg NOCACHE=3534030891 --build-arg TOKEN=d4ae984f528b'
[cargo-make] ERROR - Error while executing command, exit code: 101
[cargo-make] WARN - Build Failed.

pat-s commented 1 year ago

1.12.0 deployed and working, thanks again all for the fix!

arnaldo2792 commented 1 year ago

Hey @pat-s, I read the logs and this is interesting (I reformatted the output a little to make it more readable):

`/usr/bin/rustc --crate-name sundog --edition=2018 \
api/sundog/src/main.rs --error-format=json --json=diagnostic-rendered-ansi,artifacts,future-incompat \
  --crate-type bin --emit=dep-info,link -C opt-level=3 -C embed-bitcode=no -C debuginfo=2 \
  -C metadata=f1dda1c0b15e0c90 -C extra-filename=-f1dda1c0b15e0c90 \
  --out-dir /home/builder/.cache/x86_64-bottlerocket-linux-gnu/release/deps --target x86_64-bottlerocket-linux-gnu \
  -C linker=/usr/bin/x86_64-bottlerocket-linux-gnu-gcc \
  -L dependency=/home/builder/.cache/x86_64-bottlerocket-linux-gnu/release/deps \
  -L dependency=/home/builder/.cache/release/deps \
  --extern apiclient=/home/builder/.cache/x86_64-bottlerocket-linux-gnu/release/deps/libapiclient-cee067d451d7b166.rlib \
  --extern constants=/home/builder/.cache/x86_64-bottlerocket-linux-gnu/release/deps/libconstants-37ede6ace73bb42f.rlib \
  --extern datastore=/home/builder/.cache/x86_64-bottlerocket-linux-gnu/release/deps/libdatastore-dbb12debb9ff125d.rlib \
  --extern http=/home/builder/.cache/x86_64-bottlerocket-linux-gnu/release/deps/libhttp-68066c2bc6086886.rlib \
  --extern log=/home/builder/.cache/x86_64-bottlerocket-linux-gnu/release/deps/liblog-56fb0fccdab81be0.rlib \
  --extern model=/home/builder/.cache/x86_64-bottlerocket-linux-gnu/release/deps/libmodel-0c570bc5f68d5a14.rlib \
  --extern serde=/home/builder/.cache/x86_64-bottlerocket-linux-gnu/release/deps/libserde-0fe5639f2e2024f0.rlib \
  --extern serde_json=/home/builder/.cache/x86_64-bottlerocket-linux-gnu/release/deps/libserde_json-43d9398d57cbcc08.rlib \
  --extern simplelog=/home/builder/.cache/x86_64-bottlerocket-linux-gnu/release/deps/libsimplelog-192c7c07a4f82e7a.rlib \
  --extern snafu=/home/builder/.cache/x86_64-bottlerocket-linux-gnu/release/deps/libsnafu-8194cd9e9fb48158.rlib \
  --extern tokio=/home/builder/.cache/x86_64-bottlerocket-linux-gnu/release/deps/libtokio-0c8ac197f1b09636.rlib \
  -Cprefer-dynamic -Copt-level=3 -Cdebuginfo=2 -Ccodegen-units=1 -Clink-arg=-Wl,-z,relro,-z,now \
  -L native=/home/builder/.cache/x86_64-bottlerocket-linux-gnu/release/build/ring-e73751118ec72435/out` (signal: 9, SIGKILL: kill)

The process was killed, probably because of limits on the VM that Docker starts on macOS. @webern, I remember you had problems with Docker on macOS; did you ever experience something like this?
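
If the SIGKILL really does come from a memory limit on the Docker VM (which is only a guess at this point), one quick sanity check is to compare the memory Docker reports against what the build needs. A minimal sketch follows. The helper name `enough_memory` and the 8 GiB threshold in the example are arbitrary placeholders, not a documented Bottlerocket requirement:

```shell
# enough_memory BYTES MIN_GIB
# Succeeds if BYTES is at least MIN_GIB gibibytes.
enough_memory() {
    min_bytes=$(( $2 * 1024 * 1024 * 1024 ))
    [ "$1" -ge "$min_bytes" ]
}

# Example: `docker info` reports the memory visible to the daemon in bytes,
# which on macOS is the memory assigned to the Docker VM.
#   enough_memory "$(docker info --format '{{.MemTotal}}')" 8 \
#       || echo "Docker VM has less than 8 GiB; rustc may get OOM-killed"
```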

webern commented 1 year ago

The process was killed, probably because of limits on the VM that Docker starts on macOS. @webern, I remember you had problems with Docker on macOS; did you ever experience something like this?

Let's move this to an issue dedicated to building Bottlerocket on macOS. I'll create it.

jpmcb commented 1 year ago

1.12.0 deployed and working, thanks again all for the fix!

Glad to hear! Closing this as done

autarchprinceps commented 1 year ago

Hmm, on EKS 1.28 with Bottlerocket 1.15.1 it seems to be broken again: FATAL: Module lustre not found in directory /lib/modules/6.1.49

markusboehme commented 1 year ago

It's currently a known limitation of variants utilizing the 6.1 kernel series. Please see https://github.com/bottlerocket-os/bottlerocket/issues/3459 for the issue tracking Lustre support for those.
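
To check whether a particular node is affected, a rough sketch like the following looks for the module file under the running kernel's module directory. The helper name `module_file_present` is made up, and it assumes module files are named `<module>.ko`, optionally with a compression suffix:

```shell
# module_file_present NAME [DIR]
# Succeeds if a file named NAME.ko (optionally compressed, e.g. NAME.ko.xz)
# exists anywhere under DIR (default: /lib/modules/<running kernel>).
module_file_present() {
    dir="${2:-/lib/modules/$(uname -r)}"
    [ -n "$(find "$dir" -type f -name "$1.ko*" 2>/dev/null | head -n 1)" ]
}

# Example, run on the node:
#   module_file_present lustre || echo "no lustre module shipped for $(uname -r)"
```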