flatcar / Flatcar

Flatcar project repository for issue tracking, project documentation, etc.
https://www.flatcar.org/
Apache License 2.0

amd64: arm64: hang during file descriptor close #1044

Open ader1990 opened 1 year ago

ader1990 commented 1 year ago

Description

First issue:

Baremetal Flatcar AMD64 hangs during ceph installation (containerized ceph installation using rook-ceph on Kubernetes). Debugging showed that this line hangs during the mon_osd_crush_smoke_test stage: https://github.com/ceph/ceph/blob/389ea1666327f508d01b9318718c70eec0ed6972/src/common/fork_function.h#L66

When mon_osd_crush_smoke_test = false was set, the issue did not reproduce. The same containerized environment on Ubuntu 22.04 did not reproduce the issue either. Same hardware, identical container images, same workflow -- which means the OS is most likely the source of the issue.

Second issue:

Virtual Flatcar ARM64 (qemu-kvm virtual machine) hangs during flatcar-install run. Debugging showed that the command udevadm settle hangs.

I think these two issues are related, because both look like a file descriptor close problem.

More investigation is required; I will add more information as it becomes available.

Did anyone encounter this behaviour?

Impact

TBD.

Environment and steps to reproduce

First issue:

During ceph installation on baremetal (Lenovo baremetal) + virtual (qemu kvm), amd64, the ceph osd goes to 100% CPU for no apparent reason. If we set mon_osd_crush_smoke_test = false, the issue no longer occurs.

After some debugging, this line hangs forever: https://github.com/ceph/ceph/blob/389ea1666327f508d01b9318718c70eec0ed6972/src/common/fork_function.h#L66 . The ceph osd container gets killed after a while because it never became ready, gets recreated, and then hits the same hang again. The 100% CPU usage is due to the code retrying the file descriptor close without any maximum retry count.

As a consequence, ceph cannot be deployed on Flatcar without the flag override from the global config:

    [global]
    mon_osd_crush_smoke_test = false

Second issue:

During installation of Flatcar on arm64 qemu kvm, using flatcar-install, which uses channel | version. After the flatcar-install bash script is started, the following line of the script hangs: udevadm settle.
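To tell a real hang apart from a slow run, the command can be wrapped in a hard timeout. The following is only a minimal sketch: `run_with_timeout` is a hypothetical helper name, and the time budget is an arbitrary choice, not something taken from flatcar-install.

```python
import subprocess


def run_with_timeout(cmd, timeout_s):
    """Run cmd and return (finished, returncode).

    finished is False (and returncode is None) when the command did not
    exit within timeout_s seconds and was killed by subprocess.run.
    """
    try:
        proc = subprocess.run(cmd, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child and waits for it before raising.
        return False, None
    return True, proc.returncode
```

For example, `run_with_timeout(["udevadm", "settle"], 30)` would return `(False, None)` if udevadm settle is still running after 30 seconds, assuming udevadm is on the PATH.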

Expected behavior

Neither the ceph installation nor udevadm settle should hang.

Additional information

There were no useful logs in the system logs or dmesg, and none of the hung task error messages usually thrown by the kernel. I tried ALL available channels and versions of Flatcar; it looks to be a generic issue for all Flatcar images.

pothos commented 1 year ago

These are two separate things, or? One is udevadm settle hanging, one is the ceph container? From where do you run flatcar-install - from a PXE booted image, i.e., from RAM and where do you install to? The ceph container issue could be related to the kernel or containerd version, what release did you use?

ader1990 commented 1 year ago

Further investigation is required to see whether udevadm settle is actually hanging during an fd close too (I'll need to manually build a debug version to see exactly where it hangs).

I presented these two issues in the same bug report because they seem to have the same cause. I will split them up if it is proven that they do not have the same cause.

flatcar-install was run manually from an ARM64 KVM VM, running the image provided by https://alpha.release.flatcar-linux.net/arm64-usr/current/flatcar_production_qemu_uefi_image.img.bz2

The ceph issue is definitely not a containerd issue, as we used the same CAPI image binaries for both Flatcar and Ubuntu (courtesy of the capi image builder project - https://github.com/kubernetes-sigs/image-builder). It might well be a kernel issue; we first need to find a way to reproduce it outside of the whole ceph code.

These two issues are not blockers (as they can be avoided with minimal code changes), but they may be caused by an underlying issue that might produce non-reproducible side effects or downright bugs.
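One possible starting point for a reproducer outside of ceph is sketched below. It assumes, without this being confirmed anywhere in this report, that the linked line sits in a loop that closes every possible file descriptor in a forked child, so that the run time depends on the descriptor limit seen inside the container. The helper name `time_close_loop` and the 100000 cap are hypothetical choices made for this sketch.

```python
import os
import resource
import time


def time_close_loop(max_fd):
    """Fork a child that calls close() on every fd in [3, max_fd].

    Returns the wall-clock seconds until the child has exited. Most of the
    descriptors are not open, so close() simply fails with EBADF for them.
    """
    start = time.monotonic()
    pid = os.fork()
    if pid == 0:
        # Child: per-descriptor close loop (assumed pattern, see lead-in).
        for fd in range(3, max_fd + 1):
            try:
                os.close(fd)
            except OSError:
                pass
        os._exit(0)
    os.waitpid(pid, 0)
    return time.monotonic() - start


if __name__ == "__main__":
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    open_max = os.sysconf("SC_OPEN_MAX")
    print(f"RLIMIT_NOFILE soft={soft} hard={hard} SC_OPEN_MAX={open_max}")
    # Cap the sample so this script cannot itself run for hours
    # (hypothetical cap; raise it to probe larger ranges).
    sample = min(open_max, 100_000) if open_max > 0 else 16384
    elapsed = time_close_loop(sample)
    print(f"closing fds 3..{sample} took {elapsed:.3f}s")
```

Running it inside the same container image on Flatcar and on Ubuntu and comparing the printed limits and timings might show whether the two hosts differ in this respect; if they do not, the assumption above is probably wrong.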

Thank you, Adrian Vladu.

pothos commented 1 year ago

The flatcar-install bash regression got addressed in https://github.com/flatcar/Flatcar/issues/1059