[AWS] Creating Clear Linux based AMI followed by Segfaults on new instance

m1keil commented 3 years ago

I'm chasing a weird issue in which Clear Linux based AMIs that I create end up with segfaults during instance startup. These segfaults prevent services such as ucd from running. I can't provide a core dump either, as it looks like systemd-coredump segfaults as well.

Reproducing

Start with ami-07d9f0e2130e032fe (all of this was done in the ap-southeast-2 region) and launch a t3 family instance (I used t3.medium).
Once the instance is up, create an image from it.
Launch a new t2 family instance (I used t2.medium) from the new AMI you created in the previous step.
Connect and inspect the journal:

kernel: ucd[159]: segfault at fffffffffffffffd ip 00007f9ee6334424 sp 00007fff2a191600 error 5 in libglib-2.0.so.0.7000.0[7f9ee62b3000+e1000]
kernel: Code: 48 89 44 24 28 48 89 05 d2 da 0e 00 48 3d ff 01 00 00 0f 86 96 04 00 00 c4 e2 f8 f3 4c 24 28 48 89 44 24 30 0f 85 9c 04 00 00 <62> 61 fd 28 6f 3d 52 d0 0e 00 48 8d 3d 96 77 07 00 62 61 fe 28 7f

systemd[1]: ucd.service: Main process exited, code=dumped, status=11/SEGV
systemd[1]: ucd.service: Failed with result 'core-dump'.
systemd[1]: Failed to start micro-config-drive job.

kernel: systemd-coredum[174]: segfault at 1 ip 00007f7b2c21e825 sp 00007ffc98715820 error 4 in libzstd.so.1.5.0[7f7b2c213000+108000]

kernel: Code: 6d 28 77 71 62 01 05 00 ef ff 62 61 7f 28 7f 7d 01 62 61 7f 28 7f 78 01 48 85 f6 74 0d 4d 85 e4 75 7a 48 89 f7 e8 0b 48 ff ff <62> 01 05 00 ef ff 62 61 7f 28 7f bb a8 0d 00 00 62 61 fd 08 7e bb

If you try to execute ucd directly:

clear@clr-ec2155ad384ffde7c446c5ce193854ff~ $ /usr/bin/ucd
Segmentation fault (core dumped)

This doesn't seem to happen on all instance types. I managed to run an m5 family instance from the image without issues, but it happens again when starting a t3a family one:

kernel: traps: ucd[153] trap invalid opcode ip:7fee43bb7424 sp:7ffde964d740 error:0 in libglib-2.0.so.0.7000.0[7fee43b36000+e1000]

systemd[1]: ucd.service: Main process exited, code=dumped, status=4/ILL
systemd[1]: ucd.service: Failed with result 'core-dump'.
systemd[1]: Failed to start micro-config-drive job.
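To compare the t2 and t3a faults, it can help to convert each instruction pointer into an offset inside the libglib mapping. The following is a minimal sketch, not part of the original report; the function name `fault_offset` and the regular expression are illustrative assumptions, and the two input lines are copied from the journal output above.

```python
import re

# Matches both kernel formats seen above:
#   "segfault at ... ip <addr> ... in <lib>[<base>+<size>]"
#   "trap invalid opcode ip:<addr> ... in <lib>[<base>+<size>]"
FAULT_RE = re.compile(
    r"ip[ :](?P<ip>[0-9a-f]+).*? in (?P<lib>\S+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)

def fault_offset(line):
    """Return (library, offset of the faulting instruction inside its mapping)."""
    m = FAULT_RE.search(line)
    if m is None:
        raise ValueError("not a recognised fault line")
    ip = int(m.group("ip"), 16)
    base = int(m.group("base"), 16)
    size = int(m.group("size"), 16)
    if not base <= ip < base + size:
        raise ValueError("ip lies outside the reported mapping")
    return m.group("lib"), ip - base

# Journal lines from the t2 (SIGSEGV) and t3a (SIGILL) instances.
SEGV_T2 = ("kernel: ucd[159]: segfault at fffffffffffffffd ip 00007f9ee6334424 "
           "sp 00007fff2a191600 error 5 in libglib-2.0.so.0.7000.0[7f9ee62b3000+e1000]")
ILL_T3A = ("kernel: traps: ucd[153] trap invalid opcode ip:7fee43bb7424 "
           "sp:7ffde964d740 error:0 in libglib-2.0.so.0.7000.0[7fee43b36000+e1000]")

for line in (SEGV_T2, ILL_T3A):
    lib, offset = fault_offset(line)
    print(lib, hex(offset))  # both lines give offset 0x81424
```

Both faults land at the same offset in the same library, which suggests the two instance types are tripping over the same instruction even though one reports SIGSEGV and the other SIGILL.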

Workaround

If I create the AMI from a t2 family instance, it seems to start without issues on the t3 or t3a families.

Is this use case even supported? I didn't find much info online about creating new AMIs based on Clear Linux. And UCD docs don't explain how to "prepare" an instance before making a new AMI (like the need to remove /var/lib/cloud/aws-user-data)

fenrus75 commented 3 years ago

interesting .. which release number are you using for this? we're currently changing a bit how avx2/avx512 support is done, and it could be that you lock that in at image create time (say on an AVX512 capable machine) but then run it on an AVX2-only machine ?
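One way to test this hypothesis is to compare the CPU flags of the instance that built the image with those of the instance that fails. The sketch below is an illustration only: the helper name `simd_levels` is made up here, and it assumes the usual x86 Linux flag names `avx2` and `avx512f` in /proc/cpuinfo.

```python
def simd_levels(cpuinfo_text):
    """Report whether the first 'flags' line of /proc/cpuinfo text lists AVX2 and AVX-512F."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            flags = set(value.split())
            return {"avx2": "avx2" in flags, "avx512f": "avx512f" in flags}
    raise ValueError("no flags line found")

# Hypothetical, heavily shortened cpuinfo excerpts (not taken from the instances in this issue).
WITH_AVX512 = "processor : 0\nflags : fpu sse2 avx avx2 avx512f avx512bw\n"
AVX2_ONLY = "processor : 0\nflags : fpu sse2 avx avx2\n"

print(simd_levels(WITH_AVX512))
print(simd_levels(AVX2_ONLY))
```

On a real instance the input would be the contents of /proc/cpuinfo, for example `simd_levels(open("/proc/cpuinfo").read())`, run once on the build instance and once on the failing one.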

m1keil commented 3 years ago

AMI:

September 3, 2021 at 5:46:35 AM UTC+10
clear-35000-aws-4d505f99-1c49-45c1-b574-a59747ec19d6

Clear Linux

$ cat /etc/os-release
NAME="Clear Linux OS"
VERSION=1
ID=clear-linux-os
ID_LIKE=clear-linux-os
VERSION_ID=35000
PRETTY_NAME="Clear Linux OS"
ANSI_COLOR="1;35"
HOME_URL="https://clearlinux.org"
SUPPORT_URL="https://clearlinux.org"
BUG_REPORT_URL="mailto:dev@lists.clearlinux.org"
PRIVACY_POLICY_URL="http://www.intel.com/privacy"
BUILD_ID=35000

> interesting .. which release number are you using for this? we're currently changing a bit how avx2/avx512 support is done, and it could be that you lock that in at image create time (say on an AVX512 capable machine) but then run it on an AVX2-only machine ?

Yeah, if this happens during the first boot and never gets re-evaluated, that could be it. If there's any way I can instruct it to re-check the available instructions, that would be a quick test to see if this is the case.

fenrus75 commented 3 years ago

on that version, you can just delete the /usr/lib64/haswell/avx512_1 and /usr/bin/haswell directories from the image... then all the avx512 stuff is gone

(this is changing in future builds, but this at least lets us narrow it down)

m1keil commented 3 years ago

I didn't find /usr/bin/haswell, any chance you meant /usr/lib64/haswell ?

I tried removing /usr/lib64/haswell/avx512_1/, and then the entire /usr/lib64/haswell as a follow-up, but nothing changed.

I can indeed see that ld links ucd differently on t2 vs t3, but it doesn't look like removing the directories did the trick.
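The t2 vs t3 difference in library resolution can be compared by parsing `ldd /usr/bin/ucd` output from each instance. This is only a sketch: the helper names are invented for illustration, and the sample text below is hypothetical, written in the usual `ldd` layout rather than captured from the affected instances.

```python
def resolved_libs(ldd_output):
    """Map each soname in `ldd` output to the path the loader resolved it to."""
    libs = {}
    for line in ldd_output.splitlines():
        name, arrow, rest = line.strip().partition(" => ")
        if arrow and rest and not rest.startswith("not found"):
            libs[name] = rest.split(" (")[0]  # drop the trailing load address
    return libs

def under(libs, prefix):
    """Keep only the libraries whose resolved path starts with `prefix`."""
    return {name: path for name, path in libs.items() if path.startswith(prefix)}

# HYPOTHETICAL sample in the usual ldd format; paths and addresses are made up.
SAMPLE = """\
    linux-vdso.so.1 (0x00007ffd0a1f2000)
    libglib-2.0.so.0 => /usr/lib64/haswell/avx512_1/libglib-2.0.so.0 (0x00007f0000000000)
    libc.so.6 => /usr/lib64/haswell/libc.so.6 (0x00007f0000200000)
    /usr/lib64/ld-linux-x86-64.so.2 (0x00007f0000400000)
"""

print(under(resolved_libs(SAMPLE), "/usr/lib64/haswell/avx512_1/"))
```

Running the same filter on the real output from a t2 and a t3 instance would show which libraries, if any, still resolve to an AVX-512 variant after the directories are removed.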

thiagomacieira commented 3 years ago

kernel: ucd[159]: segfault at fffffffffffffffd ip 00007f9ee6334424 sp 00007fff2a191600 error 5 in libglib-2.0.so.0.7000.0[7f9ee62b3000+e1000]
kernel: Code: 48 89 44 24 28 48 89 05 d2 da 0e 00 48 3d ff 01 00 00 0f 86 96 04 00 00 c4 e2 f8 f3 4c 24 28 48 89 44 24 30 0f 85 9c 04 00 00 <62> 61 fd 28 6f 3d 52 d0 0e 00 48 8d 3d 96 77 07 00 62 61 fe 28 7f

The <62> highlighted there is an EVEX (AVX512) prefix. The kernel is also saying the faulting code is in libglib-2.0.so.0.
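The same check can be scripted for any `Code:` line in the journal. The sketch below is illustrative (the name `faulting_byte` is an assumption, not an existing tool): it pulls out the byte the kernel wraps in `<..>`, which is the byte at the faulting instruction pointer, and compares it with 0x62, the first byte of an EVEX-encoded instruction in 64-bit mode.

```python
EVEX_PREFIX = 0x62  # first byte of an EVEX-encoded (AVX-512) instruction in 64-bit mode

def faulting_byte(code_line):
    """Return the byte the kernel marked with <..> in a 'Code:' line, as an int."""
    _, _, hexdump = code_line.partition("Code:")
    for token in hexdump.split():
        if token.startswith("<") and token.endswith(">"):
            return int(token[1:-1], 16)
    raise ValueError("no <..> marker found")

# The libglib 'Code:' line quoted above.
GLIB_CODE = ("kernel: Code: 48 89 44 24 28 48 89 05 d2 da 0e 00 48 3d ff 01 00 00 "
             "0f 86 96 04 00 00 c4 e2 f8 f3 4c 24 28 48 89 44 24 30 0f 85 9c 04 00 00 "
             "<62> 61 fd 28 6f 3d 52 d0 0e 00 48 8d 3d 96 77 07 00 62 61 fe 28 7f")

print(faulting_byte(GLIB_CODE) == EVEX_PREFIX)  # True
```

The libzstd `Code:` line from the systemd-coredump crash also marks `<62>`, so the same test applies to it.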

fenrus75 commented 3 years ago

yeah the question is how that glib got there.. it's supposed to be in avx512_1 subdir only

m1keil commented 3 years ago

Happy to provide/test any other info, just let me know. I much appreciate the quick response as well.

fenrus75 commented 3 years ago

eh maybe random question but are you using an initrd?

m1keil commented 3 years ago

Ummm... I'm running the stock image on AWS without any major tweaks. I think it defaults to no initrd:

$ cat /proc/cmdline
BOOT_IMAGE=org.clearlinux.aws.5.13.19-298 root=PARTUUID=2c263f03-a510-4a0a-a3e4-24813b731eaa quiet console=tty0 console=ttyS0,115200n8 cryptomgr.notests init=/usr/bin/initra-aws initcall_debug no_timer_check noreplace-smp rcupdate.rcu_expedited=1 rootfstype=ext4 tsc=reliable rw
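A quick way to read that command line is to split it into key/value pairs and look for an `initrd=` entry. This is a minimal sketch with an illustrative helper name (`parse_cmdline`); note that it only shows whether the command line names an initrd, and a bootloader may hand one to the kernel without it appearing here.

```python
def parse_cmdline(cmdline):
    """Split a kernel command line into {key: value}; bare flags map to None.

    Repeated keys (such as console=) keep only their last value.
    """
    params = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        params[key] = value if sep else None
    return params

# The /proc/cmdline contents quoted above.
CMDLINE = ("BOOT_IMAGE=org.clearlinux.aws.5.13.19-298 "
           "root=PARTUUID=2c263f03-a510-4a0a-a3e4-24813b731eaa quiet console=tty0 "
           "console=ttyS0,115200n8 cryptomgr.notests init=/usr/bin/initra-aws "
           "initcall_debug no_timer_check noreplace-smp rcupdate.rcu_expedited=1 "
           "rootfstype=ext4 tsc=reliable rw")

params = parse_cmdline(CMDLINE)
print("initrd" in params)  # False
print(params["init"])      # /usr/bin/initra-aws
```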

clearlinux / distribution

[AWS] Creating Clear Linux based AMI followed by Segfaults on new instance #2449