cloux / aws-devuan

systemd-free GNU/Linux for AWS Cloud Environment
Do What The F*ck You Want To Public License

Kernel support for AMD? #8

Closed jmattsson closed 4 years ago

jmattsson commented 4 years ago

Hello,

I tried to use this AMI in a t3a instance, but discovered that was Not A Good Idea(tm). I quickly encountered hard lockups, and the system log showed me the kernel had "oopsed", and also that it doesn't seem to have AMD support built-in:

[    0.000000] Linux version 5.3.11 (root@cloux.org) (gcc version 8.3.0 (Debian 8.3.0-6)) #1 SMP Wed Nov 13 02:00:34 CET 2019
[    0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-5.3.11 root=PARTUUID=505e8893-01 ro rootfstype=ext4 console=hvc0 console=ttyS0,115200
[    0.000000] KERNEL supported cpus:
[    0.000000]   Intel GenuineIntel
[    0.000000]   zhaoxin   Shanghai  
[    0.000000] CPU: vendor_id 'AuthenticAMD' unknown, using generic init.
[    0.000000] CPU: Your system may be unstable.
[  219.411682] BUG: unable to handle page fault for address: ffffffffb4054df2
[  219.413418] #PF: supervisor write access in kernel mode
[  219.414694] #PF: error_code(0x0003) - permissions violation
[  219.416054] PGD 1220e067 P4D 1220e067 PUD 1220f063 PMD 110001e1 
[  219.417530] Oops: 0003 [#1] SMP PTI
...

May I suggest either adding in AMD support, or mentioning in the readme that AMD is a no-go? Thanks!

cloux commented 4 years ago

TLDR: this is a serious issue, and a fixed AMI release will be available within a few hours. I will post more details here after it's fixed.

cloux commented 4 years ago

AMD support is now built in. The AMI with the fixed kernel 5.3.12 is Devuan Runit 2019-11-23 (Unofficial). I tested it on t3a.micro (AMD EPYC 7571), and it boots fine.

Info:

Until recently, everything on AWS was based on Intel. That allowed me to optimize the kernel for the Intel platform, with the risk that if Amazon decided to offer AMD-based instance types, the image wouldn't boot on them. And that's exactly what has happened now. I am sorry I let you run into a wall here. The fixed kernel is generic and runs on both Intel and AMD. Note that some more exotic CPU vendors are still disabled: Hygon, Centaur, Zhaoxin.
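In the mainline kernel, per-vendor CPU support like this is controlled by the CONFIG_CPU_SUP_* build options. A minimal sketch for checking a kernel config before picking an instance type follows. It assumes the config is installed as /boot/config-$(uname -r), which is a common Debian convention but not something confirmed for this AMI's custom kernel. The helper names are hypothetical.

```shell
#!/bin/sh
# Hypothetical helpers, not part of the aws-devuan repo.

# cpu_vendor_supported CONFIG_FILE VENDOR
# Succeeds if CONFIG_CPU_SUP_<VENDOR>=y is set in the given kernel config.
# VENDOR is e.g. INTEL, AMD, HYGON, CENTAUR or ZHAOXIN.
cpu_vendor_supported () {
    grep -q "^CONFIG_CPU_SUP_$2=y" "$1"
}

# cpu_vendor_id [CPUINFO_FILE]
# Prints the vendor string of the first CPU, e.g. GenuineIntel or AuthenticAMD.
cpu_vendor_id () {
    awk -F': *' '/^vendor_id/ { print $2; exit }' "${1:-/proc/cpuinfo}"
}

# Usage (assumes the config file exists at this path):
#   cpu_vendor_id
#   cpu_vendor_supported "/boot/config-$(uname -r)" AMD || echo "no AMD support built in"
```

On a kernel like the 5.3.11 one in the log above, the AMD check would be expected to fail, matching the "vendor_id 'AuthenticAMD' unknown" message.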

The main goal of this project is to offer a stable base OS on EC2. I expect it to work well on all instance types. Stability and compatibility are more important than speed. Thank you for reporting this issue!

jmattsson commented 4 years ago

Thank you for maintaining this AMI - it's kind of my go-to at this point. Nice and minimal without much cruft to be removed :)

Did you try doing something a bit more CPU intensive? I was able to boot fine with the Intel-only kernel, but once I started e.g. compiling ZFS everything locked up after a little while.

Should I mention that you can even get ARM instances on EC2 these days? They're the a1 family. And no, I haven't tried them out.

cloux commented 4 years ago

Nice to hear you find my distro useful. I made it this way: minimal, no bloat, and yet with all the usual tools preinstalled. Kind of a minimal "batteries included" general purpose OS.

I do something CPU intensive regularly: I compile the kernel on an EC2 instance. That being said, I have never compiled ZFS, but I have experienced lockups before. In my case this happened when PHP exhausted all RAM. This is not an issue specific to my distro, it is a Linux "feature". There are several ways to address this:

1) Use an instance type with more RAM.
2) Mount some swap space. You can enable swapfile autorun to create swap on instance start.
3) Check out OOMD, it is preinstalled. NOTE: this is a last-resort solution. It might kill some processes, but at least your instance will survive.
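For the swap option, a generic manual equivalent can be sketched as below. This is not the distro's swapfile autorun mechanism, only the standard dd / mkswap / swapon sequence wrapped in a hypothetical helper. The path and size are assumptions, and the commands need root. Setting DRY_RUN prints the commands instead of running them.

```shell
#!/bin/sh
# Hypothetical helper, not the distro's "swapfile autorun" service.

# run CMD [ARGS...]: execute the command, or only print it when DRY_RUN is set.
run () {
    if [ -n "${DRY_RUN:-}" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# make_swapfile FILE SIZE_MB: create and enable a swap file (requires root).
make_swapfile () {
    swap_file=$1
    swap_mb=$2
    # dd writes real blocks, which avoids the holes swapon rejects in sparse files
    run dd if=/dev/zero of="$swap_file" bs=1M count="$swap_mb" &&
    run chmod 600 "$swap_file" &&
    run mkswap "$swap_file" &&
    run swapon "$swap_file"
}

# Usage (as root; /swapfile and 1024 MB are example values):
#   make_swapfile /swapfile 1024
```

Swap enabled this way does not survive a reboot unless it is also listed in /etc/fstab or re-enabled by a boot-time service.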

So far, I only maintain the kernel for the x86 platform, and have no intention of supporting other platforms like ARM.

jmattsson commented 4 years ago

Somewhat tangentially, do you have the actual AMI build scripts available somewhere? I'd love to see how you've automated the whole thing (as I may need to do something similar).

cloux commented 4 years ago

You got me. I am very open about what I do here, but this setup is pretty much the only thing that I haven't published. The reason: the AMI build is done by scripts that are very specific to my release and would be useless to anybody else in this form. That being said, there are a few quirks that I had to solve, so I will probably publish some interesting code snippets later. I am working on a new website for that purpose. Big parts of what is required are already public in my repos.

The process itself is very simple. One shell script that uses awscli and a few jq tricks to build a new updated AMI from the latest published image, in a rolling release manner. The whole build is done automatically within AWS. No other special tools for AMI building like VirtualBox or Packer are used. The idea is:

Sample awscli + jq usage:

# FUNCTION: info about INSTANCE
get_instance_info () {
    INSTANCE_INFO=$(aws ec2 describe-instances --profile default --instance-ids "$1" \
    --query 'Reservations[*].Instances[*].{State:State.Name,PublicIP:PublicIpAddress}' 2>/dev/null)
    instance_state=$(printf '%s' "$INSTANCE_INFO" | jq -r '.[][].State')
    instance_publicip=$(printf '%s' "$INSTANCE_INFO" | jq -r '.[][].PublicIP')
}
# FUNCTION: info about AMI
get_image_state () {
    image_state=$(aws ec2 describe-images --profile default --image-ids "$1" 2>/dev/null |
                  jq -r '.Images[0].State')
}
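
A helper like get_instance_info is typically called in a polling loop while an instance starts or stops. The following is only a sketch of how that could look, not the unpublished build script. wait_for_state and its retry and delay parameters are hypothetical, and get_instance_info is repeated here in shortened form so the block stands alone.

```shell
#!/bin/sh
# Shortened copy of the helper above: sets $instance_state for instance ID $1.
get_instance_info () {
    INSTANCE_INFO=$(aws ec2 describe-instances --profile default --instance-ids "$1" \
        --query 'Reservations[*].Instances[*].{State:State.Name}' 2>/dev/null)
    instance_state=$(printf '%s' "$INSTANCE_INFO" | jq -r '.[][].State')
}

# Hypothetical helper: wait_for_state STATE INSTANCE_ID [TRIES] [DELAY_SECONDS]
# Polls until the instance reports STATE; returns 1 if TRIES polls all miss.
wait_for_state () {
    wanted=$1
    id=$2
    tries=${3:-60}
    delay=${4:-5}
    while [ "$tries" -gt 0 ]; do
        get_instance_info "$id"
        if [ "$instance_state" = "$wanted" ]; then
            return 0
        fi
        tries=$((tries - 1))
        sleep "$delay"
    done
    return 1
}

# Usage (the instance ID is a placeholder):
#   wait_for_state running i-0123456789abcdef0 || echo "instance did not come up"
```

The AWS CLI also ships built-in waiters such as `aws ec2 wait instance-running`, which may be simpler when no custom state handling is needed.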

jmattsson commented 4 years ago

Fantastic, thank you for sharing!