awslabs / aws-c-http

C99 implementation of the HTTP/1.1 and HTTP/2 specifications
Apache License 2.0

7 tests fail on aarch64 when building for Fedora 39, 40 and rawhide #473

Closed: wombelix closed this issue 3 months ago

wombelix commented 4 months ago

Describe the bug

Hi,

I'm packaging aws-c-http for Fedora and EPEL. I run the unit tests as part of the build process. 7 tests fail on Fedora (39, 40, rawhide) on aarch64. They run successfully on x86_64. EPEL8 (RHEL8) and EPEL9 (RHEL9) are also fine on both architectures.

I suspect that the problem is related to the OpenSSL version and some aarch64-specific behaviour. I'm looking for help with further troubleshooting and fixing the issue. I'm not an expert in C and have a hard time making sense of the stacktrace.

I also packaged the dependencies, like aws-c-common (https://src.fedoraproject.org/rpms/aws-c-common) and s2n-tls (https://src.fedoraproject.org/rpms/s2n-tls). All of them use BUILD_SHARED_LIBS=ON. I don't use aws-lc; I build against the system OpenSSL instead.

Expected Behavior

Passing unit tests on Fedora 39, 40 and rawhide on aarch64, just as on x86_64

Current Behavior

Works

EPEL8:

gcc                           aarch64  8.5.0-22.el8_10

-- Found crypto: /usr/lib64/libcrypto.so  
-- LibCrypto Include Dir: /usr/include
-- LibCrypto Shared Lib:  /usr/lib64/libcrypto.so
-- LibCrypto Static Lib:  crypto_STATIC_LIBRARY-NOTFOUND
-- Found OpenSSL: /usr/lib64/libcrypto.so (found version "1.1.1k")  

EPEL9:

gcc                       aarch64  11.4.1-3.el9

-- Found crypto: /usr/lib64/libcrypto.so  
-- LibCrypto Include Dir: /usr/include
-- LibCrypto Shared Lib:  /usr/lib64/libcrypto.so
-- LibCrypto Static Lib:  crypto_STATIC_LIBRARY-NOTFOUND
-- Found OpenSSL: /usr/lib64/libcrypto.so (found version "3.0.7")  

Fails

F39:

gcc                        aarch64    13.3.1-1.fc39

-- Found crypto: /usr/lib64/libcrypto.so  
-- LibCrypto Include Dir: /usr/include
-- LibCrypto Shared Lib:  /usr/lib64/libcrypto.so
-- LibCrypto Static Lib:  crypto_STATIC_LIBRARY-NOTFOUND
-- Found OpenSSL: /usr/lib64/libcrypto.so (found version "3.1.1")  

F40:

gcc                     aarch64 14.1.1-7.fc40

-- Found crypto: /usr/lib64/libcrypto.so  
-- LibCrypto Include Dir: /usr/include
-- LibCrypto Shared Lib:  /usr/lib64/libcrypto.so
-- LibCrypto Static Lib:  crypto_STATIC_LIBRARY-NOTFOUND
-- Found OpenSSL: /usr/lib64/libcrypto.so (found version "3.2.1")  

F41:

gcc                     aarch64 14.1.1-7.fc41

-- Found crypto: /usr/lib64/libcrypto.so  
-- LibCrypto Include Dir: /usr/include
-- LibCrypto Shared Lib:  /usr/lib64/libcrypto.so
-- LibCrypto Static Lib:  crypto_STATIC_LIBRARY-NOTFOUND
-- Found OpenSSL: /usr/lib64/libcrypto.so (found version "3.2.2")  

The latest copr build for all architectures and OS versions: https://copr.fedorainfracloud.org/coprs/wombelix/aws-c-libs/build/7722282/

The relevant log output from fedora-rawhide-aarch64:

99% tests passed, 7 tests failed out of 664

Total Test time (real) =   7.52 sec

The following tests FAILED:
    601 - h1_server_close_before_message_is_sent (Failed)
    602 - h1_server_error_from_incoming_request_callback_stops_decoder (Failed)
    603 - h1_server_error_from_incoming_headers_callback_stops_decoder (Failed)
    604 - h1_server_error_from_incoming_headers_done_callback_stops_decoder (Failed)
    605 - h1_server_error_from_incoming_request_done_callback_stops_decoder (Failed)
    606 - h1_server_error_from_incoming_body_callback_stops_decoder (Failed)
    607 - h1_server_error_from_outgoing_body_callback_stops_sending (Failed)
Errors while running CTest

Mapping of the failed test to its location in the aws-c-http source:

The following tests FAILED:

601 - h1_server_close_before_message_is_sent (Failed)

https://github.com/awslabs/aws-c-http/blob/079ccfd253a2ac24f6c1998881911876f8d1c631/tests/CMakeLists.txt#L593 https://github.com/awslabs/aws-c-http/blob/079ccfd253a2ac24f6c1998881911876f8d1c631/tests/test_h1_server.c#L1591

602 - h1_server_error_from_incoming_request_callback_stops_decoder (Failed)

https://github.com/awslabs/aws-c-http/blob/079ccfd253a2ac24f6c1998881911876f8d1c631/tests/CMakeLists.txt#L594 https://github.com/awslabs/aws-c-http/blob/079ccfd253a2ac24f6c1998881911876f8d1c631/tests/test_h1_server.c#L1629

603 - h1_server_error_from_incoming_headers_callback_stops_decoder (Failed)

https://github.com/awslabs/aws-c-http/blob/079ccfd253a2ac24f6c1998881911876f8d1c631/tests/CMakeLists.txt#L595 https://github.com/awslabs/aws-c-http/blob/079ccfd253a2ac24f6c1998881911876f8d1c631/tests/test_h1_server.c#L1635

604 - h1_server_error_from_incoming_headers_done_callback_stops_decoder (Failed)

https://github.com/awslabs/aws-c-http/blob/079ccfd253a2ac24f6c1998881911876f8d1c631/tests/CMakeLists.txt#L596 https://github.com/awslabs/aws-c-http/blob/079ccfd253a2ac24f6c1998881911876f8d1c631/tests/test_h1_server.c#L1641

605 - h1_server_error_from_incoming_request_done_callback_stops_decoder (Failed)

https://github.com/awslabs/aws-c-http/blob/079ccfd253a2ac24f6c1998881911876f8d1c631/tests/CMakeLists.txt#L597 https://github.com/awslabs/aws-c-http/blob/079ccfd253a2ac24f6c1998881911876f8d1c631/tests/test_h1_server.c#L1653

606 - h1_server_error_from_incoming_body_callback_stops_decoder (Failed)

https://github.com/awslabs/aws-c-http/blob/079ccfd253a2ac24f6c1998881911876f8d1c631/tests/CMakeLists.txt#L598 https://github.com/awslabs/aws-c-http/blob/079ccfd253a2ac24f6c1998881911876f8d1c631/tests/test_h1_server.c#L1647

607 - h1_server_error_from_outgoing_body_callback_stops_sending (Failed)

https://github.com/awslabs/aws-c-http/blob/079ccfd253a2ac24f6c1998881911876f8d1c631/tests/CMakeLists.txt#L599 https://github.com/awslabs/aws-c-http/blob/079ccfd253a2ac24f6c1998881911876f8d1c631/tests/test_h1_server.c#L1658

All seem to have in common that they use:

s_test_error_from_callback https://github.com/awslabs/aws-c-http/blob/079ccfd253a2ac24f6c1998881911876f8d1c631/tests/test_h1_server.c#L1523

Stacktrace attached, too long to include inline:

stacktrace_aws-c-http_f41_rawhide.txt

Reproduction Steps

Build and run the unit tests on a Fedora 39, 40 or rawhide aarch64 system.

Possible Solution

No response

Additional Information/Context

No response

aws-c-http version used

0.8.2

Compiler and version used

13.3.1-1.fc39, 14.1.1-7.fc40, 14.1.1-7.fc41

Operating System and version

Fedora 39, 40, rawhide aarch64

TingDaoK commented 4 months ago

Are you consistently seeing those 7 tests failing for all the platforms you listed (Fedora 39, 40 and rawhide on aarch64)? How do you run the tests?

From the stack trace, the crash comes from memory allocation, which is most likely caused by running out of memory.

graebm commented 4 months ago

Yeah, in that stacktrace it's failing to allocate memory during the aws_http_library_init() call. Which is really weird, because that's one of the very first things an aws-c-http test does. Why would the allocator fail so extremely early?

Are you running all 600+ tests in parallel? I could imagine launching 600+ executables simultaneously might cause some of the later ones to experience malloc failures. But that's really the only guess I can come up with.

PS: The reason the stacktrace is enormous is an infinite loop: the memory allocation failure triggers an assert, the assert tries to print the backtrace, printing the backtrace needs to perform allocations, those allocations fail and trigger another assert... loop... loop... loop. But this isn't the cause of the test failures, just expected behavior when an assertion fails.

wombelix commented 4 months ago

Are you consistently seeing those 7 tests failing for all the platforms you listed (Fedora 39, 40 and rawhide on aarch64)?

Yes, as far as I can see it's always the same 7 tests, nothing random.

How do you run the tests?

I use the Fedora build infrastructure (https://copr.fedoraproject.org)

From the stack trace, the crash comes from memory allocation, which mostly caused by out of memory.

Based on the logs, they seem to fail after 0.01 sec; I wonder if that is actually enough time to run out of memory. And why only on a couple of operating systems? In the same build infra, aarch64 on RHEL 8 and RHEL 9 is fine.

Are you running all 600+ tests in parallel? I could imagine launching 600+ executables simultaneously might cause some of the later ones to experience malloc failure? But that's really the only guess I can come up with.

No, they run sequentially, one after the other, as far as I can tell from the logs. At least I run ctest without any custom parameters.

I can run manual tests on an EC2 Graviton instance if that helps. But given that the tests pass on RHEL 8 and RHEL 9 in the same build infra on aarch64, but fail on F39, F40 and rawhide, I get the impression it's not a general problem with the build infra / resources. The major difference I see is the installed OpenSSL version. My guess is that it could therefore be related to that. But why would it then only affect aarch64 and not x86_64 :-/

graebm commented 4 months ago

yeah ... it is suspicious that it's just those specific tests, all of which share some helper functions. Maybe there's an uninitialized variable in there?

Could you give me steps I could use to reproduce and iterate on this? I'm not familiar with COPR.

I can't find a way to run Fedora 40 directly. I tried to run it on an EC2 Graviton machine, but couldn't find a free Amazon Machine Image (AMI) for Fedora 39, 40, or rawhide. So instead I ran Amazon Linux 2023 on a Graviton machine, and within that I used Docker to run the fedora:40 image. I ran the following commands (apologies if I left something out here):

graebm commented 4 months ago

I dug into what makes those 7 tests different from all the other tests defined in the same file. One difference that stands out is that they all have a large error_tester struct defined in the function, which means it's created on the stack. This struct is 403KiB.

The tests in this file that pass all use a similar s_tester struct, except that one is defined as a global variable instead of a stack variable.

I've confirmed that the test crashes with a coredump if I set the stack size to 256KiB via ulimit -s 256, but passes if the stack size is 512KiB. I'm not 100% convinced this is the cause, though, because when it crashes I just see Segmentation fault (core dumped), while you shared logs where it seemed like some code got a chance to run before crashing...

Maybe when COPR runs aarch64 Fedora 39, 40, or rawhide, the default stack size is smaller than in other configurations? Can you check the default stack size on these COPR runs via ulimit -s ?

Also, did this crash start when you updated aws-c-http? or when you updated the fedora versions you're testing? Or is this the first time you're ever testing aws-c-http?

wombelix commented 4 months ago

@graebm thank you for looking into this and doing all the testing. This problem seems to be tricky to reproduce. I'm packaging aws-c-http for Fedora and EPEL. Copr is a Fedora build infrastructure that contributors can use to build and test packages (https://copr.fedorainfracloud.org/coprs/). It leverages AWS under the hood; aarch64 builds run on c7g.xlarge spot instances. ulimit -s returns 8192. The build host runs Fedora 39 and is the same for all aarch64 builds I trigger. Interestingly, epel9 is successful while rawhide fails. I'm still looking for any other relevant difference from a build-environment perspective. It either has something to do with the buildroot used for each variation, or it boils down to a software package / dependency. As initially mentioned, the major difference seems to be the OpenSSL version.

[root@aws-aarch64-spot-prod-08533740-20240730-083508 ~]# cat /etc/os-release 
NAME="Fedora Linux"
VERSION="39 (Cloud Edition)"
ID=fedora
VERSION_ID=39
VERSION_CODENAME=""
PLATFORM_ID="platform:f39"
PRETTY_NAME="Fedora Linux 39 (Cloud Edition)"
ANSI_COLOR="0;38;2;60;110;180"
LOGO=fedora-logo-icon
CPE_NAME="cpe:/o:fedoraproject:fedora:39"
HOME_URL="https://fedoraproject.org/"
DOCUMENTATION_URL="https://docs.fedoraproject.org/en-US/fedora/f39/system-administrators-guide/"
SUPPORT_URL="https://ask.fedoraproject.org/"
BUG_REPORT_URL="https://bugzilla.redhat.com/"
REDHAT_BUGZILLA_PRODUCT="Fedora"
REDHAT_BUGZILLA_PRODUCT_VERSION=39
REDHAT_SUPPORT_PRODUCT="Fedora"
REDHAT_SUPPORT_PRODUCT_VERSION=39
SUPPORT_END=2024-11-12
VARIANT="Cloud Edition"
VARIANT_ID=cloud

[root@aws-aarch64-spot-prod-08533740-20240730-083508 ~]# lscpu
Architecture:           aarch64
  CPU op-mode(s):       32-bit, 64-bit
  Byte Order:           Little Endian
CPU(s):                 4
  On-line CPU(s) list:  0-3
Vendor ID:              ARM
  BIOS Vendor ID:       AWS
  Model name:           Neoverse-V1
    BIOS Model name:    AWS Graviton3 AWS Graviton3 CPU @ 2.6GHz
    BIOS CPU family:    257
    Model:              1
    Thread(s) per core: 1
    Core(s) per socket: 4
    Socket(s):          1
    Stepping:           r1p1
    BogoMIPS:           2100.00
    Flags:              fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm ssbs paca pacg dcpodp svei8mm svebf16 i8mm bf16 dgh rng
Caches (sum of all):    
  L1d:                  256 KiB (4 instances)
  L1i:                  256 KiB (4 instances)
  L2:                   4 MiB (4 instances)
  L3:                   32 MiB (1 instance)
NUMA:                   
  NUMA node(s):         1
  NUMA node0 CPU(s):    0-3
Vulnerabilities:        
  Gather data sampling: Not affected
  Itlb multihit:        Not affected
  L1tf:                 Not affected
  Mds:                  Not affected
  Meltdown:             Not affected
  Mmio stale data:      Not affected
  Retbleed:             Not affected
  Spec rstack overflow: Not affected
  Spec store bypass:    Mitigation; Speculative Store Bypass disabled via prctl
  Spectre v1:           Mitigation; __user pointer sanitization
  Spectre v2:           Mitigation; CSV2, BHB
  Srbds:                Not affected
  Tsx async abort:      Not affected
[root@aws-aarch64-spot-prod-08533740-20240730-083508 ~]# ulimit -s
8192

[root@aws-aarch64-spot-prod-08533740-20240730-083508 ~]# free -h
               total        used        free      shared  buff/cache   available
Mem:           7.6Gi       2.0Gi       4.6Gi       1.5Gi       2.6Gi       5.6Gi
Swap:          143Gi          0B       143Gi

[root@aws-aarch64-spot-prod-08533740-20240730-083508 ~]# curl -H "X-aws-ec2-metadata-token: $TOKEN" http://169.254.169.254/latest/meta-data/instance-type
c7g.xlarge

[root@aws-aarch64-spot-prod-08533740-20240730-083508 ~]# curl -H "X-aws-ec2-metadata-token: $TOKEN" http://169.254.169.254/latest/meta-data/placement/availability-zone
us-east-1c

[root@aws-aarch64-spot-prod-08533740-20240730-083508 ~]# curl -H "X-aws-ec2-metadata-token: $TOKEN" http://169.254.169.254/latest/meta-data/placement/availability-zone-id
use1-az6

[root@aws-aarch64-spot-prod-08533740-20240730-083508 ~]# curl -H "X-aws-ec2-metadata-token: $TOKEN" http://169.254.169.254/latest/meta-data/instance-life-cycle
spot
wombelix commented 4 months ago

The Fedora Project has another build environment, called Koji; it's used for building the actual distribution packages that become part of the Fedora repos. There was no issue building the package, including running the tests, in this environment. It looks like on-premises hosts with a bit more resources compared to the build machines in Copr.

CPU info:
Architecture:                         aarch64
CPU op-mode(s):                       32-bit, 64-bit
Byte Order:                           Little Endian
CPU(s):                               80
On-line CPU(s) list:                  0-79
Vendor ID:                            ARM
Model name:                           Neoverse-N1
Model:                                1
Thread(s) per core:                   1
Core(s) per socket:                   80
Socket(s):                            1
Stepping:                             r3p1
Frequency boost:                      disabled
CPU(s) scaling MHz:                   35%
CPU max MHz:                          3000.0000
CPU min MHz:                          1000.0000
BogoMIPS:                             50.00
Flags:                                fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp ssbs
L1d cache:                            5 MiB (80 instances)
L1i cache:                            5 MiB (80 instances)
L2 cache:                             80 MiB (80 instances)
NUMA node(s):                         1
NUMA node0 CPU(s):                    0-79
Vulnerability Gather data sampling:   Not affected
Vulnerability Itlb multihit:          Not affected
Vulnerability L1tf:                   Not affected
Vulnerability Mds:                    Not affected
Vulnerability Meltdown:               Not affected
Vulnerability Mmio stale data:        Not affected
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed:               Not affected
Vulnerability Spec rstack overflow:   Not affected
Vulnerability Spec store bypass:      Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:             Mitigation; __user pointer sanitization
Vulnerability Spectre v2:             Mitigation; CSV2, BHB
Vulnerability Srbds:                  Not affected
Vulnerability Tsx async abort:        Not affected

Memory:
               total        used        free      shared  buff/cache   available
Mem:       394773564     6701480   369350056        5608    22065640   388072084
Swap:        8388604        1792     8386812

Storage (chroot, cache_topdir):
Filesystem                                            Type   Size  Used Avail Use% Mounted on
/dev/mapper/luks-40ff6a63-0029-401b-96da-6c033dcd37b3 btrfs  2.9T   13G  2.9T   1% /
/dev/mapper/luks-40ff6a63-0029-401b-96da-6c033dcd37b3 btrfs  2.9T   13G  2.9T   1% /

I can't find a way to run Fedora 40 directly. I tried to run it on an EC2 Graviton machine, but couldn't find a free Amazon Machine Image (AMI) for Fedora 39, 40, or rawhide.

You can go to https://fedoraproject.org/cloud/download#cloud_launch and launch an instance from there. It opens the AWS Console with the pre-filled AMI ID.

I launched a c7g.xlarge in us-east-1 with F40 and reproduced your steps:

dnf install -y git cmake gcc openssl-devel

# for s2n-tls, aws-c-common, aws-c-io, aws-c-compression, and aws-c-http:
git clone (using latest, I haven't tried the specific commits from that build)
cmake -S $REPO -B build/$REPO -DBUILD_SHARED_LIBS=ON -DCMAKE_PREFIX_PATH=/usr/local -DCMAKE_BUILD_TYPE=Release
cmake --build build/$REPO --target install --parallel

cd build/aws-c-http/tests
./aws-c-http-tests h1_server_close_before_message_is_sent

ctest

Everything went fine: 100% tests passed, 0 tests failed out of 680.

I checked the spec files of aws-c-http and its dependencies that I maintain.

aws-c-common (no -DCMAKE_BUILD_TYPE=Release): https://src.fedoraproject.org/rpms/aws-c-common/blob/rawhide/f/aws-c-common.spec#_46
aws-c-cal (no -DCMAKE_BUILD_TYPE=Release, -DUSE_OPENSSL=ON): https://src.fedoraproject.org/rpms/aws-c-cal/blob/rawhide/f/aws-c-cal.spec#_49
aws-c-io (no -DCMAKE_BUILD_TYPE=Release): https://src.fedoraproject.org/rpms/aws-c-io/blob/rawhide/f/aws-c-io.spec#_51
aws-c-compression (no -DCMAKE_BUILD_TYPE=Release): https://src.fedoraproject.org/rpms/aws-c-compression/blob/rawhide/f/aws-c-compression.spec#_53
s2n-tls (-GNinja): https://src.fedoraproject.org/rpms/s2n-tls/blob/rawhide/f/s2n-tls.spec#_56
aws-c-http (no -DCMAKE_BUILD_TYPE=Release): https://src.fedoraproject.org/rpms/aws-c-http/blob/rawhide/f/aws-c-http.spec#_55

I don't use -DCMAKE_BUILD_TYPE=Release in most cases. I build aws-c-cal with -DUSE_OPENSSL=ON and s2n-tls with -GNinja. I would suspect that using the system OpenSSL makes a difference, but no, the tests still pass.

Next test: I installed the Fedora Packager Tools (https://docs.fedoraproject.org/en-US/package-maintainers/Installing_Packager_Tools/), cloned the Fedora package (git clone https://src.fedoraproject.org/rpms/aws-c-http.git), and triggered a mockbuild with fedpkg --release f40 mockbuild. This ends up in the initially reported issue:

99% tests passed, 7 tests failed out of 664

Total Test time (real) =   7.31 sec

The following tests FAILED:
    601 - h1_server_close_before_message_is_sent (Failed)
    602 - h1_server_error_from_incoming_request_callback_stops_decoder (Failed)
    603 - h1_server_error_from_incoming_headers_callback_stops_decoder (Failed)
    604 - h1_server_error_from_incoming_headers_done_callback_stops_decoder (Failed)
    605 - h1_server_error_from_incoming_request_done_callback_stops_decoder (Failed)
    606 - h1_server_error_from_incoming_body_callback_stops_decoder (Failed)
    607 - h1_server_error_from_outgoing_body_callback_stops_sending (Failed)
Errors while running CTest
error: Bad exit status from /var/tmp/rpm-tmp.1ENAiP (%check)

RPM build errors:
    Bad exit status from /var/tmp/rpm-tmp.1ENAiP (%check)

Doing a mockbuild with AlmaLinux for EPEL9: mock -r alma+epel-9-aarch64 --resultdir /home/fedora/fedpkg/aws-c-http/results_aws-c-http/0.8.2/1.el9 --rebuild /home/fedora/fedpkg/aws-c-http/aws-c-http-0.8.2-1.el9.src.rpm. The result is fine; I'm still on the same c7g.xlarge instance:

100% tests passed, 0 tests failed out of 664

And at this point I'm a bit clueless. In the Koji build environment, with build hosts that have more hardware resources: no issue. On a c7g.xlarge instance (similar to the ones used in the Copr build environment): the manual build and test run passes, but the f40 mockbuild fails.

The mock configs and templates define which container image to use, which repos to attach, which packages to install, and things like that. There is no resource limitation or anything similar. Which brings me to the assumption that it can only be software-version related.

BUT why are we fine on a Fedora 40 aarch64 host when compiling and running the tests manually, while it fails in an F40 container on the same system O_o.

wombelix commented 3 months ago

I updated the package today to the latest release and the behaviour was similar to last time. So the issue is pretty much limited to a specific mock environment and hard to reproduce. Overall it works, and in the build environment "that counts" the problem doesn't show up either. I'm closing this issue for now. Thanks for your support!