Are you consistently seeing those 7 tests failing for all the platforms you listed (Fedora 39, 40 and rawhide on aarch64)? How do you run the tests?
From the stack trace, the crash comes from memory allocation, which is most likely caused by running out of memory.
Yeah, in that stacktrace it's failing to allocate memory during the aws_http_library_init() call. Which is really weird, because that's one of the very first things an aws-c-http test does. Why would the allocator fail so extremely early?
Are you running all 600+ tests in parallel? I could imagine launching 600+ executables simultaneously might cause some of the later ones to experience malloc failure? But that's really the only guess I can come up with.
PS: The stacktrace is enormous because of an infinite loop: the memory allocation failure triggers an assert, the assert tries to print the backtrace, printing the backtrace needs to perform allocations, those allocations fail and trigger an assert... loop... loop... loop. But this isn't the cause of the test failures, just expected behavior when an assertion fails.
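To make that loop concrete, here's a minimal sketch of the feedback cycle (purely illustrative; this is not the actual aws-c-common assert/backtrace code):

```c
/* Purely illustrative sketch of the assert/backtrace feedback loop described
 * above -- NOT the actual aws-c-common code. Every allocation "fails", the
 * failure handler needs memory to print a backtrace, so it recurses until
 * the process dies with an enormous stack trace. */
#include <stdio.h>
#include <stdlib.h>

static void handle_fatal_error(const char *msg);

static void *failing_alloc(size_t size) {
    (void)size;
    handle_fatal_error("memory allocation failed"); /* pretend OOM */
    return NULL;
}

static void handle_fatal_error(const char *msg) {
    fprintf(stderr, "assert failed: %s\n", msg);
    /* Printing a symbolized backtrace needs allocations of its own ... */
    (void)failing_alloc(4096); /* ... which fail again -> loop */
    abort();
}

int main(void) {
    (void)failing_alloc(64);
    return 0;
}
```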
> Are you consistently seeing those 7 tests failing for all the platforms you listed (Fedora 39, 40 and rawhide on aarch64)?
Yes, as far as I can see it's always the same 7 tests, nothing random.
> How do you run the tests?
I use the Fedora build infrastructure (https://copr.fedoraproject.org)
> From the stack trace, the crash comes from memory allocation, which is most likely caused by running out of memory.
Based on the logs they seem to fail after 0.01 sec. I wonder if that is actually enough time to run out of memory? And why only on a couple of operating systems? In the same build infra, aarch64 on RHEL 8 and RHEL 9 is fine.
> Are you running all 600+ tests in parallel? I could imagine launching 600+ executables simultaneously might cause some of the later ones to experience malloc failure? But that's really the only guess I can come up with.
No, they run in sequence, one after another, as far as I can tell from the logs. At least, I run the cmake tests without any custom parameters.
I can run manual tests on an EC2 Graviton instance if that helps. But given that the tests are fine on RHEL 8 and RHEL 9 in the same build infra on aarch64, yet fail on F39, F40 and rawhide, I get the impression it's not a general problem with the build infra / resources. The major difference I see is the OpenSSL version that's installed and used. My guess is that it could therefore be related to the OpenSSL version. But why would it then only affect aarch64 and not x86_64 :-/
Yeah ... it is suspicious that it's just those specific tests, all of which share some helper functions. Maybe there's an uninitialized variable in there?
Could you give me steps I could use to reproduce and iterate on this? I'm not familiar with COPR.
I can't find a way to run Fedora 40 directly. I tried to run it on an EC2 Graviton machine, but couldn't find a free Amazon Machine Image (AMI) for Fedora 39, 40, or rawhide. So instead I ran Amazon Linux 2023 on a Graviton machine, and within that I used docker to run the fedora:40 image. I ran the following commands (apologies if I left something out here):
I dug into what makes those 7 tests different from all the other tests defined in the same file. One difference that stands out is that they all have a large error_tester struct defined in the function, which means it's created on the stack. This struct is 403 KiB.
The tests in this file that pass all use a similar s_tester struct, except that one is defined as a global variable instead of a stack variable.
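To illustrate the difference (names and sizes here are made up for the sketch; the real structs live in tests/test_h1_server.c):

```c
/* Hypothetical sketch of the stack-vs-global difference described above.
 * A ~400 KiB struct as a local variable is carved out of the thread's stack
 * and can blow past a small stack limit; the same struct as a global lives
 * in .bss and costs no stack at all. */
#include <string.h>

struct big_tester {
    char buffers[400 * 1024]; /* roughly the 403 KiB mentioned above */
};

static struct big_tester s_global_tester; /* in .bss, no stack usage */

static int test_using_global_tester(void) {
    memset(&s_global_tester, 0, sizeof(s_global_tester));
    return 0;
}

static int test_using_stack_tester(void) {
    struct big_tester local_tester; /* ~400 KiB taken from the stack; with
                                     * `ulimit -s 256` this alone overflows it */
    memset(&local_tester, 0, sizeof(local_tester));
    return 0;
}

int main(void) {
    return test_using_global_tester() | test_using_stack_tester();
}
```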
I've confirmed that the test crashes with a coredump if I set the stack size to 256 KiB via ulimit -s 256, but if the stack size is 512 KiB the test passes. I'm not 100% convinced this is your issue, though, because when it crashes I just see Segmentation fault (core dumped), whereas you shared logs where it seemed like some code got a chance to run before crashing...
Maybe when COPR runs aarch64 Fedora 39, 40, or rawhide, the default stack size is smaller than in other configurations? Can you check the default stack size on these COPR runs via ulimit -s?
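If running `ulimit -s` inside the COPR chroot is awkward, the same limit could also be read from inside a process with getrlimit(). This is just a suggestion for gathering data, not part of aws-c-http:

```c
/* Suggestion only: print the stack limit from inside a process, equivalent
 * to `ulimit -s` (which reports KiB, while rlim_cur is in bytes). */
#include <stdio.h>
#include <sys/resource.h>

int main(void) {
    struct rlimit rl;
    if (getrlimit(RLIMIT_STACK, &rl) != 0) {
        perror("getrlimit");
        return 1;
    }
    if (rl.rlim_cur == RLIM_INFINITY) {
        printf("stack limit: unlimited\n");
    } else {
        printf("stack limit: %llu KiB\n", (unsigned long long)(rl.rlim_cur / 1024));
    }
    return 0;
}
```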
Also, did this crash start when you updated aws-c-http, or when you updated the Fedora versions you're testing? Or is this the first time you're ever testing aws-c-http?
@graebm thank you for looking into this and doing all the testing. This problem seems to be tricky to reproduce. I'm packaging aws-c-http for Fedora and EPEL. Copr is a Fedora build infrastructure that contributors can use to build and test packages (https://copr.fedorainfracloud.org/coprs/). It leverages AWS under the hood; the aarch64 builds run on c7g.xlarge spot instances. ulimit -s returns 8192. The build host runs Fedora 39 and is the same for all aarch64 builds I trigger. Interestingly, epel9 is successful while rawhide fails. I'm still looking around to see if I can find any other relevant difference from a build environment perspective. It either has something to do with the buildroot that's used for each variation, or it boils down to a software package / dependency. As initially mentioned, the major difference seems to be the OpenSSL version.
[root@aws-aarch64-spot-prod-08533740-20240730-083508 ~]# cat /etc/os-release
NAME="Fedora Linux"
VERSION="39 (Cloud Edition)"
ID=fedora
VERSION_ID=39
VERSION_CODENAME=""
PLATFORM_ID="platform:f39"
PRETTY_NAME="Fedora Linux 39 (Cloud Edition)"
ANSI_COLOR="0;38;2;60;110;180"
LOGO=fedora-logo-icon
CPE_NAME="cpe:/o:fedoraproject:fedora:39"
HOME_URL="https://fedoraproject.org/"
DOCUMENTATION_URL="https://docs.fedoraproject.org/en-US/fedora/f39/system-administrators-guide/"
SUPPORT_URL="https://ask.fedoraproject.org/"
BUG_REPORT_URL="https://bugzilla.redhat.com/"
REDHAT_BUGZILLA_PRODUCT="Fedora"
REDHAT_BUGZILLA_PRODUCT_VERSION=39
REDHAT_SUPPORT_PRODUCT="Fedora"
REDHAT_SUPPORT_PRODUCT_VERSION=39
SUPPORT_END=2024-11-12
VARIANT="Cloud Edition"
VARIANT_ID=cloud
[root@aws-aarch64-spot-prod-08533740-20240730-083508 ~]# lscpu
Architecture: aarch64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 4
On-line CPU(s) list: 0-3
Vendor ID: ARM
BIOS Vendor ID: AWS
Model name: Neoverse-V1
BIOS Model name: AWS Graviton3 AWS Graviton3 CPU @ 2.6GHz
BIOS CPU family: 257
Model: 1
Thread(s) per core: 1
Core(s) per socket: 4
Socket(s): 1
Stepping: r1p1
BogoMIPS: 2100.00
Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm ssbs paca pacg dcpodp svei8mm svebf16 i8mm bf16 dgh rng
Caches (sum of all):
L1d: 256 KiB (4 instances)
L1i: 256 KiB (4 instances)
L2: 4 MiB (4 instances)
L3: 32 MiB (1 instance)
NUMA:
NUMA node(s): 1
NUMA node0 CPU(s): 0-3
Vulnerabilities:
Gather data sampling: Not affected
Itlb multihit: Not affected
L1tf: Not affected
Mds: Not affected
Meltdown: Not affected
Mmio stale data: Not affected
Retbleed: Not affected
Spec rstack overflow: Not affected
Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Spectre v1: Mitigation; __user pointer sanitization
Spectre v2: Mitigation; CSV2, BHB
Srbds: Not affected
Tsx async abort: Not affected
[root@aws-aarch64-spot-prod-08533740-20240730-083508 ~]# ulimit -s
8192
[root@aws-aarch64-spot-prod-08533740-20240730-083508 ~]# free -h
total used free shared buff/cache available
Mem: 7.6Gi 2.0Gi 4.6Gi 1.5Gi 2.6Gi 5.6Gi
Swap: 143Gi 0B 143Gi
[root@aws-aarch64-spot-prod-08533740-20240730-083508 ~]# curl -H "X-aws-ec2-metadata-token: $TOKEN" http://169.254.169.254/latest/meta-data/instance-type
c7g.xlarge
[root@aws-aarch64-spot-prod-08533740-20240730-083508 ~]# curl -H "X-aws-ec2-metadata-token: $TOKEN" http://169.254.169.254/latest/meta-data/placement/availability-zone
us-east-1c
[root@aws-aarch64-spot-prod-08533740-20240730-083508 ~]# curl -H "X-aws-ec2-metadata-token: $TOKEN" http://169.254.169.254/latest/meta-data/placement/availability-zone-id
use1-az6
[root@aws-aarch64-spot-prod-08533740-20240730-083508 ~]# curl -H "X-aws-ec2-metadata-token: $TOKEN" http://169.254.169.254/latest/meta-data/instance-life-cycle
spot
The Fedora Project has another build environment, called Koji, which is used for building the actual distribution packages that become part of the Fedora repos. There was no issue building the package, including running the tests, in this environment. It looks like on-premises hosts with a bit more resources compared to the build machines in Copr.
CPU info:
Architecture: aarch64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 80
On-line CPU(s) list: 0-79
Vendor ID: ARM
Model name: Neoverse-N1
Model: 1
Thread(s) per core: 1
Core(s) per socket: 80
Socket(s): 1
Stepping: r3p1
Frequency boost: disabled
CPU(s) scaling MHz: 35%
CPU max MHz: 3000.0000
CPU min MHz: 1000.0000
BogoMIPS: 50.00
Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp ssbs
L1d cache: 5 MiB (80 instances)
L1i cache: 5 MiB (80 instances)
L2 cache: 80 MiB (80 instances)
NUMA node(s): 1
NUMA node0 CPU(s): 0-79
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed: Not affected
Vulnerability Spec rstack overflow: Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1: Mitigation; __user pointer sanitization
Vulnerability Spectre v2: Mitigation; CSV2, BHB
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected
Memory:
total used free shared buff/cache available
Mem: 394773564 6701480 369350056 5608 22065640 388072084
Swap: 8388604 1792 8386812
Storage (chroot, cache_topdir):
Filesystem Type Size Used Avail Use% Mounted on
/dev/mapper/luks-40ff6a63-0029-401b-96da-6c033dcd37b3 btrfs 2.9T 13G 2.9T 1% /
/dev/mapper/luks-40ff6a63-0029-401b-96da-6c033dcd37b3 btrfs 2.9T 13G 2.9T 1% /
> I can't find a way to run Fedora 40 directly. I tried to run it on an EC2 Graviton machine, but couldn't find a free Amazon Machine Image (AMI) for Fedora 39, 40, or rawhide.
You can go to https://fedoraproject.org/cloud/download#cloud_launch and launch an instance from there. It opens the AWS Console with the pre-filled AMI ID.
I launched a c7g.xlarge in us-east-1 with F40 and reproduced your steps:
dnf install -y git cmake gcc openssl-devel
# for s2n-tls, aws-c-common, aws-c-io, aws-c-compression, and aws-c-http:
git clone (using latest, I haven't tried the specific commits from that build)
cmake -S $REPO -B build/$REPO -DBUILD_SHARED_LIBS=ON -DCMAKE_PREFIX_PATH=/usr/local -DCMAKE_BUILD_TYPE=Release
cmake --build build/$REPO --target install --parallel
cd build/aws-c-http/tests
./aws-c-http-tests h1_server_close_before_message_is_sent
ctest
Everything went fine, 100% tests passed, 0 tests failed out of 680.
I checked the spec files of aws-c-http and its dependencies that I maintain.
aws-c-common (no -DCMAKE_BUILD_TYPE=Release): https://src.fedoraproject.org/rpms/aws-c-common/blob/rawhide/f/aws-c-common.spec#_46
aws-c-cal (no -DCMAKE_BUILD_TYPE=Release, -DUSE_OPENSSL=ON): https://src.fedoraproject.org/rpms/aws-c-cal/blob/rawhide/f/aws-c-cal.spec#_49
aws-c-io (no -DCMAKE_BUILD_TYPE=Release): https://src.fedoraproject.org/rpms/aws-c-io/blob/rawhide/f/aws-c-io.spec#_51
aws-c-compression (no -DCMAKE_BUILD_TYPE=Release): https://src.fedoraproject.org/rpms/aws-c-compression/blob/rawhide/f/aws-c-compression.spec#_53
s2n-tls (-GNinja): https://src.fedoraproject.org/rpms/s2n-tls/blob/rawhide/f/s2n-tls.spec#_56
aws-c-http (no -DCMAKE_BUILD_TYPE=Release): https://src.fedoraproject.org/rpms/aws-c-http/blob/rawhide/f/aws-c-http.spec#_55
I don't use -DCMAKE_BUILD_TYPE=Release in most cases. I build aws-c-cal with -DUSE_OPENSSL=ON and s2n-tls with -GNinja.
I would have suspected that using the system OpenSSL makes a difference, but no, the tests still pass.
As a next test, I installed the Fedora Packager Tools (https://docs.fedoraproject.org/en-US/package-maintainers/Installing_Packager_Tools/).
I cloned the Fedora package (https://src.fedoraproject.org/rpms/aws-c-http.git): git clone https://src.fedoraproject.org/rpms/aws-c-http.git
And triggered a mockbuild: fedpkg --release f40 mockbuild
This ends up in the initially reported issue:
99% tests passed, 7 tests failed out of 664
Total Test time (real) = 7.31 sec
The following tests FAILED:
601 - h1_server_close_before_message_is_sent (Failed)
602 - h1_server_error_from_incoming_request_callback_stops_decoder (Failed)
603 - h1_server_error_from_incoming_headers_callback_stops_decoder (Failed)
604 - h1_server_error_from_incoming_headers_done_callback_stops_decoder (Failed)
605 - h1_server_error_from_incoming_request_done_callback_stops_decoder (Failed)
606 - h1_server_error_from_incoming_body_callback_stops_decoder (Failed)
607 - h1_server_error_from_outgoing_body_callback_stops_sending (Failed)
Errors while running CTest
error: Bad exit status from /var/tmp/rpm-tmp.1ENAiP (%check)
RPM build errors:
Bad exit status from /var/tmp/rpm-tmp.1ENAiP (%check)
Doing a mockbuild with AlmaLinux for EPEL9: mock -r alma+epel-9-aarch64 --resultdir /home/fedora/fedpkg/aws-c-http/results_aws-c-http/0.8.2/1.el9 --rebuild /home/fedora/fedpkg/aws-c-http/aws-c-http-0.8.2-1.el9.src.rpm
The result is fine, and I'm still on the same c7g.xlarge instance:
100% tests passed, 0 tests failed out of 664
And at this point I'm a bit clueless.
In the Koji build environment with build hosts that have more hardware resources = no issue
On a c7g.xlarge instance (similar to the one used in the Copr build environment)
The mock configs and templates define which container image to use, which repos to attach, which packages to install, and things like that. There is no resource limitation or anything like that, which brings me to the assumption that it can only be software-version related.
BUT why are we fine on a Fedora 40 aarch64 host when compiling and running the tests manually, while it fails in an F40 container on the same system O_o
I updated the package today to the latest release and the behaviour was similar to last time. So the issue is pretty much limited to a specific mock environment and hard to reproduce. Overall it works, and in the build environment "that counts" the problem doesn't show up either. I'm closing this issue for now. Thanks for your support!
Describe the bug
Hi,
I'm packaging aws-c-http for Fedora and EPEL. I run the unit tests as part of the build process. 7 tests fail on Fedora (39, 40, rawhide) on aarch64. They run successfully on x86_64. Also, EPEL8 (RHEL8) and EPEL9 (RHEL9) are fine on both architectures.
I suspect that the problem is related to the OpenSSL version and a behaviour specific to aarch64. I'm looking for help on how to further troubleshoot and fix the issue. I'm not an expert in C and have a hard time making sense of the stacktrace.
I also packaged the dependencies like aws-c-common (https://src.fedoraproject.org/rpms/aws-c-common) and s2n-tls (https://src.fedoraproject.org/rpms/s2n-tls). All of them use BUILD_SHARED_LIBS=ON. I don't use aws-lc; I build against the system OpenSSL instead.
Expected Behavior
Passing unit tests on Fedora 39, 40 and rawhide on aarch64, just like on x86_64
Current Behavior
Works
EPEL8:
EPEL9:
Fails
F39:
F40:
F41:
The latest copr build for all architectures and OS versions: https://copr.fedorainfracloud.org/coprs/wombelix/aws-c-libs/build/7722282/
The relevant log outputs from fedora-rawhide-x86_64:
Mapping of the failed tests to their locations in the aws-c-http source:
The following tests FAILED:
601 - h1_server_close_before_message_is_sent (Failed)
https://github.com/awslabs/aws-c-http/blob/079ccfd253a2ac24f6c1998881911876f8d1c631/tests/CMakeLists.txt#L593 https://github.com/awslabs/aws-c-http/blob/079ccfd253a2ac24f6c1998881911876f8d1c631/tests/test_h1_server.c#L1591
https://github.com/awslabs/aws-c-http/blob/079ccfd253a2ac24f6c1998881911876f8d1c631/tests/CMakeLists.txt#L594 https://github.com/awslabs/aws-c-http/blob/079ccfd253a2ac24f6c1998881911876f8d1c631/tests/test_h1_server.c#L1629
https://github.com/awslabs/aws-c-http/blob/079ccfd253a2ac24f6c1998881911876f8d1c631/tests/CMakeLists.txt#L595 https://github.com/awslabs/aws-c-http/blob/079ccfd253a2ac24f6c1998881911876f8d1c631/tests/test_h1_server.c#L1635
https://github.com/awslabs/aws-c-http/blob/079ccfd253a2ac24f6c1998881911876f8d1c631/tests/CMakeLists.txt#L596 https://github.com/awslabs/aws-c-http/blob/079ccfd253a2ac24f6c1998881911876f8d1c631/tests/test_h1_server.c#L1641
https://github.com/awslabs/aws-c-http/blob/079ccfd253a2ac24f6c1998881911876f8d1c631/tests/CMakeLists.txt#L597 https://github.com/awslabs/aws-c-http/blob/079ccfd253a2ac24f6c1998881911876f8d1c631/tests/test_h1_server.c#L1653
https://github.com/awslabs/aws-c-http/blob/079ccfd253a2ac24f6c1998881911876f8d1c631/tests/CMakeLists.txt#L598 https://github.com/awslabs/aws-c-http/blob/079ccfd253a2ac24f6c1998881911876f8d1c631/tests/test_h1_server.c#L1647
https://github.com/awslabs/aws-c-http/blob/079ccfd253a2ac24f6c1998881911876f8d1c631/tests/CMakeLists.txt#L599 https://github.com/awslabs/aws-c-http/blob/079ccfd253a2ac24f6c1998881911876f8d1c631/tests/test_h1_server.c#L1658
All seem to have in common that they use:
s_test_error_from_callback https://github.com/awslabs/aws-c-http/blob/079ccfd253a2ac24f6c1998881911876f8d1c631/tests/test_h1_server.c#L1523
Stacktrace attached, too long to add inline:
stacktrace_aws-c-http_f41_rawhide.txt
Reproduction Steps
Build and run the unit tests on a Fedora 39, 40 or rawhide aarch64 system.
Possible Solution
No response
Additional Information/Context
No response
aws-c-http version used
0.8.2
Compiler and version used
13.3.1-1.fc39, 14.1.1-7.fc40, 14.1.1-7.fc41
Operating System and version
Fedora 39, 40, rawhide aarch64