Orange-OpenSource / hurl

Hurl, run and test HTTP requests with plain text.
https://hurl.dev
Apache License 2.0

Memory corruption when running the Hurl CLI with `--test` mode #3297

Open · lambrospetrou opened this issue 1 week ago

lambrospetrou commented 1 week ago

What is the current bug behavior?

There is a corrupt heap memory operation somewhere when finalizing the Hurl run. I am getting the following errors every few hundred runs of a simple Hurl script:

GET https://unpkg.com/vue@3.4.27/dist/vue.global.prod.js
HTTP 200

Errors occurring:

hurl --test --repeat 100 vuejs.hurl
...

Executed files:    100
Executed requests: 100 (91.6/s)
Succeeded files:   100 (100.0%)
Failed files:      0 (0.0%)
Duration:          1092 ms

double free or corruption (fasttop)
double free or corruption (fasttop)
double free or corruption (fasttop)
Segmentation fault (core dumped)

One time I got this error instead: `malloc_consolidate(): unaligned fastbin chunk detected`, which is memory related too.

Steps to reproduce

Create the file `Dockerfile-hurl-memory-issue` with the following contents:

# Hurl needs GLibc 2.28+. We want curl version 8.4.0+ as well
# to properly support --max-filesize: https://github.com/Orange-OpenSource/hurl/issues/3245
# https://launchpad.net/ubuntu/+source/curl
FROM ubuntu:noble

# Hurl dependencies.
RUN apt-get update && apt-get install --no-install-recommends -y ca-certificates curl libcurl4 libxml2

# Install Hurl as per docs: https://hurl.dev/docs/installation.html#debian-ubuntu
RUN VERSION=5.0.1 && curl --location --remote-name https://github.com/Orange-OpenSource/hurl/releases/download/$VERSION/hurl_${VERSION}_amd64.deb && apt update && apt install ./hurl_${VERSION}_amd64.deb

Build the image and get inside the container:

docker build -t hurl-memory-test -f Dockerfile-hurl-memory-issue . && docker run --rm -it --entrypoint /bin/bash hurl-memory-test

Create the Hurl file:

echo -e "GET https://unpkg.com/vue@3.4.27/dist/vue.global.prod.js\nHTTP 200" > vuejs.hurl

Run the Hurl file (you might need to repeat the following a few times until you trigger the memory error):

hurl --test --repeat 100 vuejs.hurl

What is the expected correct behavior?

It should complete successfully without any errors.

Execution context

root@f2abb0a5ee0c:/# hurl --version
hurl 5.0.1 (x86_64-pc-linux-gnu) libcurl/8.5.0 OpenSSL/3.0.13 zlib/1.3 brotli/1.1.0 zstd/1.5.5 libidn2/2.3.7 libssh/0.10.6/openssl/zlib nghttp2/1.59.0
Features (libcurl):  alt-svc AsynchDNS brotli HSTS HTTP2 IDN IPv6 Largefile libz NTLM SPNEGO SSL TLS-SRP UnixSockets zstd
Features (built-in): brotli

The issue occurs with the latest `ubuntu:noble` tag as of today: https://hub.docker.com/layers/library/ubuntu/noble-20240904.1/images/sha256-74f92a6b3589aa5cac6028719aaac83de4037bad4371ae79ba362834389035aa?context=explore

It also happens with `debian:trixie-slim`: https://hub.docker.com/layers/library/debian/trixie-slim/images/sha256-4cc21783b221c03e841bc1274e5babef8c52ccabfc20abadc37e83e37bed6990?context=explore

Possible fixes

Not sure...

lambrospetrou commented 1 week ago

To add more context, the issue is not related to `--repeat`. On Skybear.NET I run scripts without that option, hence only once, and every few days I get the above memory corruption errors. Today, for example, it happened 3 times.

fabricereix commented 5 days ago

Hi @lambrospetrou, thanks for reporting the issue. I haven't been able to reproduce it yet on my local machine with your steps. We are going to try to reproduce it in our pipeline, and we'll come back to you.

lambrospetrou commented 5 days ago

> Hi @lambrospetrou, thanks for reporting the issue. I haven't been able to reproduce it yet on my local machine with your steps. We are going to try to reproduce it in our pipeline, and we'll come back to you.

Yeah, sometimes it takes a few attempts to make it crash. I can always trigger it if I run the `hurl ...` command in a bash loop or re-run it quickly enough a few times.

jcamiel commented 3 days ago

Hi all,

I tried to reproduce the bug on my machine (MBP M1), but I couldn't reproduce it either locally (with brew's Hurl) or in a Docker container (with the instructions provided by @lambrospetrou). I've tried hard 😅! In the case of Docker, I didn't use the .deb file (it's only for the x86 arch); I installed the arm64 precompiled binary instead.

That said, I ran the sample through valgrind and got this result:

$ cargo valgrind run -p hurl -- --test /tmp/vuejs.hurl 
    Finished `dev` profile [unoptimized + debuginfo] target(s) in 0.04s
     Running `/root/.cargo/bin/cargo-valgrind target/debug/hurl --test --jobs 1 /tmp/vuejs.hurl`
       Error leaked 384 B in 1 block
        Info stack trace (user code at the bottom)
             at calloc
             at calloc (rtld-malloc.h:44)
             at allocate_dtv (dl-tls.c:370)
             at _dl_allocate_tls (dl-tls.c:629)
             at allocate_stack (allocatestack.c:429)
             at pthread_create@@GLIBC_2.34 (pthread_create.c:655)
             at std::sys::pal::unix::thread::Thread::new (thread.rs:87)
             at std::thread::Builder::spawn_unchecked_ (mod.rs:577)
             at std::thread::Builder::spawn_unchecked (mod.rs:456)
             at spawn<hurl::parallel::worker::{impl#3}::new::{closure_env#0}, core::result::Result<(), std::sync::mpsc::SendError<hurl::parallel::message::WorkerMessage>>> (mod.rs:388)
             at std::thread::spawn (mod.rs:694)
             at hurl::parallel::worker::Worker::new (worker.rs:75)
             at hurl::parallel::runner::ParallelRunner::new::{{closure}} (runner.rs:127)
             at core::iter::adapters::map::map_fold::{{closure}} (map.rs:89)
             at core::iter::traits::iterator::Iterator::fold (iterator.rs:2588)
     Summary Leaked 384 B total (0 other errors)

So there is one leaked thread detected by valgrind. I'm not sure if this is related, but we should fix it.
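
For illustration, here is a minimal standalone Rust sketch (not Hurl's actual code) of the pattern behind a report like the one above: a thread created with `std::thread::spawn` whose `JoinHandle` is never joined can still hold the TLS block that `pthread_create` allocated for it when the process exits, while a joined thread is fully cleaned up first.

```rust
use std::thread;
use std::time::Duration;

fn main() {
    // Case 1: the JoinHandle is dropped, so the thread is detached. If the
    // process exits while this thread is still alive, the TLS block that
    // pthread_create allocated for it can show up in valgrind, much like
    // the 384 B block in the trace above.
    let detached = thread::spawn(|| thread::sleep(Duration::from_secs(1)));
    drop(detached);

    // Case 2: the JoinHandle is kept and joined. The main thread waits for
    // the worker to finish before exiting, so there is nothing left for a
    // leak checker to report.
    let joined = thread::spawn(|| thread::sleep(Duration::from_millis(10)));
    joined.join().unwrap();
}
```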

@lambrospetrou can you try to reproduce the issue without using `--test`? That way the code will use the sequential run path, and we will see whether the issue is in the parallel implementation or not. If these errors started appearing with 5.0.1, this could explain why...

lambrospetrou commented 3 days ago

I cannot reproduce it without the `--test` option. I tried that back when I opened the issue, and it only happens when using the `--test` option; I have just confirmed it again now.

I am attaching a video showing that it happens every few runs on my side with the `--test` option 😄 Check the 6th run, the 11th run, and the last two.

https://github.com/user-attachments/assets/f6c52a9b-1bdb-4604-ad0f-1685388f66ca

Also, when using only 1 thread with `hurl --test --repeat 100 --jobs 1 vuejs.hurl` it works fine; I cannot make it fail. So it does indeed seem to be something in the parallel mode.

jcamiel commented 3 days ago

Your Docker host is a Windows machine, am I correct? I will try to reproduce the bug from a Windows machine (with Docker etc.). You're reproducing it quite easily while I can't manage it at all; maybe that makes a difference? (It doesn't explain why it happens on skybear.net though.)

lambrospetrou commented 3 days ago

> Your Docker host is a Windows machine, am I correct? I will try to reproduce the bug from a Windows machine (with Docker etc.). You're reproducing it quite easily while I can't manage it at all; maybe that makes a difference? (It doesn't explain why it happens on skybear.net though.)

Locally, I am using WSL2 (Ubuntu 24.04.1 LTS) in Windows 11, and using Docker Desktop on the Windows side.

For Skybear.NET, I use Fly.io and my container is based on `ubuntu:noble`.

fabricereix commented 2 days ago

I have added an integration test that combines `--test` and `--repeat`: https://github.com/Orange-OpenSource/hurl/pull/3305. The bug is still not reproduced :-(

jcamiel commented 2 days ago

I haven't managed to reproduce it, but there's something I would make a little cleaner in the graceful shutdown of the worker threads. I don't know whether it will fix this issue, but it will be cleaner and it's worth trying (plus it could fix the valgrind issue that bothers me). As soon as I have time to do it!
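
For reference, here is a minimal sketch of the kind of graceful shutdown described here, with hypothetical names (this is not Hurl's actual `parallel` module): each worker owns a thread fed by a channel, the runner sends an explicit stop message when the run is over, and every `JoinHandle` is joined before the process exits, so no worker thread is left running or leaked.

```rust
use std::sync::mpsc::{channel, Sender};
use std::thread::{self, JoinHandle};

// Hypothetical message type, for illustration only; Hurl's real
// WorkerMessage carries more information.
enum Message {
    Job(String),
    Stop,
}

// Hypothetical worker: a thread plus the sender used to feed it work.
struct Worker {
    tx: Sender<Message>,
    handle: JoinHandle<()>,
}

impl Worker {
    fn new() -> Self {
        let (tx, rx) = channel::<Message>();
        let handle = thread::spawn(move || {
            // Process jobs until an explicit Stop message arrives (or the
            // channel closes because all senders were dropped).
            while let Ok(msg) = rx.recv() {
                match msg {
                    Message::Job(file) => println!("running {file}"),
                    Message::Stop => break,
                }
            }
        });
        Worker { tx, handle }
    }
}

fn main() {
    let workers: Vec<Worker> = (0..4).map(|_| Worker::new()).collect();

    // Hand out some work.
    for w in &workers {
        w.tx.send(Message::Job("vuejs.hurl".to_string())).unwrap();
    }

    // Graceful shutdown: tell each worker to stop, then join its thread,
    // so no worker is still running when the process exits.
    for w in workers {
        w.tx.send(Message::Stop).unwrap();
        w.handle.join().unwrap();
    }
}
```

The key point is that the runner keeps every `JoinHandle` and joins it, rather than letting the handles drop and the threads detach.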

lambrospetrou commented 1 day ago

For the record, I have retries on Skybear.NET now, so when this happens again the script execution is simply retried; it's not urgent at the moment. It would be nice, though, to see what in the parallelism could cause race conditions on memory. I might take a stab at reading some Rust later in the week 😏