Orange-OpenSource / hurl

Hurl, run and test HTTP requests with plain text.
https://hurl.dev
Apache License 2.0

Memory corruption when running the Hurl CLI with `--test` mode #3297

Open · lambrospetrou opened this issue 1 week ago

lambrospetrou commented 1 week ago

What is the current bug behavior?

There is a corrupt heap memory operation somewhere when finalizing the Hurl run. I am getting the following errors every few hundred runs of a simple Hurl script:

GET https://unpkg.com/vue@3.4.27/dist/vue.global.prod.js
HTTP 200

Errors occurring:

hurl --test --repeat 100 vuejs.hurl
...

Executed files:    100
Executed requests: 100 (91.6/s)
Succeeded files:   100 (100.0%)
Failed files:      0 (0.0%)
Duration:          1092 ms

double free or corruption (fasttop)
double free or corruption (fasttop)
double free or corruption (fasttop)
Segmentation fault (core dumped)

One time I got this error instead: `malloc_consolidate(): unaligned fastbin chunk detected`, which is memory related too.

Steps to reproduce

Create the file `Dockerfile-hurl-memory-issue` with the following contents:

# Hurl needs GLibc 2.28+. We want curl version 8.4.0+ as well
# to properly support --max-filesize: https://github.com/Orange-OpenSource/hurl/issues/3245
# https://launchpad.net/ubuntu/+source/curl
FROM ubuntu:noble

# Hurl dependencies.
RUN apt-get update && apt-get install --no-install-recommends -y ca-certificates curl libcurl4 libxml2

# Install Hurl as per docs: https://hurl.dev/docs/installation.html#debian-ubuntu
RUN VERSION=5.0.1 && curl --location --remote-name https://github.com/Orange-OpenSource/hurl/releases/download/$VERSION/hurl_${VERSION}_amd64.deb && apt update && apt install ./hurl_${VERSION}_amd64.deb

Build the image and get inside the container:

docker build -t hurl-memory-test -f Dockerfile-hurl-memory-issue . && docker run --rm -it --entrypoint /bin/bash hurl-memory-test

Create the Hurl file:

echo -e "GET https://unpkg.com/vue@3.4.27/dist/vue.global.prod.js\nHTTP 200" > vuejs.hurl

Run the Hurl file (you might need to repeat the following a few times until you trigger the memory error):

hurl --test --repeat 100 vuejs.hurl

What is the expected correct behavior?

It should complete successfully without any errors.

Execution context

root@f2abb0a5ee0c:/# hurl --version
hurl 5.0.1 (x86_64-pc-linux-gnu) libcurl/8.5.0 OpenSSL/3.0.13 zlib/1.3 brotli/1.1.0 zstd/1.5.5 libidn2/2.3.7 libssh/0.10.6/openssl/zlib nghttp2/1.59.0
Features (libcurl):  alt-svc AsynchDNS brotli HSTS HTTP2 IDN IPv6 Largefile libz NTLM SPNEGO SSL TLS-SRP UnixSockets zstd
Features (built-in): brotli

The issue occurs with the latest `ubuntu:noble` tag as of today: https://hub.docker.com/layers/library/ubuntu/noble-20240904.1/images/sha256-74f92a6b3589aa5cac6028719aaac83de4037bad4371ae79ba362834389035aa?context=explore

It also happens with `debian:trixie-slim`: https://hub.docker.com/layers/library/debian/trixie-slim/images/sha256-4cc21783b221c03e841bc1274e5babef8c52ccabfc20abadc37e83e37bed6990?context=explore

Possible fixes

Not sure...

lambrospetrou commented 1 week ago

To add more context, the issue is not related to `--repeat`. On Skybear.NET I run scripts without that option, hence only once, and every few days I get the above memory corruption errors. Today, for example, it happened 3 times.

fabricereix commented 5 days ago

Hi @lambrospetrou, thanks for reporting the issue. I haven't been able to reproduce it yet on my local machine with your steps. We are going to try to reproduce it in our pipeline, and we'll come back to you.

lambrospetrou commented 5 days ago

> Hi @lambrospetrou, thanks for reporting the issue. I haven't been able to reproduce it yet on my local machine with your steps. We are going to try to reproduce it in our pipeline, and we'll come back to you.

Yeah, sometimes it takes a few attempts to make it crash. I can always trigger it if I run the `hurl ...` command in a bash loop or re-run it quickly enough a few times.

jcamiel commented 3 days ago

Hi all,

I tried to reproduce the bug on my machine (MBP M1), but I couldn't reproduce it either locally (with brew's Hurl) or in a Docker container (with the instructions provided by @lambrospetrou). I've tried hard 😅! In the case of Docker, I didn't use the .deb file (it's only for the x86 arch); I installed the arm64 precompiled binary instead.

That said, I ran the sample through valgrind and got this result:

$ cargo valgrind run -p hurl -- --test /tmp/vuejs.hurl 
    Finished `dev` profile [unoptimized + debuginfo] target(s) in 0.04s
     Running `/root/.cargo/bin/cargo-valgrind target/debug/hurl --test --jobs 1 /tmp/vuejs.hurl`
       Error leaked 384 B in 1 block
        Info stack trace (user code at the bottom)
             at calloc
             at calloc (rtld-malloc.h:44)
             at allocate_dtv (dl-tls.c:370)
             at _dl_allocate_tls (dl-tls.c:629)
             at allocate_stack (allocatestack.c:429)
             at pthread_create@@GLIBC_2.34 (pthread_create.c:655)
             at std::sys::pal::unix::thread::Thread::new (thread.rs:87)
             at std::thread::Builder::spawn_unchecked_ (mod.rs:577)
             at std::thread::Builder::spawn_unchecked (mod.rs:456)
             at spawn<hurl::parallel::worker::{impl#3}::new::{closure_env#0}, core::result::Result<(), std::sync::mpsc::SendError<hurl::parallel::message::WorkerMessage>>> (mod.rs:388)
             at std::thread::spawn (mod.rs:694)
             at hurl::parallel::worker::Worker::new (worker.rs:75)
             at hurl::parallel::runner::ParallelRunner::new::{{closure}} (runner.rs:127)
             at core::iter::adapters::map::map_fold::{{closure}} (map.rs:89)
             at core::iter::traits::iterator::Iterator::fold (iterator.rs:2588)
     Summary Leaked 384 B total (0 other errors)

So there is one leaked thread detected by valgrind. I'm not sure if this is related, but we should fix it.
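
For illustration, here is a minimal standalone Rust sketch (not Hurl's actual code) of the pattern behind a report like the one above: a thread created with `std::thread::spawn` whose `JoinHandle` is never joined can still hold the TLS block that `pthread_create` allocated for it when the process exits, while a joined thread is fully cleaned up first.

```rust
use std::thread;
use std::time::Duration;

fn main() {
    // Case 1: the JoinHandle is dropped, so the thread is detached. If the
    // process exits while this thread is still alive, the TLS block that
    // pthread_create allocated for it can show up in valgrind, much like
    // the 384 B block in the trace above.
    let detached = thread::spawn(|| thread::sleep(Duration::from_secs(1)));
    drop(detached);

    // Case 2: the JoinHandle is kept and joined. The main thread waits for
    // the worker to finish before exiting, so there is nothing left for a
    // leak checker to report.
    let joined = thread::spawn(|| thread::sleep(Duration::from_millis(10)));
    joined.join().unwrap();
}
```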

@lambrospetrou can you try to reproduce the issue without using `--test`? That way the code will use the sequential run path, and we will see whether the issue is in the parallel implementation or not. If these errors started appearing with 5.0.1, this could explain why...

lambrospetrou commented 3 days ago

I cannot reproduce it without the `--test` option. I tried that back when I opened the issue, and it only happens when using the `--test` option; I have just confirmed it again now.

I am attaching a video showing that it happens every few runs on my side with the `--test` option 😄 Check the 6th run, the 11th run, and the last two.

https://github.com/user-attachments/assets/f6c52a9b-1bdb-4604-ad0f-1685388f66ca

Also, when using only 1 thread with `hurl --test --repeat 100 --jobs 1 vuejs.hurl` it works fine; I cannot make it fail. So it does indeed seem to be something in the parallel mode.

jcamiel commented 3 days ago

Your Docker host is a Windows machine, am I correct? I will try to reproduce the bug from a Windows machine (with Docker etc.). You're reproducing it quite easily while I can't manage it at all; maybe that makes a difference? (It doesn't explain why it happens on skybear.net though.)

lambrospetrou commented 3 days ago

> Your Docker host is a Windows machine, am I correct? I will try to reproduce the bug from a Windows machine (with Docker etc.). You're reproducing it quite easily while I can't manage it at all; maybe that makes a difference? (It doesn't explain why it happens on skybear.net though.)

Locally, I am using WSL2 (Ubuntu 24.04.1 LTS) in Windows 11, and using Docker Desktop on the Windows side.

For Skybear.NET, I use Fly.io and my container is based on `ubuntu:noble`.

fabricereix commented 2 days ago

I have added an integration test that combines `--test` and `--repeat`: https://github.com/Orange-OpenSource/hurl/pull/3305. The bug is still not reproduced :-(

jcamiel commented 2 days ago

I haven't managed to reproduce it, but there's something I would make a little cleaner in the graceful shutdown of the worker threads. I don't know whether it will fix this issue, but it will be cleaner and it's worth trying (plus it could fix the valgrind issue that bothers me). As soon as I have time to do it!
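
For reference, here is a minimal sketch of the kind of graceful shutdown described here, with hypothetical names (this is not Hurl's actual `parallel` module): each worker owns a thread fed by a channel, the runner sends an explicit stop message when the run is over, and every `JoinHandle` is joined before the process exits, so no worker thread is left running or leaked.

```rust
use std::sync::mpsc::{channel, Sender};
use std::thread::{self, JoinHandle};

// Hypothetical message type, for illustration only; Hurl's real
// WorkerMessage carries more information.
enum Message {
    Job(String),
    Stop,
}

// Hypothetical worker: a thread plus the sender used to feed it work.
struct Worker {
    tx: Sender<Message>,
    handle: JoinHandle<()>,
}

impl Worker {
    fn new() -> Self {
        let (tx, rx) = channel::<Message>();
        let handle = thread::spawn(move || {
            // Process jobs until an explicit Stop message arrives (or the
            // channel closes because all senders were dropped).
            while let Ok(msg) = rx.recv() {
                match msg {
                    Message::Job(file) => println!("running {file}"),
                    Message::Stop => break,
                }
            }
        });
        Worker { tx, handle }
    }
}

fn main() {
    let workers: Vec<Worker> = (0..4).map(|_| Worker::new()).collect();

    // Hand out some work.
    for w in &workers {
        w.tx.send(Message::Job("vuejs.hurl".to_string())).unwrap();
    }

    // Graceful shutdown: tell each worker to stop, then join its thread,
    // so no worker is still running when the process exits.
    for w in workers {
        w.tx.send(Message::Stop).unwrap();
        w.handle.join().unwrap();
    }
}
```

The key point is that the runner keeps every `JoinHandle` and joins it, rather than letting the handles drop and the threads detach.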

lambrospetrou commented 1 day ago

For the record, I have retries on Skybear.NET now, so when this happens again the script execution is simply retried; it's not urgent at the moment. It would be nice, though, to see what in the parallelism could cause race conditions on memory. I might take a stab at reading some Rust later in the week 😏