awslabs / aws-crt-nodejs

NodeJS bindings for the AWS Common Runtime.
Apache License 2.0

Segmentation Fault immediately on require inside Worker threads on Linux #452

Closed mikkopiu closed 1 year ago

mikkopiu commented 1 year ago

Describe the bug

When using Node.js Worker threads, a Segmentation fault (core dumped)/SIGSEGV is triggered as soon as aws-crt is imported, or more specifically, when its native binary is loaded.

For me, this first appeared after upgrading a project to AWS SDK JS v3: a test case run via ava (which uses Worker threads) started segfaulting immediately when it imported a module that invoked new FirehoseClient({}), which in turn imports and uses aws-crt.

Expected Behavior

Expected aws-crt to either throw the exception implemented in #290 or just work when using Worker threads (based on #451 but I might be misunderstanding).

Ideally, I'd be able to run tests using aws-crt concurrently with ava (using Worker threads).

Current Behavior

Immediate Segmentation fault (core dumped) upon require('aws-crt') (or equivalent).

As I'm not too familiar with debugging C(++), my debugging attempts probably contain a lot of red herrings but here are some of my attempts/findings so far:

  1. Using llnode (lldb plugin), the backtrace of the minimal repro at least looks weird:

    $ llnode /usr/bin/node -c /tmp/core.123
    (llnode) v8 bt
    * thread #1: tid = 487, 0x00007f4b814c1450, name = 'node', stop reason = signal SIGSEGV
    * frame #0: 0x00007f4b814c1450
    frame #1: 0x00007f4b8d256df0 libc.so.6`__restore_rt
    frame #2: 0x00007f4b814c1450
    frame #3: 0x00007f4b8d256df0 libc.so.6`__restore_rt
    ... Repeated >5600 times
    frame #5691: 0x00007f4b8d256df0 libc.so.6`__restore_rt
    frame #5692: 0x00007f4b815b7510
    frame #5693: 0x00007f4b8d29e931 libc.so.6`__GI___nptl_deallocate_tsd + 161
    frame #5694: 0x00007f4b8d2a16d6 libc.so.6`start_thread + 422
    frame #5695: 0x00007f4b8d241450 libc.so.6`__clone3 + 48
  2. Running the binary directly under lldb crashes with SIGSEGV: address access protected:

    $ chmod +x dist/bin/linux-x64/aws-crt-nodejs.node
    $ lldb dist/bin/linux-x64/aws-crt-nodejs.node
    (lldb) run
    Process 4291 launched: '/aws-crt-nodejs/dist/bin/linux-x64/aws-crt-nodejs.node' (x86_64)
    Process 4291 stopped
    * thread #1, name = 'aws-crt-nodejs.', stop reason = signal SIGSEGV: address access protected (fault address: 0x7ffff7a8a000)
        frame #0: 0x00007ffff7a8a000 aws-crt-nodejs.node
    ->  0x7ffff7a8a000: jg     0x7ffff7a8a047
    (lldb) memory read 0x7ffff7a8a000
    0x7ffff7a8a000: 7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00  .ELF............
    0x7ffff7a8a010: 03 00 3e 00 01 00 00 00 00 00 00 00 00 00 00 00  ..>.............

Reproduction Steps

I've been trying to identify the meaningful variables, but the most reliably reproducible example so far is (based on https://github.com/awslabs/aws-crt-nodejs/issues/286#issuecomment-1010148428):

  1. Start an EC2 instance with AMI al2023-ami-2023.0.20230503.0-kernel-6.1-x86_64 (latest Amazon Linux 2023 HVM at the time of writing)
    • or equivalent Linux host, the exact flavour and kernel version don't seem to matter too much (or I might just be really unlucky)
  2. On the host, install Node.js: yum install nodejs (from the built-in repos, it's 18.12.1 at the time of writing)
  3. Enable core dumps: ulimit -c unlimited
  4. Create repro files and run:

    cd $(mktemp -d)
    echo '{"name": "repro","type": "module","dependencies": {"aws-crt": "1.15.16"}}' > package.json
    npm install
    echo 'import { Worker } from "worker_threads"; const worker = new Worker("./reproWorker.js");' > index.js
    echo 'import "aws-crt";' > reproWorker.js
    node index.js
    # -> Segmentation fault (core dumped)
    • In my attempts, this also reproduces with all the versions listed below, and when I built aws-crt from source and required aws-crt-nodejs/dist/index.js (or the linux-x64 binary directly in CommonJS)
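Because the crash takes down the whole Node.js process, it can help to run the repro in a child process and inspect how that child ended. The following is a minimal sketch, not part of aws-crt or the original repro: the function name probeWorkerRequire and the PROBE_SPECIFIER environment variable are made up for illustration, and it uses CommonJS require (which the note above says also reproduces) rather than an ESM import.

```javascript
const { spawnSync } = require("child_process");

// Source for the child process. It starts a Worker that does nothing but
// require() the module named in the PROBE_SPECIFIER environment variable
// (a name invented for this sketch).
const CHILD_SOURCE = `
const { Worker } = require("worker_threads");
const source = "require(" + JSON.stringify(process.env.PROBE_SPECIFIER) + ");";
const worker = new Worker(source, { eval: true });
worker.on("error", (err) => {
  // An ordinary JavaScript error in the worker (e.g. module not found).
  console.error(err);
  process.exitCode = 1;
});
`;

// Loads `specifier` inside a Worker thread of a separate Node.js process,
// so that a native crash kills the child instead of the caller.
// `cwd` should be the directory whose node_modules contains the module.
function probeWorkerRequire(specifier, cwd = process.cwd()) {
  const result = spawnSync(process.execPath, ["-e", CHILD_SOURCE], {
    cwd,
    env: { ...process.env, PROBE_SPECIFIER: specifier },
    encoding: "utf8",
  });
  // `signal` is null on a normal exit and a name such as "SIGSEGV"
  // when the child was killed by a signal.
  return { exitCode: result.status, signal: result.signal, stderr: result.stderr };
}
```

Calling probeWorkerRequire("aws-crt") from the temporary repro directory would, on an affected setup, presumably report signal "SIGSEGV" rather than a zero exit code. That lets a test runner record the crash instead of dying with it.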

Possible Solution

No response

Additional Information/Context

If I'm not mistaken about aws-crt being supposed to work under Worker threads, I guess this is actually an upstream Node.js issue, but as mentioned, I'm not familiar enough with C(++) and Worker threads, so I haven't been able to confirm.

Here are all the setups I've been able to reproduce this with:

Versions of aws-crt:

Node.js:

Operating systems:

Memory:

Other:

aws-crt-nodejs version used

1.15.16

nodejs version used

18.12.1

Operating System and version

Amazon Linux 2023, AMI: al2023-ami-2023.0.20230503.0-kernel-6.1-x86_64, uname -a: Linux hostname 6.1.25-37.47.amzn2023.x86_64 #1 SMP PREEMPT_DYNAMIC Mon Apr 24 23:20:16 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

bretambrose commented 1 year ago

This is the thread local storage crash issue mentioned near the bottom of this: https://github.com/aws/aws-iot-device-sdk-js-v2/discussions/360

Current plan is to push through the linked s2n patches and switch from aws-lc to openssl's libcrypto which doesn't have a thread local storage destruction problem. I don't have an ETA atm.

klindklind commented 1 year ago

We have exactly the same problem currently which is blocking us from upgrading from aws-sdk v2 -> v3. Hopefully we get a fix soon 🙏

ruvenzx commented 1 year ago

Reproduced the same issue when running Node 17.7 on ARM64 with the aws-sdk/client-cognito-identity-provider package, which calls aws-crt and triggers a SIGSEGV. (I run this specifically on Docker Alpine; error: EXITED(139).)
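The 139 in that status lines up with SIGSEGV if one assumes the common convention of shells and container runtimes, where a process killed by a signal is reported as 128 plus the signal number (128 + 11 = 139). A small sketch of that decoding follows; the helper name signalFromExitStatus is hypothetical, and the lookup relies on Node's os.constants.signals table.

```javascript
const os = require("os");

// Maps a shell-style exit status (128 + signal number) back to a signal
// name, or returns null if the status does not look like a signal exit.
function signalFromExitStatus(status) {
  if (!Number.isInteger(status) || status <= 128) {
    return null;
  }
  const signalNumber = status - 128;
  const match = Object.entries(os.constants.signals).find(
    ([, number]) => number === signalNumber
  );
  return match ? match[0] : null;
}
```

For example, signalFromExitStatus(139) yields "SIGSEGV" on Linux, which matches the segmentation fault reported at the top of this issue.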

xer0x commented 1 year ago

+1 🙏 this has been very vexing for our team! Thank you for investigating! This has broken our AWS-CDK build process.

bretambrose commented 1 year ago

https://github.com/awslabs/aws-crt-nodejs/releases/tag/v1.15.19 should fix this crash.

We will update the v2 IoT SDK for JavaScript shortly. For other dependency updates, please contact the maintainer of the package directly.

mikkopiu commented 1 year ago

Can confirm that it fixed the crash in both my minimal repro and our test setup with aws-sdk-js-v3. For aws-sdk-js-v3 packages, at least as of version 3.348.0, the nested dependency ranges allow an in-place upgrade to v1.15.19 🎉

Thank you 👍
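For anyone checking whether a project has already picked up the fix, one option is to compare the resolved aws-crt version against 1.15.19, the release named above. The sketch below assumes plain major.minor.patch version strings (no pre-release tags); the helper names compareVersions and hasWorkerThreadFix are made up for illustration.

```javascript
// First release reported in this thread to fix the Worker thread crash.
const FIXED_IN = "1.15.19";

// Compares two "major.minor.patch" strings numerically.
// Returns -1, 0, or 1; pre-release suffixes are not handled.
function compareVersions(a, b) {
  const partsA = a.split(".").map(Number);
  const partsB = b.split(".").map(Number);
  for (let i = 0; i < 3; i++) {
    const diff = (partsA[i] || 0) - (partsB[i] || 0);
    if (diff !== 0) {
      return Math.sign(diff);
    }
  }
  return 0;
}

// True if the given installed version is at or above the fixed release.
function hasWorkerThreadFix(installedVersion) {
  return compareVersions(installedVersion, FIXED_IN) >= 0;
}
```

The installed version string can be read from the output of npm ls aws-crt. A numeric comparison matters here because a plain string comparison would rank "1.15.9" above "1.15.19".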