Large `Args` struct causes stack overflow on Intel SGX

trevor-crypto commented 1 year ago

Please complete the following tasks

[X] I have searched the discussions
[X] I have searched the open and rejected issues

Rust Version

rustc 1.67.0-nightly (a28f3c88e 2022-11-20)

Clap Version

4.0.27

Minimal reproducible code

Too large to show paste: https://gist.github.com/rust-play/708d645d1c26354657fb8df945718ccc

.cargo/config :

[target.x86_64-fortanix-unknown-sgx]
runner = "ftxsgx-runner-cargo"
[build]
target = "x86_64-fortanix-unknown-sgx"

Cargo.toml:

[dependencies]
clap = { version = "4.0.27", default-features = false,  features = ["std", "derive", "help", "usage", "error-context"] }

[package.metadata.fortanix-sgx]
heap-size = 0x10000000
stack-size = 0x100000

Steps to reproduce the bug with the above code

Have a working environment to compile and run against the x86_64-fortanix-unknown-sgx target
cargo run -- --help using the minimal reproducible code above
Notice the SIGBUS triggered
Now try cargo run --release -- --help
Notice all is normal (help is shown)

Actual Behaviour

     Running `ftxsgx-runner-cargo target/x86_64-fortanix-unknown-sgx/debug/clap-test`
SIGBUS triggered: likely caused by stack overflow in enclave.
ERROR: while running "ftxsgx-runner" "target/x86_64-fortanix-unknown-sgx/debug/clap-test.sgxs" got signal: 7

Expected Behaviour

The --debug profile should work correctly, as --release does, and not trigger a stack overflow.

Additional Context

I know this is an obscure issue due to the fact that the amount of args is insanely large, but I actually have a use-case that is similar to this since with Fortanix Intel SGX, you do not have I/O for a config file, or the ability to use environment variables. I have already increased stack-size, but it doesn't seem to help after a certain point.

I don't expect this to be addressed, so it would be great to just have some insight into how to "please" clap (clap's parser?) so that it doesn't stack overflow.

Debug Output

None, unfortunately.

epage commented 1 year ago

Happen to be able to get a crash dump? I'm curious which code hits a stack overflow.

clap::Arg is fairly big. I hope to shrink it as part of https://github.com/clap-rs/clap/discussions/3476.

rustc and llvm don't do an efficient job with removing stack copies, like with types like clap::Arg: https://arewestackefficientyet.com/

If this is when we read ArgMatches to populate the struct, then I have no idea what could be improved about that.

EDIT: I tried to reproduce this on x64 Linux but the compile times for hundreds of args was just becoming too much for me.

epage commented 1 year ago

I suspect a workaround would be to break the Args up into multiple structs and use #[command(flatten)]. That should force stack space to get reclaimed.

trevor-crypto commented 1 year ago

Thanks for the quick response @epage. I will try your suggestion of breaking up the args into structs then flattening.

Here is a stacktrace:

(gdb) bt
#0  0x00007fffc00b2456 in clap_test::{impl#5}::augment_args (__clap_app=...) at src/main.rs:3
#1  0x00007fffc0073629 in clap_test::{impl#3}::command () at src/main.rs:3
#2  0x00007fffc0061010 in clap::derive::Parser::parse<clap_test::A> ()
    at ~/.cargo/registry/src/github.com-1ecc6299db9ec823/clap-4.0.27/src/derive.rs:82
#3  0x00007fffc0070cb1 in clap_test::main () at src/main.rs:349

Seems like it comes from clap::Args derive macro generation of augment_args?

epage commented 1 year ago

Huh, so it is augment_args. Nothing is being created on the stack. We do have temporaries being created for passing into functions. There are likely also a lot of moves happening.

I wonder if there is a way to restructure things to reduce all of this.

trevor-crypto commented 1 year ago

I wonder if there is a way to restructure things to reduce all of this.

I will try to restructure the giant struct and get back to you.

trevor-crypto commented 1 year ago

I was able to split the args struct and the issue goes away, as you suspected. Albeit successful, the parsing actually takes quite a while compared to argh (which we're currently using with a massive config struct).

Updated playground with broken code

epage commented 1 year ago

How much is the performance off between the two?

What is the performance situation? Large number of argument definitions with very few present?

trevor-crypto commented 1 year ago

Just ran a quick test a few times, with all of the arguments present: (debug build)

clap parse duration: 6.765064496s
argh parse duration: 58.132469ms

You can see the massive arg struct here, where clap is split up and flattened, and argh has it fully inline.

Here is the full example in a repo

epage commented 1 year ago

For reference, on a laptop, I'm getting

clap parse duration: 12.344762ms
argh parse duration: 1.14286ms

EDIT: Updated for release build

Either way, something for me to look into.'

btw if you want something inbetween argh and clap., bpaf might be something to try. My biggest concern over it is some usability aspects due to its functional heritage.

epage commented 1 year ago

According to callgrind, most of the time is spend in memcmp's (40+%). This makes sense because in clap v4, we switched to some naiive data types for lower binary overhead with the assumption that there weren't a lot of unique arguments but instead a large argv would be taken up with a single positional argument passed in from xargs. In general, this is a use case that clap's dynamic design can't compete against parsers with a more static design (as confirmed by callgrind below). Something I want to explore is more hybrid approach for clap so it can be dynamic where needed and static elsewhere.

Switching to clap HEAD (which shouldn't matter), I'm getting

clap parse duration: 10.297272ms
argh parse duration: 1.216893ms

Switching out for HashMap with the default hasher, I'm getting

clap parse duration: 4.226175ms
argh parse duration: 1.198631ms

We could possibly put that behind a feature flag. It will require some breaking changes though and can't be done until 5.0.

With that, callgrind is saying the biggest higher up functions are

55% validate (1x)
- 51% gather_conflicts (170x)
- 32% gather_direct_conflicts (58,140x)
  - 28% hash map entry (58,140x)
  - 28% hash_one
- 15% check-explicit (29,240x)
36% parse_opt_value (170x)
- 36% react (170x)
- 36% start_custom_arg (170x)

From another slice of the problem, 27% of the time is still spent in memcmp

This seems to just be the overhead of a more dynamic design (storing things in a hashmap) compared to a more static design (knowing all arguments at compile time).

trevor-crypto commented 1 year ago

Thanks for looking into this @epage !

I forgot to run my tests in --release, so that is probably why the perf seemed so different. I updated my comment to show that it was a debug build.

In regards to the main issue of the stack overflowing (on SGX and maybe other low mem systems) when there is a massive arg struct, is there any way to offload the parsing to heapspace?

epage commented 1 year ago

The stack overflow isn't in the parsing but in our defining of the command line. We are creating a lot of temporaries. I'm unsure why its not reclaiming stack space during that and what we can do to force it to clean up stack space.

epage commented 1 year ago

While looking at another issue, I found an optimization that I could make that took the test case from ~10ms on my machine to ~3ms.

clap-rs / clap