trevor-crypto opened this issue 1 year ago
Happen to be able to get a crash dump? I'm curious which code hits a stack overflow.
`clap::Arg` is fairly big. I hope to shrink it as part of https://github.com/clap-rs/clap/discussions/3476. rustc and llvm don't do an efficient job of removing stack copies of large types like `clap::Arg`: https://arewestackefficientyet.com/

If this is when we read `ArgMatches` to populate the struct, then I have no idea what could be improved about that.
EDIT: I tried to reproduce this on x64 Linux but the compile times for hundreds of args was just becoming too much for me.
I suspect a workaround would be to break the `Args` up into multiple structs and use `#[command(flatten)]`. That should force stack space to get reclaimed.
Thanks for the quick response @epage. I will try your suggestion of breaking up the args into structs then flattening.
Here is a stacktrace:

```
(gdb) bt
#0  0x00007fffc00b2456 in clap_test::{impl#5}::augment_args (__clap_app=...) at src/main.rs:3
#1  0x00007fffc0073629 in clap_test::{impl#3}::command () at src/main.rs:3
#2  0x00007fffc0061010 in clap::derive::Parser::parse<clap_test::A> ()
    at ~/.cargo/registry/src/github.com-1ecc6299db9ec823/clap-4.0.27/src/derive.rs:82
#3  0x00007fffc0070cb1 in clap_test::main () at src/main.rs:349
```
Seems like it comes from the `clap::Args` derive macro's generation of `augment_args`?
Huh, so it is `augment_args`. Nothing is being created on the stack. We do have temporaries being created for passing into functions. There are likely also a lot of moves happening.
I wonder if there is a way to restructure things to reduce all of this.
I will try to restructure the giant struct and get back to you.
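A hedged illustration of the temporaries-and-moves effect discussed above (this is not clap's actual code, just a toy builder): when a large type is threaded through chained by-value calls, each call site materializes a full-size temporary, and unoptimized builds often give every temporary its own stack slot instead of reusing one.

```rust
// Toy builder that takes and returns `self` by value, like the
// `Command`/`Arg` builder pattern. The payload stands in for a
// "fairly big" type such as clap::Arg.
struct BigBuilder {
    payload: [u8; 1024],
    count: usize,
}

impl BigBuilder {
    fn new() -> Self {
        Self { payload: [0; 1024], count: 0 }
    }

    fn arg(mut self, _name: &str) -> Self {
        self.count += 1;
        self // moved out again: another full-size temporary at the call site
    }
}

fn main() {
    // In a debug build, a long chain like this can keep many 1 KiB
    // temporaries live at once; release builds usually collapse them.
    let b = BigBuilder::new()
        .arg("a").arg("b").arg("c").arg("d").arg("e")
        .arg("f").arg("g").arg("h").arg("i").arg("j");
    println!("{} args, {} bytes of payload each", b.count, b.payload.len());
}
```

This would be consistent with the report that `--release` works while the debug profile overflows: the optimizer is what reclaims those slots.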
I was able to split the args struct and the issue goes away, as you suspected. Though that works, the parsing actually takes quite a while compared to `argh` (which we're currently using with a massive config struct).
Updated playground with broken code
How far off is the performance between the two?
What is the performance situation? Large number of argument definitions with very few present?
Just ran a quick test a few times, with all of the arguments present (debug build):

```
clap parse duration: 6.765064496s
argh parse duration: 58.132469ms
```
You can see the massive arg struct here, where `clap` is split up and flattened, and `argh` has it fully inline.
For reference, on a laptop, I'm getting

```
clap parse duration: 12.344762ms
argh parse duration: 1.14286ms
```

EDIT: Updated for release build
Either way, something for me to look into.
By the way, if you want something in between `argh` and `clap`, `bpaf` might be something to try. My biggest concern over it is some usability aspects due to its functional heritage.
According to callgrind, most of the time is spent in `memcmp`s (40+%). This makes sense because in clap v4, we switched to some naive data types for lower binary overhead, with the assumption that there weren't a lot of unique arguments but instead that a large argv would be taken up by a single positional argument passed in from xargs. In general, this is a use case where clap's dynamic design can't compete against parsers with a more static design (as confirmed by callgrind below). Something I want to explore is a more hybrid approach for clap so it can be dynamic where needed and static elsewhere.
Switching to clap HEAD (which shouldn't matter), I'm getting

```
clap parse duration: 10.297272ms
argh parse duration: 1.216893ms
```

Switching out for `HashMap` with the default hasher, I'm getting

```
clap parse duration: 4.226175ms
argh parse duration: 1.198631ms
```
We could possibly put that behind a feature flag. It will require some breaking changes though and can't be done until 5.0.
With that, callgrind is saying the biggest higher-up functions are:
From another slice of the problem, 27% of the time is still spent in memcmp
This seems to just be the overhead of a more dynamic design (storing things in a hashmap) compared to a more static design (knowing all arguments at compile time).
Thanks for looking into this @epage !
I forgot to run my tests in `--release`, so that is probably why the perf seemed so different. I updated my comment to show that it was a debug build.
Regarding the main issue of the stack overflowing (on SGX and maybe other low-memory systems) when there is a massive arg struct: is there any way to offload the parsing to heap space?
The stack overflow isn't in the parsing but in our defining of the command line. We are creating a lot of temporaries. I'm unsure why it's not reclaiming stack space during that, or what we can do to force it to clean up stack space.
While looking at another issue, I found an optimization that I could make that took the test case from ~10ms on my machine to ~3ms.
Rust Version
rustc 1.67.0-nightly (a28f3c88e 2022-11-20)
Clap Version
4.0.27
Minimal reproducible code

Too large to paste here: https://gist.github.com/rust-play/708d645d1c26354657fb8df945718ccc
.cargo/config:

Cargo.toml:

Steps to reproduce the bug with the above code

- Use the `x86_64-fortanix-unknown-sgx` target
- `cargo run -- --help` using the minimal reproducible code above
- `cargo run --release -- --help`
Actual Behaviour
Expected Behaviour
The `--debug` profile should work correctly, as `--release` does, and not trigger a stack overflow.

Additional Context

I know this is an obscure issue due to the fact that the number of args is insanely large, but I actually have a use case similar to this, since with Fortanix Intel SGX you do not have I/O for a config file, or the ability to use environment variables. I have already increased `stack-size`, but it doesn't seem to help after a certain point.

I don't expect this to be addressed, so it would be great to just have some insight into how to "please" `clap` (clap's parser?) so that it doesn't stack overflow.

Debug Output
None, unfortunately.