epage / clapng

A full featured, fast Command Line Argument Parser for Rust
docs.rs/clap
Apache License 2.0
0 stars 0 forks source link

Allow Enum Variants as Arg|ArgGroup|SubCommand keys #82

Open epage opened 2 years ago

epage commented 2 years ago

Issue by kbknapp Sunday Nov 12, 2017 at 20:11 GMT Originally opened as https://github.com/clap-rs/clap/issues/1104


I.e. remove the "stringly typed" nature of clap

How this is done internally is up for debate and the reason for this issue. See the related topic #1041

Basically we want to Hash whatever key is given and store that instead of a &str/String/Cow<_>.

epage commented 2 years ago

Comment by kbknapp Sunday Nov 12, 2017 at 20:13 GMT


Even without a derivable component we'll need to iron out the constraints that are part of clap which ArgMatches::value_of accepts instead of &str

epage commented 2 years ago

Comment by Coder-256 Thursday May 30, 2019 at 06:04 GMT


Is this still being considered? I really like this approach for a number of reasons. I saw the recent blog post didn't mention it which is why I am asking.

I'm not sure if this is possible, and I think it this would rely on rust-lang/rfcs#2593. Anyway I imagine the following scenario. The user defines an enum. Each variant represents an argument. You call ArgMatches::get_value<Variant: Args>() -> &str where Variant is the variant type itself (see the linked RFC). That function gives you an instance of T, and you can get information from the associated value.

For example (pseudocode):

// in Clap:

struct Arg<Variant> {
    // everything here remains the same
}

// here `Args` is the enum itself
struct App<Args> {
    ...
    fn arg<T: Into<Arg<Self::Args>>>(self, a: T) -> Self
    fn get_matches(self) -> ArgMatches<Self::Args>
}

// here `Args` is the enum itself
impl ArgMatches<Args> {
    ...
    fn get_value<Variant: Self::Args> -> &str {
        ...
    }
    fn get_values ...
}

// user code (no argument name strings!):

enum MyArgs {
    Username,
}

let matches = App<MyArgs>::new("foo")
    .arg(
        Arg<MyArgs::Username>::new().short("u")
    )
    .get_matches();

let username = matches.get_value::<MyArgs::Username>();

Alternatively, if the Rust team ends up deciding to allow you to restrict enum variants more (not currently planned AFAIK):

// in Clap:

// here `Args` refers to a variant of an enum
struct Arg<Variant: Args> {
    fn new_single() { ... } // Only allowed if the variant has an associated string
    fn new_multiple() { ... } // Only allowed if the variant has an associated iterator
    fn new_boolean() { ... } // Only allowed if the variant has an associated boolean
}

// here `Args` is the enum itself
impl ArgMatches<Args> {
    ...
    fn get_value_automatic<Variant: Self::Args> -> Variant {
        ...
    }
}

// user code (no argument name strings!):

enum MyArgs {
    Usernames(IntoIterator<String>),
}

let matches = App<MyArgs>::new("foo")
    .arg(
        Arg<MyArgs::Usernames>::new_multiple().short("u")
    )
    .get_matches();

let MyArgs::Usernames(usernames) = matches.get_value();

This way, there is a single function get_value that returns the appropriate return type for any variant you give it. That means for example you can't pass a multi-argument to ArgMatches::value_of like you can today.

Anyway this is all still up in the air anyway since it seems like the Rust team hasn't settled on exactly what features they plan to add to enums. What do you think?

epage commented 2 years ago

Comment by iamalicecarroll Thursday Jan 14, 2021 at 17:37 GMT


Is there a workaround to this?

epage commented 2 years ago

Comment by epage Monday Jul 19, 2021 at 18:56 GMT


I tried to read up on past effort for enums but the blog post was removed as part of the new site and for some reason the wayback machine doesn't have that one article

@kbknapp happen to have this hanging around somewhere?

epage commented 2 years ago

Comment by kbknapp Monday Jul 19, 2021 at 19:09 GMT


I believe I do. I'm on mobile at the moment and will look it up shortly once I get to a computer.

epage commented 2 years ago

Comment by kbknapp Monday Jul 19, 2021 at 21:07 GMT


I found it.

Here is the post content # Issue: Removing Strings One of the biggest complaints about clap is the “stringly typed” nature of arguments. You give them a string name, and retrieve their value by string. Where this goes wrong is if you typo a string name, you won’t know until you hit that code path. Combined with clap goes to decent lengths to ensure the clap::ArgMatches struct keeps memory usage low which means stripping out any arguments which weren’t used at runtime. This means the clap::ArgMatches struct doesn’t know if you typo’ed an argument value, or it simply wasn’t used at runtime. The solution to above problem would be keeping a list of all arguments in the clap::ArgMatches struct. However, for CLIs that have hundreds of arguments, the memory usage adds up quickly. So there are two (actually three) problems being conflated here: * Users can typo argument names with no compiler help catching this (the can actually typo names in a number of places, definition *and* each individual access) * clap::ArgMatches has no knowledge of defined arguments * During parsing, clap must store, iterate, and compare all these strings many, many times which would be much more efficient with other types. ## Challenge: Typo’ing Argument Names The solution to this would be to use enum variants. This way the compiler can tell you, no such enum variant exists if you typo it. The question then is how to store such a value in each Argument (and clap) for parsing to be able to compare. Another constraint placed on myself, is I don’t want existing clap users to have to upgrade en masse. If you’re happy using Strings, or you have a CLI that works, why spend the resources to change it. That means the requirements are: * Allow enum variants to be used * The solution must also allow for strings to provide an easy upgrade path Before I talk about the solutions I tried, and ultimately chose let me speak about the other challenges. ## Challenge: clap::ArgMatches has no knowledge of defined arguments Ok, so now we have to come up with a solution which also: * Is efficient enough, we can afford to store all defined arguments in a clap::ArgMatches If we quickly look at the current naive solution of storing Strings, so we know our upper bound, it would look something like (note: the actual types are nested and shown here as essentially primitives for brevity): ```rust struct ArgMatchesV2 { args: Vec<(String, Vec)>, subcommand: Box ... } struct ArgMatchesV3 { args: Vec<(String, Option>)>, subcommand: Vec<(String, Box)> ... } ``` I put the `subcomamnd` field in this example, because it’s important to keep in mind this math can explode if multiple levels of sucommands are used. That doesn’t look different! You’re right. However, if one used two arguments at runtime (out of a possible valid 100 for instance) `args` will only consist of two elements. Whereas in v3 it’ll consist of the full 100 elements. Since a String is essentially a Vec, and a Vec is a pointer plus length (usize) plus capacity (usize). That means it’s an additional 98 * (usize*3) *plus* whatever the actual String bytes are that are allocated, which if we just say an average of five bytes per String for simple math, and on my system a pointer/usize is 8 bytes, the difference between the above ArgMatchesV2 and ArgMatchesV3 is: * V2 = 2*((8*3)+(5*5)) = 2*49 = 98 bytes. * V3 = 100*49 = 4,900 bytes. That’s just args. Look at subcommands: Same math as before, but with the way subcommand work you can only ever have a single subcommand per “level”. However, now we’d need to store all possible subcommands, which even if Box::new doesn’t allocate for all the subcommands we didn’t use it’s still an additional String (49 bytes on average) plus a pointer (8 bytes). Now subcommand numbers *usually* don’t go as high as args, possibly 30-50 at most. Let’s just say 25 for easy math: * V2 = 1*8 = 8 bytes * V3 = 25*(49+8) = 1,425 bytes Giving us a grand total of: * V2 = 106 bytes * V3 = 6,325 bytes Ok we have our baseline. ## Challenge: Iterating and comparing strings is slow Ok, whatever solution we go with, iterating Arg struct (and thus whatever this new field is) should be faster than a bunch of Strings. Since we’ve already gone over the math and breakdown of a String, we know it involves a heap lookup (Since Strings are a essentially Vec), so if we can avoid touching the heap that’d be gravy. ## Solution 1: Generics My first attempt at solving this was to use a generic type bound on PartialEq. While implementing this, I had an icky feeling though. All my type signatures were changing to clap::Arg which expands out to clap::ArgMatches. This has a few unfortunate side affects: * People would have to annotate their types, making upgrading to clap v3 a pain * T might be huge, which doesn’t fix the clap::ArgMatches low memory requirement * clap codegen booms because you might have multiple T’s throughout your CLI (this is problem for one of the later challenges) So this doesn’t seem like a good route. Next I tried Trait Objects ## Solution 2: Trait Objects Ok, so what if instead of a type T: ParitalEq I just stored &’a PartialEq (or Box)? This fixes the problem of T being huge, now T is just a pointer (8 bytes). It also fixes the type signature as people still would just use clap::Arg. However, I’m still touching the heap each time I want to compare or find an argument. I put this on the back burner as a *maybe*. ## Solution 3: Hashing Ok, so what if I simply hash T and get a consistent output (I picked u64, or 8 bytes)? Sure, I’ll have to hash all 100 T’s at definition time, but clap iterates and compares each argument *far* more than once per argument. So it seems a small price to pay, especially if I only need a fast hashing algorithm instead of a super secure one. So I picked a FNV hash implementation which is insanely fast. Hashing a T using FNV is only a few ns for all the types of T I tested. Using this, clap::Arg can now store an `id: u64` (8 bytes vs the 24 bytes of a String), iterating and comparing a bunch of u64’s is *significantly* faster than a bunch of Strings. Without going through all the math again, here are grand totals using u64 instead of String (not to mention u64 doesn’t allocate any additional bytes for it’s content): * V2 = args(2*8)+subcommands(8) = 24 bytes * V3 = args(100*8)+subcommands(25*8) = 1,000 bytes Much improved. The jury is still out on whether I’ll provide the ability to know if an argument was used or not. 1,000 bytes (nearly 1KB) is still a good bit, compared to the 24 it could be. Granted, this is for a large CLI (100+ args and 25+ subcommands at a single level), so actual uses should be lower. We’ll see, although 1,000 bytes is much easier to swallow than over six times that. The biggest downside is each access now must be hashed. The reason I’m OK with this, is access is much less frequent, and typically only happens once. So assuming 100 arguments, we’re comparing 200 hashes with that of storing, iterating, and comparing a larger heap based type for more than 200 times. I haven’t counted lately, but due to how parsing and validation occurs the number of times an argument gets compared is in the order of (NUM_ARGS * NUM_USED_ARGS)*~3ish. Concretely that means, if you use an average of 5 arguments per run, out of a possible valid 100, there are roughly 1,500 comparisons of equality. For CLIs like ripgrep which can have hundreds or thousands of arguments per run, this number can dwarf 200 hashes easily. ## Status I’m in the middle of implementing this change. It’s somewhat conflated with re-working how the parsing happens in order to be more efficient. This is the last major change prior to releasing v3-beta1 which I’m hoping will happen in the coming weeks/month.
epage commented 2 years ago

Comment by epage Tuesday Jul 20, 2021 at 14:01 GMT


@pksunkara I propose this issue and #1663 be deferred to the clap4 milestone.

epage commented 2 years ago

Comment by epage Tuesday Jul 20, 2021 at 15:14 GMT


With that out of the way, some quick thoughts:

Clap has several nearly distinct APIs

In considering our care abouts, we need to consider each API and what the expected role of it is.

For example, if the expectation is someone will probably use the derive, unless they want to get rid of the overhead of proc macros, then derives for custom keys (#1663) seems contradictory.

Similarly, the derive API prevents the problems with stringly typed APIs. On a related note, I'm unsure if we can generate the needed enums for the derive API to use anything but strings. We can't see through to other enums and structs to generate a global list. We'd need to support a composable enum and then somehow allow all of the typing for App with subcommands to magically work. This would also bloat the binary because we'd have multiple copies of App and ArgMatches being compiled.

As for performance, while keys use less memory, &'static str are fast to compare by equality. I suspect Rust is short-circuiting with a pointer check. We just can't use it because sometimes people use generated strings.

We might be able to resolve the generics problem via type aliases (e.g. pub type App = BaseApp<&'static str>).

One problem with enums is that we default value_name to be the name. We can't do this with enums unless we require a Display or as_str() on it.

Regarding the hash work in the above blog post, one concern brought up at the time was collisions. I can't find the discussions to see how that was resolved.

I propose we rename App et all and make them generic. We then provide type aliases that set the generic parameter to &'static str. This will require those who need allocations to go through more hoops, though they could choose to use kstring::KString (#1041).

See playground for example code

Benefits

Downsides

epage commented 2 years ago

Comment by epage Tuesday Jul 20, 2021 at 17:22 GMT


Just wanted to call out https://github.com/clap-rs/clap/issues/604 is the issue for "error on typo"