blyxxyz / lexopt

Minimalist pedantic command line parser
MIT License
294 stars 9 forks source link

The position() function can be tricky to use in while-let loops #3

Closed nordzilla closed 3 years ago

nordzilla commented 3 years ago

Consider this code:

fn find_first_long_value_with_position() -> Result<Option<(usize, String)>, Error> {
    let mut p = parse("-s --long");
    while let Some(arg) = p.next()? {
        match arg {
            Long(val) => return Ok(Some((p.position(), val.to_owned()))),
            _ => (),
        }
    }
    Ok(None)
}

Unfortunately, this code does not compile due to a double borrow:

|
|   while let Some(arg) = p.next()? {
|                         - mutable borrow occurs here
|       match arg {
|           Long(val) => return Ok(Some((p.position(), val.to_owned()))),
|                                        ^             --- mutable borrow later used here
|                                        |
|                                        immutable borrow occurs here

There are several ways around this; one of them is to simply swap the order of the tuple:

fn find_first_long_value_with_position() -> Result<Option<(String, usize)>, Error> {
    let mut p = parse("-s --long");
    while let Some(arg) = p.next()? {
        match arg {
            Long(val) => return Ok(Some((val.to_owned(), p.position()))),
            _ => (),
        }
    }
    Ok(None)
}

More generally, we just need to ensure that val's non-lexical lifetime has ended before the call to p.position(), so even with the original tuple ordering, we could do something like:

match arg {
    Long(val) => {
        let val = val.to_owned();
        return Ok(Some((p.position(), val)));
    }
    _ => (),
}

I don't think there is a clean way around this while still exposing position() as a public function that takes &self as an argument. After all, even bin_name() has the same difficulties, but I don't think anyone will be generating tuples of arguments with the bin name.

In my opinion, there are several paths forward:

1) Revert the position() functionality that was implemented in #2. This would make me a bit sad, because I'd really like to see this functionality available in this library, but I would understand.

2) Leave things the way they are, and thoroughly document how to properly navigate the borrowing nuances. The current functionality is correct and fully usable. It just has some pain points with the borrow checker if you're not careful.

3) Make the position() function private (or remove it), and change the definition of Arg to always include the position:

pub enum Arg<'a> {
    /// A short option, e.g. `-q`.
    Short(usize, char),
    /// A long option, e.g. `--verbose`. (The dashes are not included.)
    Long(usize, &'a str),
    /// A positional argument, e.g. `/dev/null`.
    Value(usize, OsString),
}

Users who don't care about the position could just ignore it in the match:

match arg {
    Short(_, val) => unimplemented!(),
    Long(_, val) => unimplemented!(),
    Value(_, val) => unimplemented!(),
}

4) Make the position() function private (or remove it), and make the position data member an Rc<Cell<Option<usize>>>. Then handle the complexity internally, providing two ways to get the next Arg: one that also provides the position, and one that does not.

We could add a function like the following:

pub fn next_enumerated(&mut self) -> Result<Option<(usize, Arg<'_>)>, Error> {
    let pos = self.position.clone(); // Just cloning the Rc
    self.next().map(|option| option.map(|arg| (pos.get().expect("argument count overflow"), arg)))
}

This is more in line with my original suggestion of creating an EnumeratedParser, except we're only adding one new function.

I like option 4) the most because it supports the position functionality without altering the existing functionality, and without user-facing borrow-checker complexity.

Importantly, this is still a breaking change, and so is adding the position() function in the first place. Initially, I thought it wasn't, but I realized that it is after giving it more thought. (See example playground.) I understand if this might change your opinion on the matter, possibly even toward option 1.

For more context, my personal use case for this function is something more like this:

pub struct Token;

// Clone trait bound is just to prevent conflict with the blanket
// implementation of impl<T> From<T> for T
impl<T: Clone> From<T> for Token {
    fn from(_: T) -> Self {
        Self {}
    }
}

#[derive(Debug)]
pub struct EnumeratedToken {
    position: usize,
    token: Token,
}

impl EnumeratedToken {
    fn new(position: usize, token: Token) -> Self {
        Self { position, token }
    }
}

fn lex_enumerated_tokens() -> Result<Vec<EnumeratedToken>, Error> {
    let mut p = lexopt::Parser::from_env();
    let mut tokens = Vec::with_capacity(std::env::args_os().len());
    while let Some((position, arg)) = p.next_enumerated()? {
        match arg {
            Short(val) => tokens.push(EnumeratedToken::new(position, val.try_into()?)),
            Long(val) => tokens.push(EnumeratedToken::new(position, val.try_into()?)),
            Value(val) => tokens.push(EnumeratedToken::new(position, val.try_into()?)),
        }
    }
    Ok(tokens)
}

I also want to note that whatever decision we come to (if it requires implementing new functionality), I'm happy to submit a pull request or do some code review, if you're open to such contributions.

Thanks again for your hard work on this!

blyxxyz commented 3 years ago

I think this only comes up if you're working on such an abstract level that you don't actually care about the option contents yet. In more typical code you'd match on Long("foo") and then stop borrowing from Parser. So it's acceptable if it's a bit clunky, you won't be writing it a lot of times.

It would be nice if the borrow from Parser::next() were somehow considered immutable instead of mutable, since it can't be used to mutate Parser after next returns, but I don't think there's a way to fit that into Rust's type system.

It would alternatively be nice to (have a way to) decouple arg's lifetime from Parser's entirely, see this reddit thread. That would solve other problems that can't be solved by just reordering expressions.

Importantly, this is still a breaking change, and so is adding the position() function in the first place. Initially, I thought it wasn't, but I realized that it is after giving it more thought. (See example playground.) I understand if this might change your opinion on the matter, possibly even toward option 1.

Because it might conflict with a trait method? I think that's usually considered acceptable breakage, and I don't think implementing a trait for Parser should be very common.

Vec::with_capacity(std::env::args_os().len())

Typically you'd want to call args_os() only once, it copies all the arguments into a Vec before it returns.

You're also over-estimating by one because you're including the first argument.

(This might warrant some sort of note in from_env()'s docs.)


All that said, I am reconsidering the solution to #2 after sleeping on it another night. I recommended against the solution in the gist because it looked ugly, but would it be enough? I think position() is pretty niche, and the Rc<RefCell<>> workaround should function just fine. It might even help with #4.

nordzilla commented 3 years ago

Because it might conflict with a trait method? I think that's usually considered acceptable breakage, and I don't think implementing a trait for Parser should be very common.

Fair enough. I know it would be incredibly uncommon, but I just wanted to double check since it can break code. I don't currently maintain any library code myself, and I am still learning about the nuances of what people consider breaking changes and how to handle them.

Typically you'd want to call args_os() only once, it copies all the arguments into a Vec before it returns. You're also over-estimating by one because you're including the first argument.

That is definitely unfortunate about the Vec. However, the over-estimation is fine. Since the arg count <= token count due to unseparated short options and combined short options.

All that said, I am reconsidering the solution to #2 after sleeping on it another night. I recommended against the solution in the gist because it looked ugly, but would it be enough?

Yes, that's understandable. My hope was that we might find a clean way to enable first-class support for positions, but the Rc<RefCell<>> workaround is still viable, and I'm still open to experimenting with that.

Though I am curious to hear your thoughts about reconsidering the implementation in #2 in general, given that you also said, "I think this only comes up if you're working on such an abstract level that you don't actually care about the option contents yet... So it's acceptable if it's a bit clunky, you won't be writing it a lot of times."

Ultimately, the decision is up to you. Thank you for taking the time to entertain my thoughts, and thanks for this library!

nordzilla commented 3 years ago

Just an update, I have modified my code to use the Rc workaround. I did it slightly differently than how you did it in the gist. Rather than wrapping an Rc<RefCell<I>> around the entire iterator, I just wrapped an Rc<Cell<usize>> around the position counter and kept that externally, still moving the iterator inside of lexopt::Parser.

I also was able to achieve the best of both words based on whether the iterator is ExactSize or not:

pub struct Lexer;

impl Lexer {
    pub fn lex_from_env() -> Result<Vec<EnumeratedToken>, Error> {
        Self::lex_exact_size_args(std::env::args_os())
    }

    pub fn lex_args<I>(args: I) -> Result<Vec<EnumeratedToken>, Error>
    where
        I: IntoIterator + 'static,
        I::Item: Into<OsString>,
    {
        Self::lex(args, Vec::new())
    }

    pub fn lex_exact_size_args<I>(args: I) -> Result<Vec<EnumeratedToken>, Error>
    where
        I: IntoIterator + 'static,
        I::Item: Into<OsString>,
        <I as IntoIterator>::IntoIter: ExactSizeIterator,
    {
        let args = args.into_iter();
        let tokens = Vec::with_capacity(2 * args.len());
        Self::lex(args, tokens)
    }

    fn lex<I>(args: I, mut tokens: Vec<EnumeratedToken>) -> Result<Vec<EnumeratedToken>, Error>
    where
        I: IntoIterator + 'static,
        I::Item: Into<OsString>,
    {
        let position = Rc::new(Cell::new(0));
        let mut parser = lexopt::Parser::from_args(EnumeratedIterator::new(args, position.clone()));

        while let Some(arg) = parser.next()? {
            tokens.push(EnumeratedToken::new(position.get(), Token::try_from(arg)?));
        }

        Ok(tokens)
    }
}

Overall, I'm fairly happy with this functionality. It's a little clunky with the external Rc position counter, but not terrible.

blyxxyz commented 3 years ago

I am still learning about the nuances of what people consider breaking changes and how to handle them.

RFC 1105 may be of interest. (Particularly this section.)

Though I am curious to hear your thoughts about reconsidering the implementation in #2 in general

For now I'm tending toward reverting it and maybe turning the gist into an example.

I'm very interested in seeing how you use the library, and maybe that could lead to changes to the API down the line. For the time being I'm biased against changes that have workarounds.

You could vendor lexopt if you want to experiment with things that'd require internal changes.

Rather than wrapping an Rc<RefCell<I>> around the entire iterator, I just wrapped an Rc<Cell<usize>> around the position counter and kept that externally, still moving the iterator inside of lexopt::Parser.

Nice! That's more elegant and efficient than what I came up with.

ExactSize

Since you're not relying on the size being 100% exact, maybe Iterator::size_hint() would be more appropriate? That also lets you get rid of the trait bound.

nordzilla commented 3 years ago

RFC 1105 may be of interest. (Particularly this section.)

Thanks for linking that! This particular section is exactly the case I was worried about and demonstrated in the playground. I will be sure to read through more of that RFC.

Since you're not relying on the size being 100% exact, maybe Iterator::size_hint() would be more appropriate? That also lets you get rid of the trait bound.

Ah yes, great idea! size_hint() would definitely be the most appropriate thing to do here, because then the upper bound would just be Some(usize) for the exact-size iterators. Thanks for the suggestion.

Really looking forward to continuing to use lexopt in my project.

nordzilla commented 3 years ago

For now I'm tending toward reverting it and maybe turning the gist into an example.

I think I've settled on a way of using the Rc<Cell<usize>> that makes it feel a little bit less work-aroundish, even though it's essentially the same as before. Of course, you're welcome to use or change any/all of it if you add a similar example:

PositionTransmittingIterator

use std::{cell::Cell, rc::Rc};

pub struct PositionReceiver(Rc<Cell<usize>>);

impl PositionReceiver {
    pub fn position(&self) -> usize {
        self.0.get()
    }
}

pub struct PositionTransmittingIterator<I: Iterator> {
    iter: I,
    position: Rc<Cell<usize>>,
}

impl<I: IntoIterator> From<I> for PositionTransmittingIterator<<I as IntoIterator>::IntoIter> {
    fn from(iter: I) -> Self {
        Self {
            iter: iter.into_iter(),
            position: Rc::new(Cell::new(0)),
        }
    }
}

impl<I: Iterator> PositionTransmittingIterator<I> {
    pub fn position_receiver(&self) -> PositionReceiver {
        PositionReceiver(self.position.clone())
    }
}

impl<I: Iterator> Iterator for PositionTransmittingIterator<I> {
    type Item = I::Item;

    fn next(&mut self) -> Option<I::Item> {
        self.position.replace(self.position.get() + 1);
        self.iter.next()
    }
}

Lexer

pub struct Lexer;

impl Lexer {
    pub fn lex_args<I>(args: I) -> Result<Vec<EnumeratedToken>, Error>
    where
        I: IntoIterator + 'static,
        I::Item: Into<OsString>,
    {
        let args = PositionTransmittingIterator::from(args);
        let rx = args.position_receiver();
        let mut tokens = match args.size_hint() {
            (_, Some(upper)) => Vec::with_capacity(2 * upper),
            (lower, None) => Vec::with_capacity(2 * lower),
        };
        let mut parser = lexopt::Parser::from_args(args);

        while let Some(arg) = parser.next()? {
            match arg {
                arg @ lexopt::Arg::Short(_) => {
                    let token = Token::try_from(arg)?.enumerate_with(rx.position());
                    tokens
                        .last_mut()
                        .and_then(|last| last.try_combine_with(&token))
                        .unwrap_or_else(|| tokens.push(token))
                }
                arg => tokens.push(Token::try_from(arg)?.enumerate_with(rx.position())),
            }
        }

        Ok(tokens)
    }

    pub fn lex_from_env() -> Result<Vec<EnumeratedToken>, Error> {
        Self::lex_args(std::env::args_os().into_iter().skip(1))
    }
}

Lastly, I think we can reasonably close this issue and #4 at this point, but I will leave that up to you.

Thank you again so much for all of your consideration and help.

blyxxyz commented 3 years ago

I've reverted the commit and might add an example later. Thank you again for your thoughts.