Rollback iterator to avoid cloning tokens

Bromeon commented 1 year ago

We have now 4 occurrences of this:

// TODO consider multiple-lookahead instead of potentially cloning many tokens

And in the last days I was thinking how to approach that best. First I had the idea of a multi-peek iterator (which would allow looking ahead several tokens instead of just 1). While thinking how to apply this in venial, another idea came up, which seemed more ergonomic to me: a rollback iterator, meaning that you would normally call next() but could rollback at any time to a previously set backup point.

Rough sketch:

fn consume_stuff(tokens: &mut TokenIter) {
    let tokens = tokens.start_attempt(); // new iterator type, or runtime bool flag

    ... // call next() many times

    if not_what_i_expected {
        tokens.rollback(); // revert to state at beginning of the function
    }

} // drop would commit the attempt

(it's also possible to have explicit commit and rollback-on-drop instead)

This can be implemented via VecDeque, which keeps the tokens since the backup point in the buffer. When rolling back, the iterator would pop_front() the buffer first, before calling next() on the underlying iterator. Otherwise it would advance as usual.

One thing to note: for rollbacks to work, the basic TokenIter type which is currently used everywhere would have to be a different iterator type that supports this buffer. It's not possible to only selectively use a rollback iterator in the places where rollback is used, because the more general TokenIter cannot rewind. So we would need to think which of the Iterator methods are worth overriding. Probably size_hint(), but I'm a bit reluctant to touch all the others for possible optimization, at the big risk of introducing bugs.

What do you think about this?

PoignardAzur commented 1 year ago

Some scattered thoughts:

I think this might impact performance enough that we might want to do some benchmarks between multiple approaches.
The cases where we need rollback seem sparse enough that we might want a TokenIterWithRollback type separate from TokenIter.
I feel like a multi-peek iterator might still be worth considering. There aren't that many cases where you need more than one token of lookahead to parse Rust. On the other hand, I guess a rollback iterator is just an advanced form of multi-peak.
If we can expect that the vast majority of cases will only roll back a few tokens, we might want to use a SmallVec or some similar data structure to avoid allocations. Though I wouldn't want to add another dependency unless the performance benefit was really big.

Bromeon commented 1 year ago

I think this might impact performance enough that we might want to do some benchmarks between multiple approaches.

There's currently no infrastructure for benchmarks in the library, but I remember you did some benchmarking on your own in the past. I assume those were mostly ad-hoc scripts/commands and not something you could add in a benches directory?

The cases where we need rollback seem sparse enough that we might want a TokenIterWithRollback type separate from TokenIter.

venial's tree-like architecture means that even specialized parsing methods (that might be rollbacked) typically just act as a part of a large parse_declaration() call, and that there may be many parsing methods afterwards.

What you suggest would mean that we could convert TokenIter -> RollbackIter, do the specialized operations, and then losslessly convert back RollbackIter -> TokenIter and continue with the original iterator. This is not the case however -- we cannot create an original TokenIter (that lacks an internal buffer) from a possibly-advanced RollbackIter, because the original one can only walk forward, and this walking has already been done.

So I see only two options: clone the iterator (current approach), or transparently retain a buffer in TokenIter, too.

What we can do is have another iterator type during the "attempt" phase where a backup point is set. This is more for type safety (type state pattern) and micro-optimization (less branches), not a semantic requirement. My hope is that such iterators would be declared locally and not infest many function signatures, otherwise we might need to think about a runtime flag or make many functions generic.

I feel like a multi-peek iterator might still be worth considering. There aren't that many cases where you need more than one token of lookahead to parse Rust. On the other hand, I guess a rollback iterator is just an advanced form of multi-peak.

Yes, they are both very similar and can be implemented with an internal VecDeque buffer. What I wanted to avoid is that in addition to the existing next()/peek() pattern, we would need to write parsing code completely differently using peek()/peek_nth(2), when rollbacking is needed.

If we can expect that the vast majority of cases will only roll back a few tokens, we might want to use a SmallVec or some similar data structure to avoid allocations. Though I wouldn't want to add another dependency unless the performance benefit was really big.

Yes, a very low-hanging fruit would just be to use [Option<I::Item>; 3] instead of VecDeque, i.e. a fixed-size ringbuffer that works for the amount of lookahead we actually need. I agree with not adding dependencies.

PoignardAzur commented 1 year ago

What we can do is have another iterator type during the "attempt" phase where a backup point is set. This is more for type safety (type state pattern) and micro-optimization (less branches), not a semantic requirement. My hope is that such iterators would be declared locally and not infest many function signatures, otherwise we might need to think about a runtime flag or make many functions generic.

Yes, that's what I was thinking. I was mostly thinking about the "micro-optimization" part (I'd rather not add a branch to every single token read if they're only needed for a very small subset of the parsed programs).

Yes, a very low-hanging fruit would just be to use [Option; 3] instead of VecDeque, i.e. a fixed-size ringbuffer that works for the amount of lookahead we actually need. I agree with not adding dependencies.

I guess it depends on whether we can estimate the lookahead needed and we find that it's always a small number. In that case then, yeah, a small static buffer would make sense.

There's currently no infrastructure for benchmarks in the library, but I remember you did some benchmarking on your own in the past. I assume those were mostly ad-hoc scripts/commands and not something you could add in a benches directory?

Yup, it was mostly ad-hoc. I guess making a benches directory would be a good idea regardless of this issue.

dmgolembiowski commented 1 year ago

@PoignardAzur Random idea:

Could the rollback iterator be made by injecting BEGIN_<uuid> tokens that are removed at compile time, but still present a type signature?

E.g.

quote! {
    let #rollback_rand_ident = {
        if false {
           const rollback_ty = gen_placeholder_type!();
            <BEGIN_#rollback_rand_ident as #rollback_ty >::new()
        } else {
           () as _
        }
    };
}

PoignardAzur commented 1 year ago

I'm not sure I visualize what you're proposing, but that doesn't really look like a solution that fits our problem space.

For one thing, there will be places where we want to rollback that aren't expressions and where the grammar wouldn't accept those BEGIN_<uuid> tokens.

PoignardAzur / venial

Rollback iterator to avoid cloning tokens #41