cessen / ropey

A utf8 text rope for manipulating and editing large texts.
MIT License
1.04k stars 46 forks source link

Bidirectional iterators #18

Closed cessen closed 5 years ago

cessen commented 5 years ago

Some client code may want to iterate over e.g. chunks in both directions. For example, DFA regex need to scan backwards after finding a match to calculate where the match begins.

The tricky bit isn't technical, however. It's having a standard API to interoperate with other libraries. Unfortunately, the Rust standard library doesn't provide a bidirectional iterator trait, so this probably isn't useful to implement yet: any API I come up with will be specific to Ropey, and therefore wouldn't interoperate with other libraries anyway.

I'm creating this issue as a reminder to add this functionality if/when a bidirectional iterator trait (or equivalent) is added.

AljoschaMeyer commented 5 years ago

Considering you rejected DoubleEndedIterator on Lines, are you also opposed to DoubleEndedIterator on Chars? On Chars, it would be possible to do a fairly naive implementation that simply scans for character boundaries from the back of the string (it's still O(1) since there is a bounded amount of bytes to scan (4)).

cessen commented 5 years ago

Hi @AljoschaMeyer, thanks for the question!

It's not a matter of algorithmic complexity or implementation difficulty. All of Ropey's iterators (lines included) could implement DoubleEndedIterator in a reasonable way. The issue is that it's not the right API.

What you generally want for text is something along the lines of C++'s bidrectional iterator concept. It's essentially a "cursor" that you can move either forwards or backwards through the text as you please. The API surface is very simple: you just add a prev() method to complement the standard iterator's next() method. (Well... that, and we'll also want to provide methods for creating iterators starting from arbitrary points in the rope.)

The DoubleEndedIterator API doesn't allow anything like that, and is IMO very limited in scope and application. Implementing it wouldn't provide what you need for text processing generally.

Having said that, I now feel like I've been holding my breath for too long, waiting for an appropriate trait in std. I should probably just bite the bullet and hard-code this into Ropey's iterators for now. If an appropriate trait makes its way into std (or becomes widely used on crates.io) at some point in the future, I can adopt that then.

Once a prev() method (or whatever API turns out to make sense) has been added to the iterators, we can then pretty trivially implement DoubleEndedIterator in terms of that. I just don't want to maintain specialized code for something with such limited application.

I hope that answers your question!

AljoschaMeyer commented 5 years ago

Yes, that answers it, thanks. I personally don't need that full flexibility, only the ability to iterate backwards from the end to the beginning of the string. But your reasoning makes sense.

I'm not quite sure why to wait for a std trait though. xi-rope offers a Cursor, im-rs has a Focus. Ad-hoc polymorphism would be nice, but any std trait could always be retroactively implemented on the cursor struct.

Avi-D-coder commented 5 years ago

In using ropey for lsp-diff-server, text manipulation ergonomics was the biggest problem. Feature parity with String would be slightly better, but is still not great for large amounts of text.

A Cursor API is a good step, but I think unit(Line, Char, Byte, UTF-16 Char, ...) based indexing is a more general and ergonomic solution. I will try to make a proof of concept soon. Ideally all the iterators would be provided by a string units library.

cessen commented 5 years ago

@AljoschaMeyer

Yeah, that's basically what I expect I'll do now, except without introducing any new types (the existing iterators should work fine, just adding a prev() method or something along those lines).

The reason I'd ideally like to implement a std (or at least widely adopted) trait is a matter of interoperability. I can't expect, for example, a regex crate to care about Ropey. Instead, I need Ropey to provide a common API that other crates can consume. You're right, of course, that it can always be implemented later. But if the API doesn't match, that also makes Ropey's API's messier. But I guess I can always just mark things as deprecated then.

@Avi-D-coder

but I think unit(Line, Char, Byte, UTF-16 Char, ...) based indexing is a more general and ergonomic solution

I'm not sure what you mean here by "unit based indexing"? You can already index into a Ropey rope by line, char, and byte offset... so I assume that's not what you mean? But I'm not sure what else you could mean here.

AljoschaMeyer commented 5 years ago

@Avi-D-coder

A Cursor API is a good step, but I think unit(Line, Char, Byte, UTF-16 Char, ...) based indexing is a more general and ergonomic solution

But the time complexity is worse: iterating via a properly implemented focus is O(n), random access of each element is O(n log n)

@cessen

Yup, thanks for hearing me out. Please don't feel pressured to implement this. I (most likely) won't make the time to contribute it, so neither do I except you to do so. I'm not blocked, and I'm grateful for the crate :)

cessen commented 5 years ago

@AljoschaMeyer

No worries! Thanks for bringing this up. I think I needed to be poked about it, ha ha. :-) I probably won't get to it for a bit--I'm currently pretty busy with other things. But I'll get to it when I have a reasonable chunk of time.

I expect I'll start with the Chunks iterator, since most of the other iterators are built on top of that. Working on the other iterators after that should be pretty straightforward, at least for an initial non-optimized implementation.

cessen commented 5 years ago

I've started implementing this in the bidir_iter branch. If people want to start kicking the tires, so to speak, I'd really appreciate any feedback (and bug reports)!

Current status:

cessen commented 5 years ago

All functionality has been implemented now, for all four types of iterators. I've also merged into master and deleted the bidir_iter branch. As always, testing is appreciated!

Documentation is extremely sparse still, so I'm leaving this issue open until I properly document things.

cessen commented 5 years ago

Okay, several bug fixes and documentation passes later, and I believe everything is ready.

(EDIT: the documentation for master can be found here. Probably important for getting feedback on the API! The documentation in the iter module is a good place to start for this new functionality.)

I do have one API question for those of you who are interested in this feature: right now the various chunks_at_*() methods return not just the Chunks iterator, but also the starting byte, char, and line-break indices of the chunk that the returned iterator starts at.

This mirrors the individual chunk-fetching methods, and I think it's the right API because without that information you don't actually know where you ended up in the rope. But it also feels a little awkward compared to the other iterator-making methods, including vanilla chunks(). And you can still technically get that information via a second call to the corresponding individual chunk-fetching method, albeit with the additional cost of that second call.

What do you guys think? I'm basically just trying to think if there are significant use-cases where you would want to create a chunk iterator this way and not need that information. And I think the answer is no. But I want to give a chance for some feedback before committing to the API.

If no one has any feedback in the next two weeks, I'll stick with what I've already implemented and make a new release (v1.1.0) with this new functionality.

AljoschaMeyer commented 5 years ago

The API looks solid, this does exactly what I would need. I ended up implementing cursors for a 2-3 tree and a persistent array, and it will be nice not to have to reimplement ropes as well.

Some minor wishes:

This mirrors the individual chunk-fetching methods, and I think it's the right API because without that information you don't actually know where you ended up in the rope.

Tentative agree.

cessen commented 5 years ago

How about a current method that returns the item to the right of the conceptual position

Oh, yeah, I like that. I think to keep with the "iterators are in-between items" concept, having two methods for forward and behind might make more sense, if only for keeping the mental model consistent. Maybe peek_next() and peek_prev().

If you'd like to take a crack at that yourself, let me know. I likely won't go for it myself before the next release, but I would definitely like Ropey to have this functionality.

In principle, foo_at_start can be implemented more efficiently than foo_at(0) because it could skip the comparisons when traversing the tree structure.

Iterator creation performance is already quite good IMO, so I'm not super inclined to further complicate the code to special-case-optimize start/end creation unless we actually need to. I also suspect that the performance gain would be marginal at best anyway, as I don't think the comparisons are taking much time (though admittedly I haven't measured). If start/end iterator creation in particular become an actual bottleneck for someone, of course, then I'll be happy to look into it. But until then I'd rather leave things as-is.

However, what I would like to do performance-wise is properly optimize the Lines iterator. Right now it's essentially just calling line(line_idx) repeatedly with an incrementing index. So it's way slower than it needs to be. I think it's inevitable for the Lines iterator to be the slowest of Ropey's iterators, but it's really, really brain-dead right now. I'm confident it can be much faster.

I've punted on a better implementation for now because I'm pretty sure Lines will be especially complex to implement efficiently, and I wanted to get something that reliably works out the door first. But I do want to go back and fix that at some point. It's an unfortunate standing wart until then.

AljoschaMeyer commented 5 years ago

I think to keep with the "iterators are in-between items" concept, having two methods for forward and behind might make more sense, if only for keeping the mental model consistent.

Good point. The current proposal stemmed from a different mental model: A cursor as an index into a (conceptual) linked list, with prev and next shifting the cursor and current retrieving the item it points at (which might be the nil anchor of the list). I do agree that in the position-in-between-elements model, supporting peek_next and peek_prev is more appropriate.

If you'd like to take a crack at that yourself, let me know. I likely won't go for it myself before the next release, but I would definitely like Ropey to have this functionality.

I can't offer more than a half-hearted "perhaps at some point in the rather far future" unfortunately. But I did take a look at the implementation, and found it very enjoyable to read/navigate. It would be lovely if the balancing scheme was clearly stated somewhere - is it a B-Tree?

cessen commented 5 years ago

It would be lovely if the balancing scheme was clearly stated somewhere - is it a B-Tree?

I have a design document that gives an overview of Ropey's design here: https://github.com/cessen/ropey/blob/master/design/design.md

And also, yes, Ropey is implemented as a B-Tree rope. :-)

cessen commented 5 years ago

It's been two weeks, and so far the only feedback is for additional features which can be added later. I also feel confident about the APIs now after having a chance to think about them more. So I'm calling this done, and will soon make a new release with these new iterator features.

cessen commented 5 years ago

v1.1.0 is now live on crates.io!

tadeokondrak commented 4 years ago

Once a prev() method (or whatever API turns out to make sense) has been added to the iterators, we can then pretty trivially implement DoubleEndedIterator in terms of that.

Is there a reason this wasn't done or was it just an oversight? The .rev() method on Chars would be useful to me.

cessen commented 4 years ago

@tadeokondrak

Ah, it's mostly an oversight, yeah. Sorry about that!

Having said that, it turns out there are some mildly tricky things I hadn't thought about at first, due to being able to construct iterators at any point in the rope. For example, what would you expect rev() to do if the iterator is constructed to be in the middle of the rope? Would the rev() still put you at the end? And if so, then how would you construct a reversed iterator at an arbitrary point?

So, another possibility I've thought about is to add our own reverse() or switch_direction() method, or something hopefully not too confusingly similar in name to rev(). Instead of switching to a secondary iterator at the end of the rope (like rev() does) it would just change the direction of the iterator in its current position.

Do you have any thoughts about any of that? I definitely want to make sure you get what you need out of this. But I think some of these issues need to be thought through. Maybe it's worth opening a new issue for that.

cessen commented 4 years ago

@tadeokondrak I created a new issue (#31) to discuss the appropriate design for reversing iterators. Feel free to leave any feedback there!