cessen / ropey

A utf8 text rope for manipulating and editing large texts.
MIT License
1.04k stars 46 forks source link

`TextInfo` in `Chunks` API #75

Open truchi opened 1 year ago

truchi commented 1 year ago

Hello cessen,

How do you traverse the Chunks and get the byte/char/line index of those chunks? That would be awesome if Chunks: Iterator<Item = (TextInfo, &str)>!

Is this desirable/achievable? Do you think it's a good first issue?

Thanks

cessen commented 1 year ago

Hi truchi,

It's certainly not a bad idea. But I'm curious what your use case is? I'm trying to avoid adding more surface area to Ropey's already (IMO) too-large API. So without a specific concrete use case, I'm hesitant to add this.

truchi commented 1 year ago

Hi!

What I really wish to see in ropey is a unified Cursor API that works well with tree-sitter:

but that would be a complete overhaul of the existing API (and surely you did your thing for good reasons).

So I guess the easiest thing we can do to get close is to give the TextInfo (or some public variant) of the iterated chunk in Chunks. If I understand correctly the code knows this, preventing users to use str_utils::* to recompute it. (Note that I only read the tree part of the code and I wouldn't say I deeply understand it!)

First example that comes to mind is (vim-like) text movement (10j, ...). More generally it would feel closer to std, e.g. char_indices() (if it gave char len instead of offset, but either way works the same).

Thanks for your interest!

cessen commented 1 year ago

What I really wish to see in ropey is a unified Cursor API that works well with tree-sitter:

Okay, I think I understand. A unified cursor API itself definitely doesn't belong in Ropey (Ropey is trying to be lower-level than that). But yeah, a TextInfo-yielding Chunks iterator could make it a little easier to build one yourself on top of Ropey. So that's the way to go if we want to do that.

Having said that, there are a couple of things possibly worth pointing out here:

  1. You can already build a universal cursor on top of the existing Chunks iterator. You just count and track the things you care about in each chunk yourself. The functions in ropey::str_utils provide most of the facilities needed to do that. The only drawback would be that if you have really long lines, such that they span multiple chunks, a TextInfo-based implementation would allow you to skip chunks quickly to find the next one with a line break. But in the common case (more normal-sized lines) you'll need to scan nearly every chunk anyway.
  2. If tree-sitter is your use case, I don't think you need a cursor API at all...? For example, Helix is both tree-sitter based and uses Ropey as its internal text data structure. I'm not super familiar with tree-sitter, but I work pretty closely with the Helix folks and haven't heard of them having any problems getting Ropey and tree-sitter to play well with each other. But maybe I'm still missing something about your particular use case.

In any case, at this point providing a TextInfo yielding chunks iterator would need to be done in one of three ways:

  1. Add it to the existing Chunks iterator as part of next()/prev(), breaking backwards compatibility.
  2. Add it to the existing Chunks iterator as separate methods, which would be awkward among other things.
  3. Add a new iterator type, which would notably expand the API surface area, and be another thing to maintain, and just generally feels a bit messy to me.

Of these options, 1 is what I'd ideally like to do. But that will need to wait for a future Ropey 2.0, and I'm not ready to do that yet because there are a bunch of other things I want to do with Ropey 2.0 as well, so it's a big project.

I do think the idea is good, so it's certainly going on the list of things to do in Ropey 2.0, at the very least. But I guess I'm just not convinced that the benefits are meaningful enough to justify options 2 or 3 in Ropey 1.x.

truchi commented 1 year ago

Hello again,

Thanks for your feedback!

A TextInfo-yielding Chunks iterator could make it a little easier

I'm glad all of this make sense!

  1. ... count and track the things you care about ...

I realized while doing my own toy rope with dream cursor api that line/column tracking might be extra cost if actually unused by caller... They are ways to work around that but it indeed gets specific to use cases and yeah at this point a low level rope might just give indexing to build upon.

But in the common case...

True. I overestimate the performance impact of things! See below

  1. tree-sitter, Helix, your particular use case

I'm not Helix grade yet :) tree-sitter wants you to give its tree edits with both byte and line/col indexes. I was looking for the brrrrrest way to do that, but as you mentioned in #27 this is never going to be the bottleneck in a text app.

1 is what I'd ideally like to do ... [no] 2 or 3 in Ropey 1.x

Agree about 1 & 3. Chunks::info() -> TextInfo or similar does not feel awkward to me. Wait! I just realized it could also be possible to have Chunks::offset() -> TextInfo, right?

Ropey 2.0

😍 That's exciting news!

By the way, is ropey a "real" rope i.e. "the sum of the lengths of all the leaves in its left subtree"? How does that work with 4 children?

Thank you!

Edit: Oh, and LSP uses { start: { line, character }, end: { line, character } } ranges.

cessen commented 1 year ago

tree-sitter wants you to give its tree edits with both byte and line/col indexes.

For submitting edits, I don't see how an iterator over the text would be useful...? At least, in an editor like Helix it just sends the edits as it makes them, which seems like the right approach to me.

Is the use case here the initial sending of the text buffer, before any edits?

line/col indexes. [...] Oh, and LSP uses { start: { line, character }, end: { line, character } } ranges.

As an aside: it continues to baffle me how many popular APIs there are that require line/col offsets rather than just e.g. byte or code unit offsets from the beginning of the text. The latter provides all the information needed for text and edit exchange. And all the former does is cause compatibility problems between software that doesn't agree on what counts as lines and line breaks. Providing the option to specify text offsets by line/col is fine, of course. But requiring line/col offsets only causes problems.

Both LSP and Treesitter were designed and built by people that really should have known better. Alas...

but it indeed gets specific to use cases and yeah at this point a low level rope might just give indexing to build upon.

This is a lesson I learned earlier in Ropey's development (pre 1.0) when I tried to directly support grapheme segmentation. And IMO it's generally a good philosophy for library development: rather than trying to handle every use case directly, instead try to provide a good set of foundations that other people can build on top of.

And that's a big part of why the chunk-related APIs are exposed in Ropey at all: it's a relatively small set of APIs that provide a lot of power for users of Ropey to build on top of. And I'd like to continue in that spirit: trying to keep the API as small as it reasonably can be, but designed well to allow people to use it in flexible ways with some extra code on their side.

Chunks::info() -> TextInfo or similar does not feel awkward to me.

Ah, that's a good point. I'll think about that!

By the way, is ropey a "real" rope i.e. "the sum of the lengths of all the leaves in its left subtree"?

I think so? If a node has four children, and you want the info up to the beginning of the fourth child, you just sum the info of all three children to its left, plus whatever info came from the level above.

I can't imagine any reasonable definition of a rope limiting it to binary trees. Binary trees are simpler, of course, so for pedagogical reasons they're usually taught that way. But for modern hardware it's much more efficient to implement them with high-arity trees and a careful memory layout.

That's exciting news!

Just to be clear: Ropey 2.0 is completely hypothetical at this point. It's something I'd like to do eventually, but I have no estimate as to when that might be. It could be ten years from now, for all I know.

stevefan1999-personal commented 1 year ago

How do we define char_indices? I want to see if we could use this with Nom, I'm trying to implement the traits but I find it quite difficult

cessen commented 1 year ago

@stevefan1999-personal I don't understand what you're asking. What does nom have to do with Ropey? And what traits are you talking about? Is this even on-topic for this issue (it doesn't seem like it)?

stevefan1999-personal commented 1 year ago

@stevefan1999-personal I don't understand what you're asking. What does nom have to do with Ropey? And what traits are you talking about? Is this even on-topic for this issue (it doesn't seem like it)?

Maybe this is a new issue? But I'm not sure if a rope data structure can be treated as a buffered byte stream in the first place.

cessen commented 1 year ago

@stevefan1999-personal I still have no idea what you're talking about. Maybe it is another issue, but it's hard to tell when you don't explain what you're trying to do. All I know right now is that you're trying to do something (no idea what) with nom and Ropey, and I guess that some kind of stream is involved? But you've provided no context for me to be able to understand what you're talking about.

stevefan1999-personal commented 1 year ago

@cessen Yep. I was thinking about the preprocessor thing a while ago, but I think I figured out another way to do token substitution without using rope, but I would still like to see if Ropey can support nom.

cessen commented 1 year ago

@stevefan1999-personal I still don't understand what you're talking about, and you continue to not provide any additional context or explanation. But as far as I can tell, this seems unrelated to this issue, so I'm marking the comments as off-topic.

If you have a real, concrete issue, feel free to open up another issue. But if you do, please take the time to provide full context and fully explain what you're trying to do.