cessen / str_indices

Count and convert between various ways of indexing utf8 string slices.
Apache License 2.0
16 stars 5 forks source link

Making `lines::to_byte_idx()` also return the line count #19

Closed noib3 closed 1 year ago

noib3 commented 1 year ago

Would it be possible to make lines::to_byte_idx() return the number of line breaks if the line index is past the end of the slice?

My specific use case is that I have two slices, for example "a\nb" and "\nc", and I need to get the byte offset of a line as if the two slices were contiguous, i.e. "a\nb\nc". At the moment this can require a second call to count_breaks():

use str_indices::lines_lf;

let a = "a\nb";
let b = "\nc";
let line_offset = 2;

let mut byte_offset = lines_lf::to_byte_idx(a, line_offset);

if byte_offset == a.len() {
    // I'm assuming this was already tracked by `to_byte_idx()`, but it's not
    // returned so I need to compute it again.
    let line_breaks_in_a = lines_lf::count_breaks(a);

    byte_offset += lines_lf::to_byte_idx(b, line_offset - line_breaks_in_a);
}
cessen commented 1 year ago

I can understand wanting this for efficiency reasons, but in general I'm trying to strike a balance between performance and a small, simple API with str_indices. This is also why, for example, there aren't any functions for directly converting between chars and lines, even though that could potentially be more efficient than converting from char to byte and then byte to line.

So I guess the short answer is, unless this actually becomes a bottleneck for someone, I'd prefer to keep the API as-is and just have client code use two function calls. If it does become a bottleneck, though, please let me know!

noib3 commented 1 year ago

This came up recently in a PR for crop that changes the leaves of the tree from Strings to gap buffers (the two slices that I mentioned are the left and right segments of each leaf), and not having this adds a 40-80% performance hit to the Lines iterator and a few line-oriented APIs that I wanted to avoid, so it's a bit of a bottleneck.

With that said I also completely understand wanting to keep a clean, minimal API and I wouldn't expect you to change it just to accomodate my specific use case.

Thanks for the input, feel free to close this.

cessen commented 1 year ago

Ah, this is for crop! Very cool. And yeah, up to 80% is a pretty hefty cost to pay, indeed.

Having said that, if I were doing this (a rope with gap buffers at the leaves) I would maintain a line break count of the left side so that the code can just immediately jump to the correct side of the gap without scanning the text. It's a little extra space, but not much (you should be able to do it with a u16 unless your leaves are huge), and I believe would improve the efficiency even over what you're proposing.

noib3 commented 1 year ago

Yeah that would actually increase the performance of the APIs I mentioned, but I wanted to avoid that if possible.

It's not the extra space that concerns me, it's that a) it complicates the code a lot, and b) the left line break count has to be recomputed every time the gap moves to a new offset. I don't think that's worth the tradeoff.

cessen commented 1 year ago

Totally fair if you want to avoid it! There may well be complications I'm not aware of.

the left line break count has to be recomputed every time the gap moves to a new offset.

While that's true, you don't have to re-scan the entire left side, only the part of the text that was moved, and then add/subtract as needed. So the cost would be linear with the distance of the move. So, personally, I think the cost is pretty reasonable. And it makes queries (which are typically the more common operations) faster.

But indeed it is a trade off, not a win across the board.

noib3 commented 1 year ago

While that's true, you don't have to re-scan the entire left side, only the part of the text that was moved, and then add/subtract as needed. So the cost would be linear with the distance of the move.

Yep yep that's how I would implement it if I were to do it. In fact it's not even O(len_moved), it's O(len_segment/2) because if you know how many line breaks there are in each segment you can just recompute the shorter chunk of the side that's getting moved and get the other chunk by subtraction. I'll try to do a POC but I strongly suspect that that's gonna come up in places were I'd really prefer not to think about it.

Anyways, thanks a lot for considering it ;)