Closed noib3 closed 1 year ago
I can understand wanting this for efficiency reasons, but in general I'm trying to strike a balance between performance and a small, simple API with str_indices
. This is also why, for example, there aren't any functions for directly converting between char
s and lines, even though that could potentially be more efficient than converting from char
to byte and then byte to line.
So I guess the short answer is, unless this actually becomes a bottleneck for someone, I'd prefer to keep the API as-is and just have client code use two function calls. If it does become a bottleneck, though, please let me know!
This came up recently in a PR for crop that changes the leaves of the tree from String
s to gap buffers (the two slices that I mentioned are the left and right segments of each leaf), and not having this adds a 40-80% performance hit to the Lines
iterator and a few line-oriented APIs that I wanted to avoid, so it's a bit of a bottleneck.
With that said I also completely understand wanting to keep a clean, minimal API and I wouldn't expect you to change it just to accomodate my specific use case.
Thanks for the input, feel free to close this.
Ah, this is for crop
! Very cool. And yeah, up to 80% is a pretty hefty cost to pay, indeed.
Having said that, if I were doing this (a rope with gap buffers at the leaves) I would maintain a line break count of the left side so that the code can just immediately jump to the correct side of the gap without scanning the text. It's a little extra space, but not much (you should be able to do it with a u16
unless your leaves are huge), and I believe would improve the efficiency even over what you're proposing.
Yeah that would actually increase the performance of the APIs I mentioned, but I wanted to avoid that if possible.
It's not the extra space that concerns me, it's that a) it complicates the code a lot, and b) the left line break count has to be recomputed every time the gap moves to a new offset. I don't think that's worth the tradeoff.
Totally fair if you want to avoid it! There may well be complications I'm not aware of.
the left line break count has to be recomputed every time the gap moves to a new offset.
While that's true, you don't have to re-scan the entire left side, only the part of the text that was moved, and then add/subtract as needed. So the cost would be linear with the distance of the move. So, personally, I think the cost is pretty reasonable. And it makes queries (which are typically the more common operations) faster.
But indeed it is a trade off, not a win across the board.
While that's true, you don't have to re-scan the entire left side, only the part of the text that was moved, and then add/subtract as needed. So the cost would be linear with the distance of the move.
Yep yep that's how I would implement it if I were to do it. In fact it's not even O(len_moved)
, it's O(len_segment/2)
because if you know how many line breaks there are in each segment you can just recompute the shorter chunk of the side that's getting moved and get the other chunk by subtraction. I'll try to do a POC but I strongly suspect that that's gonna come up in places were I'd really prefer not to think about it.
Anyways, thanks a lot for considering it ;)
Would it be possible to make
lines::to_byte_idx()
return the number of line breaks if the line index is past the end of the slice?My specific use case is that I have two slices, for example
"a\nb"
and"\nc"
, and I need to get the byte offset of a line as if the two slices were contiguous, i.e."a\nb\nc"
. At the moment this can require a second call tocount_breaks()
: