Closed. sides-flow closed this issue 6 months ago.
This is definitely counter-intuitive, but it is the documented, expected behavior:
> The internals of regexp2 always operate on []rune so Index and Length data in a Match always reference a position in runes rather than bytes (even if the input was given as a string). This is a dramatic difference between regexp and regexp2. It's advisable to use the provided String() methods to avoid having to work with indices.
If I had to do it over again I'd have named them RuneIndex and RuneLength for clarity.
If you need the string (byte) indices for some reason then you'll need to convert the rune-based index/length to string offsets with a function like this:
// Requires: import "unicode/utf8"
// convertRuneIndexToStringIndex converts a rune-based index and length
// into the equivalent byte-based index and length within s.
func convertRuneIndexToStringIndex(s string, runeIndex, runeLength int) (stringIndex, stringLength int) {
	var curStrIdx, startIdx int
	// first get the start index
	for i := 0; i < runeIndex; i++ {
		_, size := utf8.DecodeRuneInString(s[curStrIdx:])
		curStrIdx += size
	}
	startIdx = curStrIdx
	// now get the length
	for i := 0; i < runeLength; i++ {
		_, size := utf8.DecodeRuneInString(s[curStrIdx:])
		curStrIdx += size
	}
	return startIdx, curStrIdx - startIdx
}
This naive version iterates the string every time, so if you need to convert these indices for the same input string many times and performance matters, I'd use FindRunesMatch so I only have to convert from string to []rune one time, and then use an index conversion function like this instead:
// Requires: import "unicode/utf8"
// convertRuneIndexToStringIndex is the []rune variant: it sums the UTF-8
// encoded length of each rune instead of decoding the string again.
func convertRuneIndexToStringIndex(r []rune, runeIndex, runeLength int) (stringIndex, stringLength int) {
	var curStrIdx, startIdx int
	// first get the start index
	for i := 0; i < runeIndex; i++ {
		curStrIdx += utf8.RuneLen(r[i])
	}
	startIdx = curStrIdx
	// now get the length
	for i := runeIndex; i < runeIndex+runeLength; i++ {
		curStrIdx += utf8.RuneLen(r[i])
	}
	return startIdx, curStrIdx - startIdx
}
Benchmarks on my machine show it's about twice as fast, and for short input strings it might be faster overall to just convert your string with []rune("string") and then use this version. The performance tradeoffs between them depend on how deep into the string you're searching, how long your string is, how many times you'll be searching, etc.
Hope this helps!
I was wondering if this could be done without additional processing (like converting the string to []rune or calculating the index manually).
But I was looking further into the code and saw that using FindStringMatch converts the string to []rune anyway, so I might just use FindRunesMatch.
Maybe you could add a flag to let the user choose which type of index and length they want (either rune-based or string-based). This should be backwards compatible.
Either way, I managed to solve my problem, thanks!
Description
When using FindStringMatch, the index in the match indicates the offset in the input as []rune, and not as string. Since non-ASCII Unicode characters take more than one byte in UTF-8, the index does not match the byte offset in the string.
Code to reproduce:
Expected output:
Actual output: