FindStringMatch returns wrong index when using unicode characters

sides-flow commented 6 months ago

Description

When using FindStringMatch the index in the match indicates the offset in the input as []rune, and not as string. Since unicode characters take 2 bytes, the index is incorrect.

Code to reproduce:


package main

import (
    "log"
    "fmt"

    "github.com/dlclark/regexp2"
)

const (
    input = "My first number is 1234. here is some unicode ’’ and my second number is 4567890."
)

var (
    numberRegex = regexp2.MustCompile(`[0-9]+`, regexp2.ExplicitCapture|regexp2.Compiled|regexp2.Singleline)
)

func main() {
    match, err := numberRegex.FindStringMatch(input)
    for i:=0;; match, err = numberRegex.FindNextMatch(match) {
        if err != nil {
            log.Panic(err)
        }
        if match == nil {
            break
        }
        i++

        fmt.Printf("my %d match is '%v'\n", i, input[match.Index : match.Index + match.Length])
    }
}

Expected output:

my 1 match is '1234'
my 2 match is '4567890'

Actual output:

my 1 match is '1234'
my 2 match is ' is 456'

dlclark commented 6 months ago

This is definitely counter-intuitive, but it is the documented, expected behavior:

The internals of regexp2 always operate on []rune so Index and Length data in a Match always reference a position in runes >rather than bytes (even if the input was given as a string). This is a dramatic difference between regexp and regexp2. It's >advisable to use the provided String() methods to avoid having to work with indices.

If I had to do it over again I'd have named them RuneIndex and RuneLength for clarity.

If you need the string indices for some reason then you'll need to convert the rune-based index/length to the string with a function like this:

func convertRuneIndexToStringIndex(s string, runeIndex, runeLength int) (stringIndex, stringLength int) {
    var curStrIdx, startIdx int

    // first get the start index
    for i := 0; i < runeIndex; i++ {
        _, size := utf8.DecodeRuneInString(s[curStrIdx:])
        curStrIdx += size
    }
    startIdx = curStrIdx

    // now get the length
    for i := 0; i < runeLength; i++ {
        _, size := utf8.DecodeRuneInString(s[curStrIdx:])
        curStrIdx += size
    }
    return startIdx, curStrIdx - startIdx
}

This naive version iterates the string every time, so if you were going to need performance to convert these indices for the same input string a bunch I'd use FindRunesMatch so I only have to convert from string to []rune one time and then use an index conversion function like this instead:

func convertRuneIndexToStringIndex(r []rune, runeIndex, runeLength int) (stringIndex, stringLength int) {
    var curStrIdx, startIdx int

    // first get the start index
    for i := 0; i < runeIndex; i++ {
        curStrIdx += utf8.RuneLen(r[i])
    }
    startIdx = curStrIdx

    // now get the length
    for i := runeIndex; i < runeIndex+runeLength; i++ {
        curStrIdx += utf8.RuneLen(r[i])
    }
    return startIdx, curStrIdx - startIdx
}

Benchmarks on my machine show it's about twice as fast and for short input strings it might be faster overall to just convert your string with []rune("string") and then use this version. The performance tradeoffs between them depend on how deep into the string you're searching, how long your string is, how many times you'll be searching, etc.

Hope this helps!

sides-flow commented 6 months ago

I was wondering if this could be done without additional processing (like converting the string to []rune or calculating the index manually). But I was looking further into the code and saw that using FindStringMatch converts the string to []rune anyways, so I might just use FindRunesMatch anyways... Maybe you could add a flag to let the user choose which type of index & length they want (either rune based or string based), this should be backwards compatible... Either way, I managed to solve my problem, thanks!

dlclark / regexp2