INL / BlackLab

Linguistic search for large annotated text corpora, based on Apache Lucene
http://inl.github.io/BlackLab/
Apache License 2.0

Reverse and forward matching use slightly different regex syntax #440

Open jan-niestadt opened 1 year ago

jan-niestadt commented 1 year ago

See e.g. https://lucene.apache.org/core/8_7_0/core/org/apache/lucene/util/automaton/RegExp.html#COMPLEMENT :

The reserved characters used in the (enabled) syntax must be escaped with backslash (\) or double-quotes ("..."). (In contrast to other regexp syntaxes, this is required also in character classes.)

Maybe we can use Lucene's regex engine for forward index matching as well? Otherwise we'd have to translate the regex to the other engine's syntax, which could be challenging.
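
To give an idea of what translating would involve, here is a minimal sketch that rewrites just one Lucene construct, the double-quoted literal string, into java.util.regex syntax. It assumes the other engine is java.util.regex, which this issue does not state, and the class and method names are made up for illustration. Every other difference (complement ~, intersection &, intervals <n-m>, the @ and # operators, escapes that mean different things in the two engines) is left untranslated.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of BlackLab or Lucene.
class LuceneQuoteTranslator {

    /** Rewrites Lucene-style "..." literals as \Q...\E literals; copies everything else unchanged. */
    static String translateQuotedLiterals(String luceneRegex) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < luceneRegex.length()) {
            char c = luceneRegex.charAt(i);
            if (c == '\\' && i + 1 < luceneRegex.length()) {
                // Backslash escape: copy both characters, so an escaped quote
                // is not mistaken for the start of a quoted literal
                out.append(c).append(luceneRegex.charAt(i + 1));
                i += 2;
            } else if (c == '"') {
                int close = luceneRegex.indexOf('"', i + 1);
                if (close < 0) {
                    throw new IllegalArgumentException("Unterminated quoted literal at index " + i);
                }
                // Pattern.quote makes the quoted text literal for java.util.regex
                out.append(Pattern.quote(luceneRegex.substring(i + 1, close)));
                i = close + 1;
            } else {
                out.append(c);
                i++;
            }
        }
        return out.toString();
    }
}
```

Even this one rule needs care around escapes, and it ignores character classes entirely, which suggests a complete translator would be considerably more work than reusing one engine on both sides.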

Not a huge issue in practice, but could in rare cases lead to baffling matching bugs...

If we want to enable optional features in Lucene's regex engine such as the complement operator ~, this becomes more of a problem. We've enabled this for relations matches now, but those never use the forward index.
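
A small illustration of how such a mismatch shows up, again assuming the forward side uses java.util.regex (an assumption; the class and method names are hypothetical). It only runs the java.util.regex side. The Lucene behaviour described in the comments is taken from the RegExp documentation linked above.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, not part of BlackLab.
class JavaRegexReservedChars {

    // Whole-string match with java.util.regex, the way a term regex is applied to one token
    static boolean javaMatches(String regex, String token) {
        return Pattern.matches(regex, token);
    }

    public static void main(String[] args) {
        // '~' is an ordinary character for java.util.regex, so "~a" matches only the token "~a".
        // Lucene's RegExp with the COMPLEMENT flag reads the same pattern as "anything except a".
        System.out.println(javaMatches("~a", "b"));   // false
        System.out.println(javaMatches("~a", "~a"));  // true

        // Double quotes are ordinary characters too; Lucene reads "a.b" as the literal string a.b.
        System.out.println(javaMatches("\"a.b\"", "a.b"));      // false
        System.out.println(javaMatches("\"a.b\"", "\"a.b\""));  // true
    }
}
```

So the same pattern string could accept a token on the reverse (Lucene) side and reject it on the forward side, which is the kind of baffling mismatch described above.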

jan-niestadt commented 1 year ago

ChatGPT suggested using Lucene's own regex engine for the matching. Its code did not work as given: it called classes and methods (TransitionIterator, numberOfGroups(), start(group), end(group)) that are not part of Lucene's automaton package, and it relied on capture groups, which Lucene's automata do not provide. The approach itself seems viable, though. A corrected but untested sketch against Lucene 8.x:


Lucene's regular expression support is provided through the RegExp class in the org.apache.lucene.util.automaton package. RegExp parses the pattern, toAutomaton() converts it to an automaton, and CharacterRunAutomaton runs that automaton against a whole string:

import org.apache.lucene.util.automaton.Automaton;
import org.apache.lucene.util.automaton.CharacterRunAutomaton;
import org.apache.lucene.util.automaton.RegExp;

public class LuceneRegexExample {

    public static void main(String[] args) {
        // Tokens to test, standing in for terms read from the forward index
        String[] tokens = { "quick", "brown", "fox" };

        // Pattern in Lucene syntax; '~' is the optional complement operator
        String pattern = "~(b.*)";  // any token that does not start with 'b'

        // Parse the pattern with the complement operator enabled
        RegExp re = new RegExp(pattern, RegExp.COMPLEMENT);

        // Convert the parsed expression to an automaton
        Automaton automaton = re.toAutomaton();

        // Wrap the automaton so it can be run against Java strings
        CharacterRunAutomaton matcher = new CharacterRunAutomaton(automaton);

        for (String token : tokens) {
            // run() accepts only if the whole token matches; there are no capture groups
            System.out.println(token + ": " + matcher.run(token));
        }
    }
}

In this example, RegExp parses the pattern with explicit syntax flags, so forward matching could use exactly the flags used on the index side and would interpret ~ and the escaping rules identically. run() only reports whether the complete token is accepted. It does not search inside a longer string or return groups, which should be enough for matching individual terms. The flag name and the determinization behaviour may differ in newer Lucene versions, so check the Javadoc for the version in use.