apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0
2.65k stars 1.03k forks source link

RegExp::toAutomaton no longer minimizes #13706

Closed ChrisHegarty closed 1 month ago

ChrisHegarty commented 1 month ago

There are a number of optimization in Elasticsearch that depend upon the automaton from a RegExp being total - accepts all strings - [1] [2]. Changes in the upcoming Lucene 10, to not minimize automaton returned by RegExp, has broken the assumption that these optimisations were building upon. At least how they stand today, and I'm not sure how best to replicate the functionality in Lucene 10.

For example this is fine:

RegExp r = new RegExp("@");
Automaton a = r.toAutomaton();
assertTrue(a.isDeterministic());
assertTrue(Operations.isTotal(a));

, while this is not:

RegExp r = new RegExp(".*");
Automaton a = r.toAutomaton();
assertTrue(a.isDeterministic());
assertTrue(Operations.isTotal(a));  // <<< isTotal returns false

Without an API to minimise (since MinimizationOperations is now test-only), I'm not sure how to re-code such optimizations. Or if we should be attempting to provide our own minimize implementation. Or if RegExp should be returning a total automaton for .*?

[1] https://github.com/elastic/elasticsearch/blob/0426e1fbd5dbf1eb9dae07f9af3592569165f5de/x-pack/plugin/wildcard/src/main/java/org/elasticsearch/xpack/wildcard/mapper/WildcardFieldMapper.java#L383 [2] https://github.com/elastic/elasticsearch/blob/0426e1fbd5dbf1eb9dae07f9af3592569165f5de/x-pack/plugin/esql-core/src/main/java/org/elasticsearch/xpack/esql/core/expression/predicate/regex/AbstractStringPattern.java#L30

ChrisHegarty commented 1 month ago

@mikemccand @rmuir - your thoughts here would be helpful, since I'm less familiar with this area of code.

dweiss commented 1 month ago

Minimization is a sure way to prove an automaton accepts all input strings because then the isTotal check is trivial [1]. You could try to trace all possible transitions, starting from the root and a full character range and see if everything in that range is always accepted... Could be fun, implementation-wise.

Looking at the examples, I wonder if this has to be a strict optimization - maybe early checking for common regexp values (.*) would be sufficient and everything else would just run as an automaton (optimized or not)?

If this isn't sufficient then I think you'll have to restore the minimization algorithm on ES side.

[1] https://github.com/cs-au-dk/dk.brics.automaton/blob/master/src/dk/brics/automaton/BasicOperations.java#L583-L590

jpountz commented 1 month ago

In a similar vein, I wonder if RegExp could create more efficient automata out of the box. For instance, it looks like Operations#repeat could be optimized in the case when there is a single transition to directly create a minimal automaton, and this would in-turn make RegExp(".*") create an automaton that returns true for Operations#isTotal.

rmuir commented 1 month ago

@ChrisHegarty implementation of isTotal() method requires a minimal DFA. If the automaton is not minimal, it may return false but it should not create a problem.

This is the only place that isTotal() is called in lucene (see the comment): https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/util/automaton/CompiledAutomaton.java#L181

If you really need to minimize here, can you use something like this as a workaround? https://github.com/apache/lucene/blob/main/lucene/test-framework/src/java/org/apache/lucene/tests/util/automaton/AutomatonTestUtil.java#L338-L345

Sorry, I havent thought about this isTotal much to see if there is a more reasonable implementation, just need to think it over.

If we need to improve isTotal, it is definitely not necessary to minimize, e.g. the following only requires determinization + removal of dead states

boolean isTotal(Automaton a) {
  return sameLanguage(a, Automata.makeAnyString());
}
rmuir commented 1 month ago

This is just what i'm mulling over now, relaxing isTotal to no longer require a minimal DFA:

diff --git a/lucene/core/src/java/org/apache/lucene/util/automaton/Operations.java b/lucene/core/src/java/org/apache/lucene/util/automaton/Operations.java
index 2052b1c50bf..0de4ac013ee 100644
--- a/lucene/core/src/java/org/apache/lucene/util/automaton/Operations.java
+++ b/lucene/core/src/java/org/apache/lucene/util/automaton/Operations.java
@@ -864,15 +864,22 @@ public final class Operations {

   /**
    * Returns true if the given automaton accepts all strings for the specified min/max range of the
-   * alphabet. The automaton must be minimized.
+   * alphabet. The automaton must be deterministic with no transitions to dead states.
    */
   public static boolean isTotal(Automaton a, int minAlphabet, int maxAlphabet) {
-    if (a.isAccept(0) && a.getNumTransitions(0) == 1) {
+    // minimal case
+    if (a.getNumStates() == 1 && a.isAccept(0) && a.getNumTransitions(0) == 1) {
       Transition t = new Transition();
       a.getTransition(0, 0, t);
       return t.dest == 0 && t.min == minAlphabet && t.max == maxAlphabet;
     }
-    return false;
+    // deterministic case
+    Automaton a2 = new Automaton();
+    int s = a2.createState();
+    a2.setAccept(s, true);
+    a2.addTransition(s, s, minAlphabet, maxAlphabet);
+    a2.finishState();
+    return sameLanguage(a, a2);
   }

   /**

edit: struggles :)

ChrisHegarty commented 1 month ago

In a similar vein, I wonder if RegExp could create more efficient automata out of the box. For instance, it looks like Operations#repeat could be optimized in the case when there is a single transition to directly create a minimal automaton, and this would in-turn make RegExp(".*") create an automaton that returns true for Operations#isTotal.

I had a similar thought. Looking at the code it kinda looks a little tacky, but also kinda makes sone sense, e.g.

      case REGEXP_REPEAT:
+        if (exp1.kind == Kind.REGEXP_ANYCHAR && automaton_provider == null) {
+          return Automata.makeAnyString();
+        } else {
          a = Operations.repeat(exp1.toAutomaton(automata, automaton_provider));
+       }
        break;
rmuir commented 1 month ago

Here's a round two, to prevent any error on NFA or having transitions to dead states:

diff --git a/lucene/core/src/java/org/apache/lucene/util/automaton/Operations.java b/lucene/core/src/java/org/apache/lucene/util/automaton/Operations.java
index 2052b1c50bf..10103628fad 100644
--- a/lucene/core/src/java/org/apache/lucene/util/automaton/Operations.java
+++ b/lucene/core/src/java/org/apache/lucene/util/automaton/Operations.java
@@ -864,14 +864,25 @@ public final class Operations {

   /**
    * Returns true if the given automaton accepts all strings for the specified min/max range of the
-   * alphabet. The automaton must be minimized.
+   * alphabet. The automaton must be deterministic with no transitions to dead states.
    */
   public static boolean isTotal(Automaton a, int minAlphabet, int maxAlphabet) {
-    if (a.isAccept(0) && a.getNumTransitions(0) == 1) {
+    // minimal case
+    if (a.getNumStates() == 1 && a.isAccept(0) && a.getNumTransitions(0) == 1) {
       Transition t = new Transition();
       a.getTransition(0, 0, t);
       return t.dest == 0 && t.min == minAlphabet && t.max == maxAlphabet;
     }
+    // deterministic case
+    if (a.isDeterministic() && hasDeadStatesFromInitial(a) == false) {
+      Automaton a2 = new Automaton();
+      int s = a2.createState();
+      a2.setAccept(s, true);
+      a2.addTransition(s, s, minAlphabet, maxAlphabet);
+      a2.finishState();
+      return sameLanguage(a, a2);
+    }
+    // NFA, or has transitions to dead states, return false
     return false;
   }
ChrisHegarty commented 1 month ago

@rmuir Just skimming your update to isTotal, it looks good. I think that it will be more generally useful, given that we minimize less now.

Separately, I might also make sense to improve RegExp, as suggested earlier in this issue.

rmuir commented 1 month ago

I had a similar thought. Looking at the code it kinda looks a little tacky, but also kinda makes sone sense, e.g.

      case REGEXP_REPEAT:
+        if (exp1.kind == Kind.REGEXP_ANYCHAR && automaton_provider == null) {
+          return Automata.makeAnyString();
+        } else {
          a = Operations.repeat(exp1.toAutomaton(automata, automaton_provider));
+       }
        break;

let's fix the regexp parser first? It is easier to reason about and less scary than stuff like isTotal and subsetOf.

Previously, regexp parser was calling minimize() on every...parsing...step. I removed this because it is obviously not good. But if we have a simple fix to make it emit better automaton for practical uses, let's do it?

rmuir commented 1 month ago

@rmuir Just skimming your update to isTotal, it looks good. I think that it will be more generally useful, given that we minimize less now.

need to stare at it some more. I don't like that it uses some stuff such as sameLanguage which has a TODO to move to tests/test-framework and is currently only used by tests.

And since we are checking for "total", we definitely don't need to do subsetOf twice: subsetOf(a2, a1) && subsetOf(a1, a2).

and I don't like that subsetOf is quadratic to the number of states: it is heavy stuff. There is probably a simpler algorithm.

rmuir commented 1 month ago

@ChrisHegarty I created draft PR, but I am still not happy with it yet.

rmuir commented 1 month ago

See PR: #13707, I took a different approach which solves the practical problem without doing scary stuff.

   public static boolean isTotal(Automaton a, int minAlphabet, int maxAlphabet) {
+    // allows some "fuzziness" to detect non-minimal forms such as those from RegExp
     if (a.isAccept(0) && a.getNumTransitions(0) == 1) {
       Transition t = new Transition();
       a.getTransition(0, 0, t);
-      return t.dest == 0 && t.min == minAlphabet && t.max == maxAlphabet;
+      int state = t.dest;
+      if (t.min == minAlphabet
+          && t.max == maxAlphabet
+          && a.isAccept(state)
+          && a.getNumTransitions(state) == 1) {
+        a.getTransition(state, 0, t);
+        return t.dest == state && t.min == minAlphabet && t.max == maxAlphabet;
+      }
     }
     return false;
   }
dweiss commented 1 month ago

I like the brevity of using sameLanguage! :)

I keep trying to find a counterexample to the assertion that a deterministic, total automaton must accept full language in each state reachable from root (there may be more than one transition but they must cover the full minAlphabet..maxAlphabet range, always ending in an accepting state somewhere.

If so, it should be possible to implement isTotal as a full traversal of the automaton in O(num states)? So something like this would also return true:

      Automaton c =
          Operations.repeat(
              Operations.union(
                  Automata.makeCharRange(Character.MIN_CODE_POINT, 100),
                  Automata.makeCharRange(101, Character.MAX_CODE_POINT)));

      System.out.println(c.toDot());

image

rmuir commented 1 month ago

I like the sameLanguage too, but I don't like the potential quadratic cost, considering we currently expect the calculation to be fast, and it is called on every automaton. I think it should be avoided in production code?

As far as your counterexample, it is actually difficult to create such an automaton, you managed it with union! e,g, if you just create a state and add several ranges instead of one big range, they will be collapsed into one single range when you finishState: https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/util/automaton/Automaton.java#L232

That's why I thought, there is something to be said for a very simple, constant-time check that will be practical as opposed to perfect: it will work for the stuff coming out of regex parser, or for "normal" stuff coming from the api (e.g. repeat). for that it needs to check two states (or same state twice) instead of one.

But if you are able to implement it in linear time that solves all the cases, that would be great, let's do that instead.

rmuir commented 1 month ago

I think the "full traversal" suggested by Dawid here would be very fast. The annoying part is probably just the reachability (e.g. regex parser produces automatons with some unreachable states), but we have some helpers for that already in Operations?

dweiss commented 1 month ago

I like the sameLanguage too, but I don't like the potential quadratic cost, considering we currently expect the calculation to be fast, and it is called on every automaton. I think it should be avoided in production code?

I agree - I don't think it's a good practical replacement solution, but it's a very elegant theoretical one. :)

But if you are able to implement it in linear time that solves all the cases, that would be great, let's do that instead.

I think the relaxation patch is fine as a short first step - it doesn't claim to be optimal (PnP, as Mike loves to say). I'll add it to my todo list, it seems like a fun little project, although finding the time is difficult.

The annoying part is probably just the reachability (e.g. regex parser produces automatons with some unreachable states),

I don't think all states need to be considered - only those reachable from the initial state. Tracking which states have been checked already may add some overhead but even with this, it should be fast (enough)?

rmuir commented 1 month ago

I don't think all states need to be considered - only those reachable from the initial state. Tracking which states have been checked already may add some overhead but even with this, it should be fast (enough)?

Yes, this one is very important actually, you get 3 states with Operations.repeat(Automata.makeAnyChar()) and one is unreachable.

I think the relaxation patch is fine as a short first step - it doesn't claim to be optimal (PnP, as Mike loves to say). I'll add it to my todo list, it seems like a fun little project, although finding the time is difficult.

lemme try adding your test, that is helpful as I was having a tough time coming up with "varieties" to test. I will take a stab at it. It would be great to fix the javadoc to not require minimization to call this function.

dweiss commented 1 month ago

You could probably create an automaton containing states with an insane number of outgoing transitions, for example one transition for each character... then resolving that such a state actually covers the full min..max range, with no gaps, could be costly. The only thing I can think of is sorting transition ranges and checking for continuity (and min/max)... this may get expensive.

Whether such unrealistic automata can ever occur in reality (as a product of the set of operations we make available) is another question...

dweiss commented 1 month ago

Like so:

      Automaton c =
          Operations.repeat(
              Operations.union(
                  List.of(
                      Automata.makeCharRange(Character.MIN_CODE_POINT, 100),
                      Automata.makeChar(101),
                      Automata.makeChar(102),
                      Automata.makeChar(103),
                      Automata.makeCharRange(104, Character.MAX_CODE_POINT))));

image

rmuir commented 1 month ago

I don't think it is too bad because transitions are already sorted and collapsed for each state when you call finishState(). For such an automaton you paid the price at construction time :)

But when you "iterate transitions" in order (0..numTransitions) to resolve a state, you are walking them in sorted order.

rmuir commented 1 month ago

@dweiss i coded your idea up like this:

  /** 
   * Returns true if the given automaton accepts all strings for the specified min/max range of the
   * alphabet. 
   */   
  public static boolean isTotal(Automaton a, int minAlphabet, int maxAlphabet) {
    BitSet states = getLiveStates(a);
    Transition spare = new Transition();
    for (int state = states.nextSetBit(0); state >= 0; state = states.nextSetBit(state + 1)) {
      // all reachable states must be accept states
      if (a.isAccept(state) == false) return false;
      // all reachable states must contain transitions covering minAlphabet-maxAlphabet
      int previousLabel = minAlphabet - 1;
      for (int transition = 0; transition < a.getNumTransitions(state); transition++) {
        a.getTransition(state, transition, spare);
        // no gaps are allowed
        if (spare.min > previousLabel + 1) return false;
        previousLabel = spare.max;
      }
      if (previousLabel < maxAlphabet) return false;
      if (state == Integer.MAX_VALUE) {
        break; // or (state+1) would overflow
      }
    }
    // we've checked all the states, if its non-empty, its total
    return a.getNumStates() > 0;
  }

Only surprise was the last line, so the logic is:

rmuir commented 1 month ago

https://github.com/apache/lucene/pull/13707/commits/ce4cce05538f79418ba4616dda89755975617762

rmuir commented 1 month ago
  • automaton must not be empty

where "empty" means, at least one reachable state. Automaton can have all dead states, and that doesn't make it total :)

dweiss commented 1 month ago

Yes, I like it!

I had some time to think about it before I went to bed and this implementation is actually a direct rollout of the definition of accepted language equivalence for deterministic automata - just what you mentioned at the beginning. Two equivalent (deterministic) automata must accept the same set of symbols from any state reachable for any input starting at the initial state. The automaton we compare against just happens to be repeat(anyCharacter()), so in any reachable state of automaton A we compare against the only state in automaton B - a self-connected state accepting all symbols. Consistent with the conditions you mentioned.

I'm glad this worked out to be so elegant and thank you for the implementation.

ChrisHegarty commented 1 month ago

💙

mikemccand commented 1 month ago

and I don't like that subsetOf is quadratic to the number of states: it is heavy stuff. There is probably a simpler algorithm.

I am trying to review this PR, but got distracted / rat-holed into this statement lol:

Is subsetOf really quadratic? I know its javadoc says so, but I'm not convinced. Yes it has embedded for loops (one for number of transitions leaving n1, the other for number of transitions leaving n2), but that is basically doing a merge sort so its cost is linear O(max(n1, n2)).

In total I think its cost is actually O(totalTransitionCountInA1)? Anyway, this is a (fun) distraction ... I'll try to actually review the change ;)

mikemccand commented 1 month ago

I think the relaxation patch is fine as a short first step - it doesn't claim to be optimal (PnP, as Mike loves to say).

+1, heh.

mikemccand commented 1 month ago
  • automaton must not be empty

Ahh that last line was sneaky :)

rmuir commented 1 month ago

Is subsetOf really quadratic? I know its javadoc says so, but I'm not convinced.

I think this is only one of the scarier parts about it. The other scary part is that it may throw IllegalArgumentException, or in some cases AssertionError (if asserts are enabled, if not... UB?).

For these reasons too, I wanted to avoid its usage in something that gets called e.g. by CompiledAutomaton and proposed moving it to AutomatonTestUtil for test purposes only: https://github.com/apache/lucene/pull/13708

mikemccand commented 1 month ago

Is subsetOf really quadratic? I know its javadoc says so, but I'm not convinced.

I think this is only one of the scarier parts about it. The other scary part is that it may throw IllegalArgumentException, or in some cases AssertionError (if asserts are enabled, if not... UB?).

For these reasons too, I wanted to avoid its usage in something that gets called e.g. by CompiledAutomaton and proposed moving it to AutomatonTestUtil for test purposes only: #13708

Yeah +1 to move it to test only!

rmuir commented 1 month ago

Closed by #13707