apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0
2.63k stars 1.02k forks source link

TransitionAccessor for NFA: get transitions for a given state via random-access leads to wrong results. #12906

Closed Tony-X closed 9 months ago

Tony-X commented 10 months ago

Description

as part of https://github.com/apache/lucene/pull/12688,
I'm trying to develop an algorithm that can intersect FST and FSA efficiently. For FSA this means we can leverage the sorted transitions of a given state and perform binary search in order to advance to a target.

So the algorithm uses TransitionAccessor and there are three relevant APIs

  int initTransition(int state, Transition t);

  /** Iterate to the next transition after the provided one */
  void getNextTransition(Transition t);

  /**
   * Fill the provided {@link Transition} with the index'th transition leaving the specified state.
   */
  void getTransition(int state, int index, Transition t);

One could call initTransition followed by many getNextTransition to iterate the transitions one by one. Or in my case, Binary search needs to use getTransition to randomly access the middle point.

I debugged for hours and realized for a given test case my algo works as expected but the transitions I got were messed up. The original tests uses quite complicated automatons so I tried to find a small and simple enough to exhibit the behavior. Here is the test

  public void testAutomaton() {
    RegExp regExp = new RegExp("+*.", RegExp.NONE);
    Automaton a = regExp.toAutomaton();
    CompiledAutomaton compiledAutomaton = new CompiledAutomaton(a);
    System.out.println("isFinite: " + compiledAutomaton.finite);

    var byteRunnable = compiledAutomaton.getByteRunnable();
    var transitionAccessor = compiledAutomaton.getTransitionAccessor();
    // dfsAutomaton(byteRunnable, transitionAccessor, 0, "");
    // dumpTransitionsViaNext(byteRunnable, transitionAccessor, 0, new HashSet<>());
    dumpTransitionsViaRA(byteRunnable, transitionAccessor, 0, new HashSet<>());
  }

  void dumpTransitionsViaNext(ByteRunnable a, TransitionAccessor transitionAccessor, int currentState, Set<Integer> seenStates) {
    if (seenStates.contains(currentState)) {
      return;
    }

    seenStates.add(currentState);

    var t = new Transition();
    var numStates = transitionAccessor.initTransition(currentState, t);

    for (int i = 0; i < numStates; i++) {
      transitionAccessor.getNextTransition(t);
      System.out.println("At: src: " + t.source + " [" + t.min +", " + t.max + "] "
              + "dest: " + t.dest + " is dest accept: "
              + (a.isAccept(t.dest) ? "yes" : "no"));
      dumpTransitionsViaNext(a, transitionAccessor, t.dest, seenStates);
    }
  }

  void dumpTransitionsViaRA(ByteRunnable a, TransitionAccessor transitionAccessor, int currentState, Set<Integer> seenStates) {
    if (seenStates.contains(currentState)) {
      return;
    }

    seenStates.add(currentState);

    var t = new Transition();
    var numStates = transitionAccessor.initTransition(currentState, t);

    // transitionAccessor.getTransition(currentState, numStates - 1, t);
    for (int i = 0; i < numStates; i++) {
      transitionAccessor.getTransition(currentState, i, t);
      System.out.println("At: src: " + t.source + " [" + t.min +", " + t.max + "] "
              + "dest: " + t.dest + " is dest accept: "
              + (a.isAccept(t.dest) ? "yes" : "no"));
      dumpTransitionsViaRA(a, transitionAccessor, t.dest, seenStates);
    }
  }

dumpTransitionsViaNext gives expected transitions but dumpTransitionsViaRA produces a mess.

Via next
At: src: 0 arcIdx: 0  [0, 42] dest: 1 is dest accept: yes
At: src: 0 arcIdx: 1  [43, 43] dest: 2 is dest accept: yes
At: src: 2 arcIdx: 0  [0, 42] dest: 1 is dest accept: yes
At: src: 2 arcIdx: 1  [43, 43] dest: 2 is dest accept: yes
At: src: 2 arcIdx: 2  [44, 127] dest: 1 is dest accept: yes
At: src: 2 arcIdx: 3  [194, 223] dest: 7 is dest accept: no
At: src: 7 arcIdx: 0  [128, 142] dest: 1 is dest accept: yes
At: src: 7 arcIdx: 1  [143, 143] dest: 1 is dest accept: yes
At: src: 7 arcIdx: 2  [144, 190] dest: 1 is dest accept: yes
At: src: 7 arcIdx: 3  [191, 191] dest: 1 is dest accept: yes
At: src: 2 arcIdx: 4  [224, 239] dest: 8 is dest accept: no
At: src: 8 arcIdx: 0  [128, 142] dest: 11 is dest accept: no
At: src: 11 arcIdx: 0  [128, 142] dest: 1 is dest accept: yes
At: src: 11 arcIdx: 1  [143, 143] dest: 1 is dest accept: yes
At: src: 11 arcIdx: 2  [144, 190] dest: 1 is dest accept: yes
At: src: 11 arcIdx: 3  [191, 191] dest: 1 is dest accept: yes
At: src: 8 arcIdx: 1  [143, 143] dest: 11 is dest accept: no
At: src: 8 arcIdx: 2  [144, 190] dest: 11 is dest accept: no
At: src: 8 arcIdx: 3  [191, 191] dest: 11 is dest accept: no
At: src: 2 arcIdx: 5  [240, 243] dest: 9 is dest accept: no
At: src: 9 arcIdx: 0  [128, 142] dest: 12 is dest accept: no
At: src: 12 arcIdx: 0  [128, 142] dest: 13 is dest accept: no
At: src: 13 arcIdx: 0  [128, 142] dest: 1 is dest accept: yes
At: src: 13 arcIdx: 1  [143, 143] dest: 1 is dest accept: yes
At: src: 13 arcIdx: 2  [144, 190] dest: 1 is dest accept: yes
At: src: 13 arcIdx: 3  [191, 191] dest: 1 is dest accept: yes
At: src: 12 arcIdx: 1  [143, 143] dest: 13 is dest accept: no
At: src: 12 arcIdx: 2  [144, 190] dest: 13 is dest accept: no
At: src: 12 arcIdx: 3  [191, 191] dest: 13 is dest accept: no
At: src: 9 arcIdx: 1  [143, 143] dest: 12 is dest accept: no
At: src: 9 arcIdx: 2  [144, 190] dest: 12 is dest accept: no
At: src: 9 arcIdx: 3  [191, 191] dest: 12 is dest accept: no
At: src: 2 arcIdx: 6  [244, 244] dest: 10 is dest accept: no
At: src: 10 arcIdx: 0  [128, 142] dest: 14 is dest accept: no
At: src: 14 arcIdx: 0  [128, 142] dest: 16 is dest accept: no
At: src: 16 arcIdx: 0  [128, 142] dest: 1 is dest accept: yes
At: src: 16 arcIdx: 1  [143, 143] dest: 1 is dest accept: yes
At: src: 16 arcIdx: 2  [144, 190] dest: 1 is dest accept: yes
At: src: 16 arcIdx: 3  [191, 191] dest: 1 is dest accept: yes
At: src: 14 arcIdx: 1  [143, 143] dest: 16 is dest accept: no
At: src: 14 arcIdx: 2  [144, 190] dest: 16 is dest accept: no
At: src: 14 arcIdx: 3  [191, 191] dest: 16 is dest accept: no
At: src: 10 arcIdx: 1  [143, 143] dest: 15 is dest accept: no
At: src: 15 arcIdx: 0  [128, 142] dest: 17 is dest accept: no
At: src: 17 arcIdx: 0  [128, 142] dest: 1 is dest accept: yes
At: src: 17 arcIdx: 1  [143, 143] dest: 1 is dest accept: yes
At: src: 17 arcIdx: 2  [144, 190] dest: 1 is dest accept: yes
At: src: 17 arcIdx: 3  [191, 191] dest: 1 is dest accept: yes
At: src: 15 arcIdx: 1  [143, 143] dest: 17 is dest accept: no
At: src: 15 arcIdx: 2  [144, 190] dest: 17 is dest accept: no
At: src: 15 arcIdx: 3  [191, 191] dest: 18 is dest accept: no
At: src: 18 arcIdx: 0  [128, 142] dest: 1 is dest accept: yes
At: src: 18 arcIdx: 1  [143, 143] dest: 1 is dest accept: yes
At: src: 18 arcIdx: 2  [144, 190] dest: 1 is dest accept: yes
At: src: 18 arcIdx: 3  [191, 191] dest: 1 is dest accept: yes
At: src: 0 arcIdx: 2  [44, 127] dest: 1 is dest accept: yes
At: src: 0 arcIdx: 3  [194, 223] dest: 3 is dest accept: no
At: src: 3 arcIdx: 0  [128, 142] dest: 1 is dest accept: yes
At: src: 3 arcIdx: 1  [143, 143] dest: 1 is dest accept: yes
At: src: 3 arcIdx: 2  [144, 190] dest: 1 is dest accept: yes
At: src: 3 arcIdx: 3  [191, 191] dest: 1 is dest accept: yes
At: src: 0 arcIdx: 4  [224, 239] dest: 4 is dest accept: no
At: src: 4 arcIdx: 0  [128, 142] dest: 19 is dest accept: no
At: src: 19 arcIdx: 0  [128, 142] dest: 1 is dest accept: yes
At: src: 19 arcIdx: 1  [143, 143] dest: 1 is dest accept: yes
At: src: 19 arcIdx: 2  [144, 190] dest: 1 is dest accept: yes
At: src: 19 arcIdx: 3  [191, 191] dest: 1 is dest accept: yes
At: src: 4 arcIdx: 1  [143, 143] dest: 19 is dest accept: no
At: src: 4 arcIdx: 2  [144, 190] dest: 19 is dest accept: no
At: src: 4 arcIdx: 3  [191, 191] dest: 19 is dest accept: no
At: src: 0 arcIdx: 5  [240, 243] dest: 5 is dest accept: no
At: src: 5 arcIdx: 0  [128, 142] dest: 20 is dest accept: no
At: src: 20 arcIdx: 0  [128, 142] dest: 21 is dest accept: no
At: src: 21 arcIdx: 0  [128, 142] dest: 1 is dest accept: yes
At: src: 21 arcIdx: 1  [143, 143] dest: 1 is dest accept: yes
At: src: 21 arcIdx: 2  [144, 190] dest: 1 is dest accept: yes
At: src: 21 arcIdx: 3  [191, 191] dest: 1 is dest accept: yes
At: src: 20 arcIdx: 1  [143, 143] dest: 21 is dest accept: no
At: src: 20 arcIdx: 2  [144, 190] dest: 21 is dest accept: no
At: src: 20 arcIdx: 3  [191, 191] dest: 21 is dest accept: no
At: src: 5 arcIdx: 1  [143, 143] dest: 20 is dest accept: no
At: src: 5 arcIdx: 2  [144, 190] dest: 20 is dest accept: no
At: src: 5 arcIdx: 3  [191, 191] dest: 20 is dest accept: no
At: src: 0 arcIdx: 6  [244, 244] dest: 6 is dest accept: no
At: src: 6 arcIdx: 0  [128, 142] dest: 22 is dest accept: no
At: src: 22 arcIdx: 0  [128, 142] dest: 24 is dest accept: no
At: src: 24 arcIdx: 0  [128, 142] dest: 1 is dest accept: yes
At: src: 24 arcIdx: 1  [143, 143] dest: 1 is dest accept: yes
At: src: 24 arcIdx: 2  [144, 190] dest: 1 is dest accept: yes
At: src: 24 arcIdx: 3  [191, 191] dest: 1 is dest accept: yes
At: src: 22 arcIdx: 1  [143, 143] dest: 24 is dest accept: no
At: src: 22 arcIdx: 2  [144, 190] dest: 24 is dest accept: no
At: src: 22 arcIdx: 3  [191, 191] dest: 24 is dest accept: no
At: src: 6 arcIdx: 1  [143, 143] dest: 23 is dest accept: no
At: src: 23 arcIdx: 0  [128, 142] dest: 25 is dest accept: no
At: src: 25 arcIdx: 0  [128, 142] dest: 1 is dest accept: yes
At: src: 25 arcIdx: 1  [143, 143] dest: 1 is dest accept: yes
At: src: 25 arcIdx: 2  [144, 190] dest: 1 is dest accept: yes
At: src: 25 arcIdx: 3  [191, 191] dest: 1 is dest accept: yes
At: src: 23 arcIdx: 1  [143, 143] dest: 25 is dest accept: no
At: src: 23 arcIdx: 2  [144, 190] dest: 25 is dest accept: no
At: src: 23 arcIdx: 3  [191, 191] dest: 26 is dest accept: no
At: src: 26 arcIdx: 0  [128, 142] dest: 1 is dest accept: yes
At: src: 26 arcIdx: 1  [143, 143] dest: 1 is dest accept: yes
At: src: 26 arcIdx: 2  [144, 190] dest: 1 is dest accept: yes
At: src: 26 arcIdx: 3  [191, 191] dest: 1 is dest accept: yes

Via RA

isFinite: false
At: src: 0 arcIdx: 0  [0, 42] dest: 0 is dest accept: no
At: src: 0 arcIdx: 1  [43, 43] dest: 0 is dest accept: no
At: src: 0 arcIdx: 2  [44, 127] dest: 0 is dest accept: no
At: src: 0 arcIdx: 3  [194, 223] dest: 0 is dest accept: no
At: src: 0 arcIdx: 4  [224, 239] dest: 0 is dest accept: no
At: src: 0 arcIdx: 5  [240, 243] dest: 0 is dest accept: no
At: src: 0 arcIdx: 6  [244, 244] dest: 0 is dest accept: no

I shared this with @zhaih and he seemed to have an idea of the fix.

Version and environment details

No response

zhaih commented 10 months ago

@Tony-X Nice catch I have a PR to fix: #12909