Anders429 / word_filter

A Word Filter for filtering text.
Apache License 2.0
1 stars 0 forks source link

Repetition Priority #79

Closed Anders429 closed 3 years ago

Anders429 commented 3 years ago

When handling repetitions, transitions along normal paths should be given priority. For example, a WordFilter over the word "foo" can match the input "fooo" in two different ways: the two letter 'o's can be taken first, or one 'o', the repetition, and then the second 'o' can be taken.

This presents a couple of problems. The main one is in trying to resolve #78. The input "foo or bar" (from that issue's example) becomes impossible to detect correctly, because of the ambiguity in repetition resolution. Another issue is ID blowup on inputs with an arbitrary amount of repeated characters (e.g. "fooooooooooo...").

There should be some sort of priority in traversing repetitions. A repetition should not be traversed if there is a valid c_transition that can be traversed first. Care needs to be taken to make sure that the valid c_transition resolves to a state that has a take_repetition flag set (otherwise it would be a larger grapheme). This way, both of the above issues can be resolved.

Anders429 commented 3 years ago

This wasn't actually too hard to solve, although it unfortunately adds more complexity to the already-monstrous State::transitions() method. While traversing along a char, I've added in a "look-ahead" check before creating a repetition transition. If the next state also has an edge on the same char, the next state has the into_separator flag, and the next-next state (along the same char path) has the take_separator flag, then we do not create the repetition transition.

This works because it ensures that the repetition can still happen in the future, which means that we can defer it now. This also fixes #78, allowing the test case there to work, since the repetitions are handled last.

Side note: it may be better to check that the next state's into_separator flag matches the current state's, rather than just checking that it's on (although it should be on if we are entering a separator at all), and to check that the next-next state's take_separator flag matches the next state's take_separator flag, just to ensure that the state's are indeed equal. This really shouldn't matter with the current state of how the PDA is generated, but it may matter in the future, so I'll leave this note here just in case problems arise later, so that this solution can be attempted.