Anders429 / word_filter

A Word Filter for filtering text.
Apache License 2.0
1 stars 0 forks source link

Refactor internal push-down automaton #44

Closed Anders429 closed 3 years ago

Anders429 commented 3 years ago

This really simplifies the internal structure of a WordFilter, eliminating the confusing multi-stack system and simplifying a bunch of stuff. The number of states produced in code generation is reduced due to the combination of return states.

Important changes to terminology: to match with push-down automaton terminology, nodes have been renamed to states and walkers have been renamed to instantaneous descriptions. This should hopefully make things less confusing for newcomers looking at the codebase.

This also provides a fix for #43, as it now only stores word strings, not exception strings.

It should be noted that this PR provides significant performance improvements for certain edge cases, including the case of having many separators in a row.

Anders429 commented 3 years ago

This currently results in some significantly long compile times. I have an idea to try out for reducing compile times, which involves adding another possible stack value type corresponding to repetition targets, which allows the number of stored alias references to be cut roughly in half.

codecov-commenter commented 3 years ago

Codecov Report

Merging #44 (3edd923) into 0.6.0 (2ee383e) will increase coverage by 1.74%. The diff coverage is 84.68%.

Impacted file tree graph

@@            Coverage Diff             @@
##            0.6.0      #44      +/-   ##
==========================================
+ Coverage   85.75%   87.50%   +1.74%     
==========================================
  Files           8        9       +1     
  Lines         744      560     -184     
==========================================
- Hits          638      490     -148     
+ Misses        106       70      -36     
Impacted Files Coverage Δ
src/censor.rs 100.00% <ø> (ø)
tests/integration.rs 100.00% <ø> (ø)
word_filter_codegen/src/state.rs 72.34% <72.34%> (ø)
src/pda.rs 81.81% <81.81%> (ø)
word_filter_codegen/src/pda.rs 85.83% <85.83%> (ø)
src/lib.rs 92.00% <93.02%> (+12.00%) :arrow_up:
word_filter_codegen/src/lib.rs 81.60% <100.00%> (-3.58%) :arrow_down:
word_filter_codegen/src/type.rs 100.00% <100.00%> (ø)
... and 2 more

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update 2ee383e...3edd923. Read the comment docs.

Anders429 commented 3 years ago

Update on the previous comment: in order for repetitions with varying aliases to work, aliases must be computed over repetitions as well. That means that states often have aliases in common with the previous state, essentially doubling the number of aliases that must be compiled. This has a significant impact on compile times, and removing these doubled aliases cuts it roughly in half.

However, without the doubled aliases, the censor_repeated_alias test doesn't pass, since the alias has no repeated path anymore. This must be avoided, since repeated aliases is an important feature.

An idea is to add a separate Target stack value, containing a &State with the state that the path must pass through, and adding that to the stack within an ε-transition back to the previous state. The new criteria for an accepting ID would be an empty stack and an accepting state type. However, this can cause ID blow-up in cases where there are many many separators between characters, since the repetition is handled every single step, causing exponential increase and terrible complexity.

A solution needs to be reached to prevent this blow-up but also reduce the number of aliases. I think the idea of making a repetition an ε-transition is the right way to go. There has to be some way to reduce blow-up though. It needs to make sure it isn't recursively adding repetitions, as well as ensure it isn't going down the wrong path for a long time.

For now, the current situation is fine, although larger graphs (with like 2000 states) take, like, 3 and a half minutes to compile, which isn't exactly the best user experience. It's also kinda wasteful on space.

Anders429 commented 3 years ago

The last commit should resolve the issue. A target state is pushed to the stack, and on the next transition, if the top stack value is a target, then the only paths that can be taken are to a state with no repetition (i.e. a grapheme internal state), a state that is equal to the target state, or an alias whose return state is equal to the target state.

This is equivalent to the previous method of calculating aliases over repetitions, but the benefit is that the aliases don't have to be stored twice. Compilation times are roughly cut in half as a result, which is great for bigger generated WordFilters.