Anders429 / word_filter

A Word Filter for filtering text.
Apache License 2.0
1 stars 0 forks source link

const construction #37

Closed Anders429 closed 3 years ago

Anders429 commented 3 years ago

WordFilters are constructed at runtime, which is suboptimal due to the immutable nature of a WordFilter. Most users will want a static WordFilter, which is only doable with something like once_cell or lazy_static and still requires runtime initialization.

The main thing holding WordFilterBuilder::build() from being const` is the heap allocations that are required. There is some effort for const heap allocation, but it is very far off and not something I would like to sit around and wait for.

An alternative is to take the same approach that phf does, namely compile-time construction using either phf_codegen (in a build.rs file) or phf_macros (as a procedural macro in a source file). This would require changing the entire structure of a WordFilter (although I think accessing it through its query methods can remain mostly untouched, as long as Walker can interface with the new structure). The following structure would work, I think:

enum Type {
    Standard,
    Match(&'static str),
    Exception(&'static str),
    Return,
}

struct Node {
    r#type: Type,
    children: fn(char) -> &'static Node, // A closure with a generated match statement.
    alias_subgraphs: &'static [(&'static Node, &'static Node)],
    grapheme_subgraphs: &'static [(&'static Node, &'static Node)],

}

pub enum RepeatedCharacterMatchMode {
    AllowRepeatedCharacters,
    DisallowRepeatedCharacters,
}

pub struct WordFilter {
    root: Node,
    separator_root: Node,
    nodes: &'static [Node],
    repeated_character_match_mode: RepeatedCharacterMatchMode,
    censor: fn(&str) -> String,
}

I believe the static self-references should work fine (my limited testing of them on the Rust playground indicated it would). This allows for all of the Nodes to be stored on the stack instead of the heap, as they currently are. This is ideal, since we don't ever need to dynamically manage Nodes anyway.

The downside is that construction of a WordFilter will not be able to be done dynamically at all. If any user wants to create a WordFilter based on some variables (perhaps a user config file or something), it won't be possible. However, I don't see any way around this, as the Nodes need to be able to reference each other, perhaps in circuitous ways, and that can't be done at both compile and runtime.

One more note about implementation: the Nodes in WordFilter could be stored in a fixed-sized array, rather than a slice, using const generics. I'm not sure if that has any sort of benefit whatsoever, as I would assume a static slice would just be optimized anyway, but it is worth looking into.

Anders429 commented 3 years ago

One more note about implementation: the Nodes in WordFilter could be stored in a fixed-sized array, rather than a slice, using const generics. I'm not sure if that has any sort of benefit whatsoever, as I would assume a static slice would just be optimized anyway, but it is worth looking into.

Looks like storing a fixed-sized array saves from having to store the length, since it is known at compile time. Given that hashbrown will still be needed, and that the newest version has a pretty high MSRV, it might be worth raising the MSRV to allow for the const generics.

In terms of code generation vs. procedural macro, I prefer code generation in a build.rs file. I'm going to work on that first. Maybe the proc_macro will be easy to do once I get all the details down, at which point both can be offered via separate crates. Borrowing from phf's structure, I see the crate structure being:

Anders429 commented 3 years ago

After thinking about it all day today, I'm coming closer to a concrete solution. A proc_macro likely won't work in other situations where code generation also will not work, due to the self-referencing that will still be occurring. An inline builder with unsafe trickery on the references will not be practical either, due to the user needing to know the size of the WordFilter's node list before it is built, which can be difficult to determine. Therefore, I'm only going to focus on code generation for now.

The things needed only for code generation are:

The generated code will be written in structs and enums which are:

And the usage of that code will involve traversing through those Nodes using:

Unfortunately, Node and Type will need to be made public for the code generation to work. Also, all fields of Node and WordFilter will need to be made public. Those can be hidden (as they are in phf), with a note saying not to ever use them and that they aren't guaranteed in the API stability, and that's about the best we can really do.

The question still remains: should the code generation be in a separate crate? Or in the same crate?

Anders429 commented 3 years ago

When constructing WordFilters in a build.rs, specifying the censor is a bit difficult, since there isn't a simple way to map closures to string representations of themselves. It might make things a lot easier if users were to specify the censor closure on calls to the censor() method.

Anders429 commented 3 years ago

After experimenting with this a bit, I've determined a couple of things:

For now, I have also dropped the RepeatedCharacterMatchMode setting. I'm not sure how it would best fit in here, since either way doesn't really change the internal structure of the WordFilter. I don't think many users would want to disable repeated character matching anyway, so for now I'm making it be the only mode available.