const construction - Githubissues

Anders429 commented 3 years ago

WordFilters are constructed at runtime, which is suboptimal due to the immutable nature of a WordFilter. Most users will want a static WordFilter, which is only doable with something like once_cell or lazy_static and still requires runtime initialization.

The main thing holding WordFilterBuilder::build() from being const` is the heap allocations that are required. There is some effort for const heap allocation, but it is very far off and not something I would like to sit around and wait for.

An alternative is to take the same approach that phf does, namely compile-time construction using either phf_codegen (in a build.rs file) or phf_macros (as a procedural macro in a source file). This would require changing the entire structure of a WordFilter (although I think accessing it through its query methods can remain mostly untouched, as long as Walker can interface with the new structure). The following structure would work, I think:

enum Type {
    Standard,
    Match(&'static str),
    Exception(&'static str),
    Return,
}

struct Node {
    r#type: Type,
    children: fn(char) -> &'static Node, // A closure with a generated match statement.
    alias_subgraphs: &'static [(&'static Node, &'static Node)],
    grapheme_subgraphs: &'static [(&'static Node, &'static Node)],

}

pub enum RepeatedCharacterMatchMode {
    AllowRepeatedCharacters,
    DisallowRepeatedCharacters,
}

pub struct WordFilter {
    root: Node,
    separator_root: Node,
    nodes: &'static [Node],
    repeated_character_match_mode: RepeatedCharacterMatchMode,
    censor: fn(&str) -> String,
}

I believe the static self-references should work fine (my limited testing of them on the Rust playground indicated it would). This allows for all of the Nodes to be stored on the stack instead of the heap, as they currently are. This is ideal, since we don't ever need to dynamically manage Nodes anyway.

The downside is that construction of a WordFilter will not be able to be done dynamically at all. If any user wants to create a WordFilter based on some variables (perhaps a user config file or something), it won't be possible. However, I don't see any way around this, as the Nodes need to be able to reference each other, perhaps in circuitous ways, and that can't be done at both compile and runtime.

One more note about implementation: the Nodes in WordFilter could be stored in a fixed-sized array, rather than a slice, using const generics. I'm not sure if that has any sort of benefit whatsoever, as I would assume a static slice would just be optimized anyway, but it is worth looking into.

Anders429 commented 3 years ago

One more note about implementation: the Nodes in WordFilter could be stored in a fixed-sized array, rather than a slice, using const generics. I'm not sure if that has any sort of benefit whatsoever, as I would assume a static slice would just be optimized anyway, but it is worth looking into.

Looks like storing a fixed-sized array saves from having to store the length, since it is known at compile time. Given that hashbrown will still be needed, and that the newest version has a pretty high MSRV, it might be worth raising the MSRV to allow for the const generics.

In terms of code generation vs. procedural macro, I prefer code generation in a build.rs file. I'm going to work on that first. Maybe the proc_macro will be easy to do once I get all the details down, at which point both can be offered via separate crates. Borrowing from phf's structure, I see the crate structure being:

word_filter: Contains the actual struct definitions. All users will depend on this crate at minimum.
word_filter_codegen: The logic for generating the WordFilter definition at compile-time via a build.rs file.
word_filter_macro: The logic for generating the `WordFilter`` definition at compile-time via a proc_macro.
word_filter_shared: Possibly any shared logic.

Anders429 commented 3 years ago

After thinking about it all day today, I'm coming closer to a concrete solution. A proc_macro likely won't work in other situations where code generation also will not work, due to the self-referencing that will still be occurring. An inline builder with unsafe trickery on the references will not be practical either, due to the user needing to know the size of the WordFilter's node list before it is built, which can be difficult to determine. Therefore, I'm only going to focus on code generation for now.

The things needed only for code generation are:

censor module
Some kind of Code Builder struct.

The generated code will be written in structs and enums which are:

WordFilter and RepeatedCharacterMatchMode
Node and Type

And the usage of that code will involve traversing through those Nodes using:

Walker and Status.
The WordFilter methods.

Unfortunately, Node and Type will need to be made public for the code generation to work. Also, all fields of Node and WordFilter will need to be made public. Those can be hidden (as they are in phf), with a note saying not to ever use them and that they aren't guaranteed in the API stability, and that's about the best we can really do.

The question still remains: should the code generation be in a separate crate? Or in the same crate?

Anders429 commented 3 years ago

When constructing WordFilters in a build.rs, specifying the censor is a bit difficult, since there isn't a simple way to map closures to string representations of themselves. It might make things a lot easier if users were to specify the censor closure on calls to the censor() method.

Anders429 commented 3 years ago

After experimenting with this a bit, I've determined a couple of things:

const generics are required for this to work, due to the evaluation of self-references. The compiler can't evaluate self-references into a slice (even if it is a static slice), but it can do it with a fixed-sized static array. This means that creating a WordFilter at run-time is a no-go, and it will only be doable at compile time (which is fine, IMO).
It is best to separate the code generation and the resulting structs into two separate crates. One will be depended upon at build-time, the other depended upon in the main dependency list. This is the pattern that many crates use, not just phf. For example, prost and build-info have their associated prost-build and built-info-build crates.
Aside from the code generation, basically the rest of the crate (meaning the walking) functions identically to how it did before.

For now, I have also dropped the RepeatedCharacterMatchMode setting. I'm not sure how it would best fit in here, since either way doesn't really change the internal structure of the WordFilter. I don't think many users would want to disable repeated character matching anyway, so for now I'm making it be the only mode available.

Anders429 / word_filter

const construction #37