Closed Anders429 closed 3 years ago
One more note about implementation: the Nodes in WordFilter could be stored in a fixed-sized array, rather than a slice, using const generics. I'm not sure if that has any sort of benefit whatsoever, as I would assume a static slice would just be optimized anyway, but it is worth looking into.
Looks like storing a fixed-sized array saves from having to store the length, since it is known at compile time. Given that hashbrown
will still be needed, and that the newest version has a pretty high MSRV, it might be worth raising the MSRV to allow for the const generics.
In terms of code generation vs. procedural macro, I prefer code generation in a build.rs file. I'm going to work on that first. Maybe the proc_macro will be easy to do once I get all the details down, at which point both can be offered via separate crates. Borrowing from phf
's structure, I see the crate structure being:
word_filter
: Contains the actual struct definitions. All users will depend on this crate at minimum.word_filter_codegen
: The logic for generating the WordFilter
definition at compile-time via a build.rs file.word_filter_macro
: The logic for generating the `WordFilter`` definition at compile-time via a proc_macro.word_filter_shared
: Possibly any shared logic.After thinking about it all day today, I'm coming closer to a concrete solution. A proc_macro likely won't work in other situations where code generation also will not work, due to the self-referencing that will still be occurring. An inline builder with unsafe trickery on the references will not be practical either, due to the user needing to know the size of the WordFilter's node list before it is built, which can be difficult to determine. Therefore, I'm only going to focus on code generation for now.
The things needed only for code generation are:
censor
moduleThe generated code will be written in structs and enums which are:
WordFilter
and RepeatedCharacterMatchMode
Node
and Type
And the usage of that code will involve traversing through those Node
s using:
Walker
and Status
.WordFilter
methods.Unfortunately, Node
and Type
will need to be made public for the code generation to work. Also, all fields of Node
and WordFilter
will need to be made public. Those can be hidden (as they are in phf
), with a note saying not to ever use them and that they aren't guaranteed in the API stability, and that's about the best we can really do.
The question still remains: should the code generation be in a separate crate? Or in the same crate?
When constructing WordFilter
s in a build.rs, specifying the censor
is a bit difficult, since there isn't a simple way to map closures to string representations of themselves. It might make things a lot easier if users were to specify the censor
closure on calls to the censor()
method.
After experimenting with this a bit, I've determined a couple of things:
WordFilter
at run-time is a no-go, and it will only be doable at compile time (which is fine, IMO).phf
. For example, prost
and build-info
have their associated prost-build
and built-info-build
crates.For now, I have also dropped the RepeatedCharacterMatchMode
setting. I'm not sure how it would best fit in here, since either way doesn't really change the internal structure of the WordFilter
. I don't think many users would want to disable repeated character matching anyway, so for now I'm making it be the only mode available.
WordFilter
s are constructed at runtime, which is suboptimal due to the immutable nature of aWordFilter
. Most users will want a staticWordFilter
, which is only doable with something likeonce_cell
orlazy_static
and still requires runtime initialization.The main thing holding
WordFilterBuilder::build()
from being const` is the heap allocations that are required. There is some effort for const heap allocation, but it is very far off and not something I would like to sit around and wait for.An alternative is to take the same approach that
phf
does, namely compile-time construction using eitherphf_codegen
(in a build.rs file) orphf_macros
(as a procedural macro in a source file). This would require changing the entire structure of aWordFilter
(although I think accessing it through its query methods can remain mostly untouched, as long asWalker
can interface with the new structure). The following structure would work, I think:I believe the static self-references should work fine (my limited testing of them on the Rust playground indicated it would). This allows for all of the
Node
s to be stored on the stack instead of the heap, as they currently are. This is ideal, since we don't ever need to dynamically manageNode
s anyway.The downside is that construction of a
WordFilter
will not be able to be done dynamically at all. If any user wants to create aWordFilter
based on some variables (perhaps a user config file or something), it won't be possible. However, I don't see any way around this, as theNode
s need to be able to reference each other, perhaps in circuitous ways, and that can't be done at both compile and runtime.One more note about implementation: the
Node
s inWordFilter
could be stored in a fixed-sized array, rather than a slice, using const generics. I'm not sure if that has any sort of benefit whatsoever, as I would assume a static slice would just be optimized anyway, but it is worth looking into.