BurntSushi / ucd-generate

A command line tool to generate Unicode tables as source code.
Apache License 2.0

Discussion: Emitting code to search tables with complex layouts #39

Open thomcc opened 4 years ago

thomcc commented 4 years ago

Right now, some tables are too complex for it to be reasonable to ask users to write the table search code themselves.

Currently, the model we follow for this is to have the table depend on an external crate, for example ucd-trie or fst. Let's ignore fst, because I have only a surface-level understanding of it (I really should get around to reading the blog post on it...).

Anyway, the cases I care about are much closer in spirit to TrieSet:

Accessing the data in the trie requires a very specific set of undocumented operations. Even if it were documented, it's fiddly enough that we don't want users to have to write these lines themselves (even asking them to copy/paste them in would imply that they should understand it and are taking over maintenance of it).
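To make the "fiddly" point concrete, here's an illustrative sketch of the kind of multi-level lookup involved. This is not ucd-trie's actual layout (which is more involved); it's the same idea in miniature, with made-up data: deduplicated 64-bit leaf chunks plus an index mapping each 64-code-point block to a chunk.

```rust
// Hypothetical layout, not ucd-trie's real one: `LEAVES` holds
// deduplicated 64-bit chunks of the membership bitmap, and `INDEX`
// maps each 64-code-point block to one of those chunks.
const INDEX: &[u8] = &[0, 1];
const LEAVES: &[u64] = &[0b101, 0]; // block 0: code points 0 and 2 are members

fn trie_contains(index: &[u8], leaves: &[u64], cp: u32) -> bool {
    let block = (cp >> 6) as usize; // which 64-code-point block?
    match index.get(block) {
        // then test the bit for `cp` within that block's chunk
        Some(&leaf) => leaves[leaf as usize] & (1 << (cp & 63)) != 0,
        None => false,
    }
}
```

Each line is trivial, but getting the shifts and masks right (and keeping them in sync with whatever the generator emitted) is exactly the maintenance burden we don't want to hand to users.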

We'd also like to think of the trie format as subject to change... (even though adding the statics essentially freezes it, as I'll get into). So we just ask them to take a dep, and ensure it can be as close to zero-cost as possible (#[no_std], functions inlined, etc.) so that nobody really has a reason not to use it.

That said, this approach has downsides:

Problems with the separate-crate approach

  1. The main problem is that it "freezes" the format of the trie -- any update that doesn't maintain 100% compatibility with existing tries must be a breaking change, or come in a TrieSet2 type or something. This might be acceptable in the case of the trie (IMO it's not -- it's not as though the trie is impossible to improve upon), but I don't think it's a good solution overall.

  2. Users have to include ucd-trie as a dependency, and that dependency brings in a single 18-line function (well, and a type definition).

    This is somewhat petty of me, but there are fairly few cases when I'm comfortable bringing in an external dependency for an 18 line function. Given that bstr has a line about "pushing back against the micro-crate ecosystem" in the readme, I suspect you can at least see my reasoning there.

    That said, the downsides to this (aside from introducing a dependency) are subtle, and it's not exactly clear what else to do -- using the language-provided method of code sharing for this is clean and obvious, if nothing else.

Anyway, the first point is what I care more about (or at least, I'd just suck up the second point if that were it).

In particular, I'd like to introduce new formats which are not frozen and can change without worrying about compatibility with an external crate. Specifically (with most of these, I'd want the exposed data structure to be totally opaque, or potentially just expose a single function):

  1. libcore-style bitset and skiplist, as discussed in #30.
  2. re2-style range-based cyclic case-folding tables: https://github.com/google/re2/blob/master/re2/unicode_casefold.h.
  3. Apply compression algorithms to some tables, possibly only rarely used ones (for example, the name table looks like it could compress well if massaged).
    • In practice this would be somewhat customized, since most off-the-shelf lossless compressors aren't allowed to modify the input to compress better -- we could, so long as we can also emit code to reverse that modification if needed.
  4. Better trie output
  5. Fully flattened versions of some tables, where any slice or str is represented as indices into a big shared string / shared array. (The property value tables in particular practically have a bull's-eye on their back for this.)
  6. ...
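As a concrete illustration of item 2, here's a rough Rust sketch of the re2-style idea. All the names and table contents below are made up for illustration (re2's real tables also handle longer fold orbits, even/odd packing across ranges, etc.): each entry covers a range of code points and says how to step to the next code point in its folding orbit.

```rust
/// How to compute the next code point in a fold orbit (illustrative).
#[derive(Clone, Copy)]
#[allow(dead_code)]
enum Delta {
    /// Add a fixed offset.
    Fixed(i32),
    /// Even code points map up by one, odd ones down by one
    /// (covers runs where upper/lower pairs are packed alternately).
    EvenOdd,
}

struct CaseFold {
    lo: u32,
    hi: u32,
    delta: Delta,
}

// A tiny hand-written table covering ASCII letters only.
const ASCII_FOLD: &[CaseFold] = &[
    CaseFold { lo: 0x41, hi: 0x5A, delta: Delta::Fixed(32) },  // 'A'..='Z' -> lowercase
    CaseFold { lo: 0x61, hi: 0x7A, delta: Delta::Fixed(-32) }, // 'a'..='z' -> uppercase
];

/// Step to the next code point in `cp`'s folding orbit, if any.
fn apply_fold(table: &[CaseFold], cp: u32) -> Option<u32> {
    let i = table
        .binary_search_by(|e| {
            if e.hi < cp {
                std::cmp::Ordering::Less
            } else if e.lo > cp {
                std::cmp::Ordering::Greater
            } else {
                std::cmp::Ordering::Equal
            }
        })
        .ok()?;
    Some(match table[i].delta {
        Delta::Fixed(d) => (cp as i32 + d) as u32,
        Delta::EvenOdd => if cp % 2 == 0 { cp + 1 } else { cp - 1 },
    })
}
```

The appeal is that the table stays tiny (one entry per run of ranges with the same delta) while the lookup stays a plain binary search plus a small amount of arithmetic.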

Even some of the current formats are getting complex (--split-ranges --separated-values isn't that hard to write code for, but it's complex enough that you probably need documentation to do so).

But even for simple formats, like the slice of (u32, u32), I've gotten reading it wrong more times than I want to admit![0] This may say more about me as a programmer than anything else, but I would definitely have used an option to spit out the table-searching code.
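For reference, the searcher for that simple format is only a few lines. This is just my sketch of what an emitted function could look like for a sorted slice of inclusive (u32, u32) ranges, not anything ucd-generate currently produces:

```rust
/// Membership test over a sorted, non-overlapping slice of *inclusive*
/// `(start, end)` ranges. The classic bug is comparing against the
/// wrong bound, or treating `end` as exclusive.
fn range_table_contains(table: &[(u32, u32)], cp: u32) -> bool {
    table
        .binary_search_by(|&(start, end)| {
            use std::cmp::Ordering;
            if end < cp {
                Ordering::Less // this range lies entirely below cp
            } else if start > cp {
                Ordering::Greater // entirely above cp
            } else {
                Ordering::Equal // start <= cp <= end
            }
        })
        .is_ok()
}

// A tiny White_Space-like fragment, for illustration only.
const WS: &[(u32, u32)] = &[(0x09, 0x0D), (0x20, 0x20), (0x85, 0x85)];
```

Trivial, and yet exactly the kind of thing that's easy to flip a comparison in when writing it by hand for the nth time.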

Potential solutions

There are basically a few options here.

1. Stick with the external crate approach.

2. Emit code for searching in the same file.

3. Have a dedicated option for emitting the shared code in its own module...

  1. 3a: ...and have the user be responsible for passing the table and search term into the functions in that module. (This requires that the table structure be defined using just tuples, slices, etc., and not require shared type definitions.)

  2. 3b: ...and have the user pass the name/path of the shared code as an argument when generating the table (e.g. emitting to unicode_common.rs and then passing --module-path 'super::unicode_common' to everything else).

The pros/cons for these two sub-options are similar.

4. many-table output

Change ucd-generate so that it can output as many tables as are needed in a single run of the command. Then, it would generate the shared code for exactly the set of table types which are needed, as well as all the tables.


This is of course not complete, there are also hybrids you could imagine... But it's getting long.

Opinions

In reverse order -- popping each of these off my mental stack:

Choice 4: multi-table output

I like the idea of this choice: generate only what's needed, and have a full understanding of the desired dataset, potentially opening up further optimizations.

And really, I wish we lived in a world where we already had this information... But that's not the world we live in, and I don't even know what this would look like in the CLI. How much code would have to change? Honestly, it doesn't seem worth it. I suspect cross-table optimizations would not be worth a major change like this anyway, and things that do prove useful can just be handled on a case-by-case basis with a new subcommand or flag or whatever.

Choice 3: New subcommand to emit code in separate module

This seems very fiddly. In particular, I think describing the set of all the things you want to search gives it similar problems to choice 4 (you need to describe the set of tables you're generating).

If we made the code generation bits more consistent and used that to be more principled about the CLI options, it wouldn't be that bad to add, but I don't think it has enough benefits to outweigh the complexity and downsides at the moment.

Choice 2: Emit code in same module.

This doesn't have many major downsides, but looks a little goofy. I think the concerns about code style are fine so long as what we emit passes default clippy lints and such.

If we do this, I suspect it's only a matter of time before someone asks for choice 3, but maybe things will be cleaner then and it will be easier to add. This wouldn't prevent that, or anything.

Choice 1: Status quo

If I liked this, I wouldn't have filed this issue.

Wrapping up

So yeah, number 2 is my preference. In practice, the changes here would be:

  1. Adding a flag to emit code that searches the tables for existing formats.
  2. Support this flag for new formats, potentially exclusively, if the format is unreasonable to access manually or is still subject to change.
  3. Add an option to emit --trie-sets which are standalone and don't depend on ucd-trie.

Thoughts?

(Note that a large part of the reason behind this is that I feel like just submitting a PR with one of those weird data structures and the code generated next to it would, at a minimum, raise eyebrows. I also feel like the current status is not ideal.)

BurntSushi commented 2 years ago

I think our sensibilities are pretty well aligned here. (2) is also kind of where I land. I also do not like forcing folks to bring in tiny micro-crates. That leads me to not want to make extremely complex formats, but that in turn obviously biases against optimizing the heck out of these tables. And I'm generally pretty amenable to doing what we can to optimize them, since they tend to be so widely used and foundational.

So yeah, I agree. If you wanted to go down this route, I'd support (2).

thomcc commented 2 years ago

Wow, it's great that you're active in this repo again. The timing almost couldn't be better, as I've been playing around with Unicode tables very recently (for libcore). Most of these ideas would absolutely require custom code to query the tables, though, as is the case for pretty much anything more complex than "binary search".

Many of my ideas probably ended up being non-ideal for libcore, where saving a few kilobytes on table size isn't worth a performance regression, since most of the tables it has are used by methods people actually use.

But for a case like regex-syntax they'd be... a lot more compelling: it has far more tables (almost a complete set), and many of them are pretty rarely needed. (Actually, the regex-syntax case offers a lot of opportunities that could significantly reduce the table size, but I'll file an issue about that when I have more time to elaborate.)

BurntSushi commented 2 years ago

Indeed, regex-syntax can very likely admit slower lookups in favor of lower memory usage. We do have to keep an eye on it somewhat, though, since any operation can very easily be repeated many times via repetition operators or for very large Unicode classes with lots of ranges. It's probably the case that even in those areas, regex compilation will dwarf whatever regex-syntax does, but we do need to at least keep an eye on it. (I am working on a new and more comprehensive benchmark suite for the regex crate, which should include things like "how long does it take to produce the HIR for this regex.")