BurntSushi / ucd-generate

A command line tool to generate Unicode tables as source code.
Apache License 2.0

Discussion: Emitting code to search tables with complex layouts #39

Open thomcc opened 4 years ago

thomcc commented 4 years ago

Right now, some tables are too complex for it to be reasonable to ask users to write the table search code themselves.

Currently, the model we follow for this is to have the table depend on an external crate, for example ucd-trie or fst. Let's ignore fst, because I have only a surface-level understanding of it (I really should get around to reading the blog post on it...).

Anyway, the cases I care about are much closer in spirit to TrieSet:

Accessing the data in the trie requires a very specific set of undocumented operations. Even if it were documented, it's fiddly enough that we don't want users to have to write these lines themselves (even asking them to copy/paste them in would imply that they should understand it and are taking over maintenance of it).
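To make the "fiddly" point concrete, here's an illustrative sketch of the kind of multi-level lookup involved. This is not ucd-trie's actual layout (which is more involved); it's the same idea in miniature, with made-up data: deduplicated 64-bit leaf chunks plus an index mapping each 64-code-point block to a chunk.

```rust
// Hypothetical layout, not ucd-trie's real one: `LEAVES` holds
// deduplicated 64-bit chunks of the membership bitmap, and `INDEX`
// maps each 64-code-point block to one of those chunks.
const INDEX: &[u8] = &[0, 1];
const LEAVES: &[u64] = &[0b101, 0]; // block 0: code points 0 and 2 are members

fn trie_contains(index: &[u8], leaves: &[u64], cp: u32) -> bool {
    let block = (cp >> 6) as usize; // which 64-code-point block?
    match index.get(block) {
        // then test the bit for `cp` within that block's chunk
        Some(&leaf) => leaves[leaf as usize] & (1 << (cp & 63)) != 0,
        None => false,
    }
}
```

Each line is trivial, but getting the shifts and masks right (and keeping them in sync with whatever the generator emitted) is exactly the maintenance burden we don't want to hand to users.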

We'd also like to think of the trie format as subject to change... (even though adding the statics essentially freezes it, as I'll get into). So we just ask them to take a dep, and ensure it can be as close to zero-cost as possible (#[no_std], functions inlined, etc.) so that nobody really has a reason not to use it.

That said, this approach has downsides:

Problems with the separate-crate approach

  1. The main problem is that it "freezes" the format of the trie -- any update that doesn't maintain 100% compatibility with existing tries must be a breaking change, or come in a TrieSet2 type or something. This might be acceptable in the case of the trie (IMO it's not -- it's not as though the trie is impossible to improve upon), but I don't think it's a good solution overall.

  2. Users have to include ucd-trie as a dependency, and that dependency brings in a single 18-line function (well, and a type definition).

    This is somewhat petty of me, but there are fairly few cases when I'm comfortable bringing in an external dependency for an 18 line function. Given that bstr has a line about "pushing back against the micro-crate ecosystem" in the readme, I suspect you can at least see my reasoning there.

    That said, the downsides to this (aside from introducing a dependency) are subtle, and it's not exactly clear what else to do -- using the language-provided method of code sharing for this is clean and obvious, if nothing else.

Anyway, the first point is what I care more about (or at least, I'd just suck up the second point if that were it).

In particular, I'd like to introduce new formats which are not frozen and can change without worrying about compatibility with an external crate. Specifically (with most of these, I'd want the exposed data structure to be totally opaque, or potentially just expose a single function):

  1. libcore-style bitset and skiplist, as discussed in #30.
  2. re2-style range-based cyclic case-folding tables: https://github.com/google/re2/blob/master/re2/unicode_casefold.h.
  3. Apply compression algorithms to some tables, possibly only rarely used ones (for example, the name table looks like it could compress well if massaged).
    • In practice this would be somewhat customized, since most off-the-shelf lossless compressors aren't allowed to modify the input to compress better -- we could, so long as we can also emit code to reverse that modification if needed.
  4. Better trie output
  5. Fully flattened versions of some tables, where any slice or str is represented as indices into a big shared string / shared array. (The property value tables in particular practically have a bull's-eye on their back for this.)
  6. ...
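As a concrete illustration of item 2, here's a rough Rust sketch of the re2-style idea. All the names and table contents below are made up for illustration (re2's real tables also handle longer fold orbits, even/odd packing across ranges, etc.): each entry covers a range of code points and says how to step to the next code point in its folding orbit.

```rust
/// How to compute the next code point in a fold orbit (illustrative).
#[derive(Clone, Copy)]
#[allow(dead_code)]
enum Delta {
    /// Add a fixed offset.
    Fixed(i32),
    /// Even code points map up by one, odd ones down by one
    /// (covers runs where upper/lower pairs are packed alternately).
    EvenOdd,
}

struct CaseFold {
    lo: u32,
    hi: u32,
    delta: Delta,
}

// A tiny hand-written table covering ASCII letters only.
const ASCII_FOLD: &[CaseFold] = &[
    CaseFold { lo: 0x41, hi: 0x5A, delta: Delta::Fixed(32) },  // 'A'..='Z' -> lowercase
    CaseFold { lo: 0x61, hi: 0x7A, delta: Delta::Fixed(-32) }, // 'a'..='z' -> uppercase
];

/// Step to the next code point in `cp`'s folding orbit, if any.
fn apply_fold(table: &[CaseFold], cp: u32) -> Option<u32> {
    let i = table
        .binary_search_by(|e| {
            if e.hi < cp {
                std::cmp::Ordering::Less
            } else if e.lo > cp {
                std::cmp::Ordering::Greater
            } else {
                std::cmp::Ordering::Equal
            }
        })
        .ok()?;
    Some(match table[i].delta {
        Delta::Fixed(d) => (cp as i32 + d) as u32,
        Delta::EvenOdd => if cp % 2 == 0 { cp + 1 } else { cp - 1 },
    })
}
```

The appeal is that the table stays tiny (one entry per run of ranges with the same delta) while the lookup stays a plain binary search plus a small amount of arithmetic.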

Even some of the current formats are getting complex (--split-ranges --separated-values isn't that hard to write code for, but it's complex enough that you probably need documentation to do so).

But even for simple formats, like the slice of (u32, u32), I've gotten reading it wrong more times than I want to admit![0] This may say more about me as a programmer than anything else, but I would definitely have used an option to spit out the table-searching code.
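For reference, the searcher for that simple format is only a few lines. This is just my sketch of what an emitted function could look like for a sorted slice of inclusive (u32, u32) ranges, not anything ucd-generate currently produces:

```rust
/// Membership test over a sorted, non-overlapping slice of *inclusive*
/// `(start, end)` ranges. The classic bug is comparing against the
/// wrong bound, or treating `end` as exclusive.
fn range_table_contains(table: &[(u32, u32)], cp: u32) -> bool {
    table
        .binary_search_by(|&(start, end)| {
            use std::cmp::Ordering;
            if end < cp {
                Ordering::Less // this range lies entirely below cp
            } else if start > cp {
                Ordering::Greater // entirely above cp
            } else {
                Ordering::Equal // start <= cp <= end
            }
        })
        .is_ok()
}

// A tiny White_Space-like fragment, for illustration only.
const WS: &[(u32, u32)] = &[(0x09, 0x0D), (0x20, 0x20), (0x85, 0x85)];
```

Trivial, and yet exactly the kind of thing that's easy to flip a comparison in when writing it by hand for the nth time.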

Potential solutions

There are basically a few options here.

1. Stick with the external crate approach.

2. Emit code for searching in the same file.

3. Have a dedicated option for emitting the shared code in its own module...

  1. 3a: ...and have the user be responsible for passing the table and search term into the functions in that module. (This requires that the table structure be defined using just tuples, slices, etc., and not require shared type definitions.)

  2. 3b: ...and have the user pass the name/path of the shared code as an argument when generating the table (e.g. emitting to unicode_common.rs and then passing --module-path 'super::unicode_common' to everything else).

The pros/cons for these two sub-options are similar.

4. many-table output

Change ucd-generate so that it can output as many tables as are needed in a single run of the command. Then, it would generate the shared code for exactly the set of table types which are needed, as well as all the tables.


This is of course not complete, there are also hybrids you could imagine... But it's getting long.

Opinions

In reverse order -- popping each of these off my mental stack:

Choice 4: multi-table output

I like the idea of this choice: generate only what's needed, and have a full understanding of the desired dataset, potentially opening up further optimizations.

And really, I wish we lived in a world where we already had this information... But that's not the world we live in, and I don't even know what this would look like in the CLI. How much code would have to change? Honestly, it doesn't seem worth it. I suspect cross-table optimizations would not be worth a major change like this anyway, and things that do prove useful can just be handled on a case-by-case basis with a new subcommand or flag or whatever.

Choice 3: New subcommand to emit code in separate module

This seems very fiddly. In particular, I think describing the set of all the things you want to search gives it similar problems to choice 4 (you need to describe the set of tables you're generating).

If we made the code generation bits more consistent and used that to be more principled about the CLI options, it wouldn't be that bad to add, but I don't think it has enough benefits to outweigh the complexity and downsides at the moment.

Choice 2: Emit code in same module.

This doesn't have many major downsides, but looks a little goofy. I think the concerns about code style are fine so long as what we emit passes default clippy lints and such.

If we do this, I suspect it's only a matter of time before someone asks for choice 3, but maybe things will be cleaner then and it will be easier to add. This wouldn't prevent that, or anything.

Choice 1: Status quo

If I liked this, I wouldn't have filed this issue.

Wrapping up

So yeah, number 2 is my preference. In practice, the changes here would be:

  1. Adding a flag to emit code that searches the tables for existing formats.
  2. Support this flag for new formats, potentially exclusively, if the format is unreasonable to access manually or is still subject to change.
  3. Add an option to emit --trie-sets which are standalone and don't depend on ucd-trie.

Thoughts?

(Note that a large part of the reason behind this is that I feel like just submitting a PR with one of those weird data structures and the code generated next to it would, at a minimum, raise eyebrows. I also feel like the current status is not ideal.)

BurntSushi commented 2 years ago

I think our sensibilities are pretty well aligned here. (2) is also kind of where I land. I also do not like forcing folks to bring in tiny micro-crates. That leads me to not want to make extremely complex formats, but that in turn obviously biases against optimizing the heck out of these tables. And I'm generally pretty amenable to doing what we can to optimize them, since they tend to be so widely used and foundational.

So yeah, I agree. If you wanted to go down this route, I'd support (2).

thomcc commented 2 years ago

Wow, it's great that you're active in this repo again. The timing almost couldn't be better, as I've been playing around with Unicode tables very recently (for libcore). Most of these ideas would absolutely require custom code to query the tables, though, as is the case for pretty much anything more complex than "binary search".

Many of my ideas probably ended up being non-ideal for libcore, where saving a few kilobytes on table size isn't worth a performance regression, since most of the tables it has are used by methods people actually use.

But for a case like regex-syntax they'd be... a lot more compelling: it has far more tables (almost a complete set), and many of them are pretty rarely needed. (Actually, the regex-syntax case offers a lot of opportunities that could significantly reduce the table size, but I'll file an issue about that when I have more time to elaborate.)

BurntSushi commented 2 years ago

Indeed, regex-syntax can very likely admit slower lookups in favor of lower memory usage. We do have to keep an eye on it somewhat, though, since any operation can very easily be repeated many times via repetition operators or for very large Unicode classes with lots of ranges. It's probably the case that even in those areas, regex compilation will dwarf whatever regex-syntax does, but we do need to at least keep an eye on it. (I am working on a new and more comprehensive benchmark suite for the regex crate, which should include things like "how long does it take to produce the HIR for this regex.")