BurntSushi / ucd-generate

A command line tool to generate Unicode tables as source code.
Apache License 2.0
93 stars 21 forks source link

Size optimizations for range slices. #35

Open thomcc opened 4 years ago

thomcc commented 4 years ago

Note: This branch's parent is the branch from #34, so it includes that commit too. Once that lands I can rebase this branch onto master.

This implements two size optimizations for slice range tables, described below. Neither are that interesting or clever.

In addition to this, I also reran the table generation code (it had gone stale) and changed ./scripts/generate-unicode-tables so that it can now optionally take a path to the ucd-generate binary. This was mostly so I could add benchmarks containing these optimizations.

Split character ranges

Instead of generating &[(u32, u32)], if --split-ranges is passed we emit two tables; one including ranges that fit in u16, and one for those that do not. Any that span the boundary are split in to two ranges, so that for a given char, you only have to test one table or the other, based on the value of that character.

Note that this is incompatible with --char, as there's no way to use a char literal and have 16bit output.

In practice how much this saves depends on the table, and the upper bound is reducing size to 50%. That said, in practice it does remarkably well for some tables, even beating rustc's skiplist format on a few, despite how simple it is.

Searching is carried out by some function like this:

fn search_split_property_table(
    search_char: char,
    table: (&[(u16, u16)], &[(u32, u32)]),
) -> bool {
    if let Ok(search_char16) = u16::try_from(search_char as u32) {
        table.0.binary_search_by(|&(s, e)| {
            if s > search_char16 {
                Ordering::Greater
            } else if e < search_char16 {
                Ordering::Less
            } else {
                Ordering::Equal
            }
        }).is_ok()
    } else {
        table.1.binary_search_by(|&(s, e)| {
            if s > search_char {
                Ordering::Greater
            } else if e < search_char {
                Ordering::Less
            } else {
                Ordering::Equal
            }
        }).is_ok()
    }
}

It's a little complex, but I don't think it warrants being put in a separate library we ask them to depend on.

Note that this also works with tables that map codepoint ranges to values. For example: ucd-generate general-category --rust-enum --split-ranges will produce a table like:

pub const GENERAL_CATEGORY: (
    &'static [(u16, u16, GeneralCategory)],
    &'static [(u32, u32, GeneralCategory)],
) = (&[
    // ...
], &[
    // ...
]);

When used like this, searching is similar. However, for cases like this, passing the --separate-values is beneficial.

Separated value arrays

The general category example above shows a flaw, which is that despite the fact that size_of::<GeneralCategory>() is 1.

This is not a flaw specific to the split ranges code, and exists for the other mapped ranges too.

The optimization places the "value" type into a seprate array. That is, if your table was a &[(char, char, SomeEnum)], this optimization turns it into: (&[(char, char)], &[SomeEnum]), where both slices have the same length

To search this, you do a binary search over the range table, and use the index in the enum table.

fn search_value_table<S>(
    search_char: char,
    table: (&[(u32, u32)], &[S]),
) -> Option<&S> {
    table.0.binary_search_by(|&(s, e)| {
        if s > (search_char as u32) {
            Ordering::Greater
        } else if e < (search_char as u32) {
            Ordering::Less
        } else {
            Ordering::Equal
        }
    }).ok().map(|index| table.1[index])
}

The benefit here is that now, neither slice will have any padding values.

Additionally, this works with the --split-ranges option mentioned above, in order to reduce size as far as possible.

One note is we don't emit two separate separated value arrays -- just one whose length is the sum of the of the (u32, u32) and (u16, u16). I don't think this matters in practice. There are slight performance benefits this way (shared object concerns), but I'd be willing to have two -- a point of hesitation I had that pushed me in the current direction was: it could be easy to mix up which table was which, as they're the same type.

My hope is that if there's only one array, you'll look it up.

fn search_split_value_table<S>(
    search_char: char,
    table: (&[(u16, u16)], &[(u32, u32)], &[S]),
) -> Option<&S> {
    if let Ok(search_char16) = u16::try_from(search_char as u32) {
        table.0.binary_search_by(|&(s, e)| {
            if s > search_char16 {
                Ordering::Greater
            } else if e < search_char16 {
                Ordering::Less
            } else {
                Ordering::Equal
            }
        }).map(|i| table.2[i]).ok()
    } else {
        table.1.binary_search_by(|&(s, e)| {
            if s > search_char {
                Ordering::Greater
            } else if e < search_char {
                Ordering::Less
            } else {
                Ordering::Equal
            }
            // Note: the offset returned by the search of the 32-bit
            // table is assumed to be relative to the end of the
            // 16-bit table.
        }).map(|i| table.2[i + table.1.len()]).ok()
    }
}

In practice, all this together improves performance a lot. More than I would have expected -- it seems to be one of the faster entries in the benchmark -- (caches...), however the main reason I added it was for size reduction. As an example, when both optimizations are used on the General_Category property value table, it shrinks from 38388 bytes to 18703 bytes.


Edit: A couple of notes I thought of:

  1. I'm in no way tied to these names. I don't think they're particularly good, even.
  2. I wish there were a place in ucd-generate's documentation to put more detailed descriptions of the table formats, and sample usage code. I remember not even being sure if the u32 pairs were inclusive or exclusive when I first used it.
BurntSushi commented 2 years ago

So this all looks great. One question I have is how much the separated values part of this helps? It does seem to add a fair bit of complexity. So what I'm wondering is whether the juice is worth the squeeze. (Although I'm still inclined to merge it, since I see ucd-generate as a dumping grounds of sorts.)

I'm going to cut a new release soon with Unicode 14, but without this. (But with your [char; 3] PR.) Once that's out, if you wanted to get this PR into shape, I'll do my best to take a closer look at it soon and try to get it into another release.

thomcc commented 2 years ago

So this all looks great. One question I have is how much the separated values part of this helps? It does seem to add a fair bit of complexity. So what I'm wondering is whether the juice is worth the squeeze.

Yeah it's hard to say how much it matters. There are two benefits:

  1. It's good for performance; binary search is moderately cache hostile, so you want to ensure that the data that gets pulled in for each fetch is relevant. That said, I doubt these are frequently on the hot path in real code, and it would be very hard to measure as micro-benchmarks for cold-cache situations are very hard.

  2. More importantly, in cases where the value is 1 byte (like enums and such), it reduces total table size by avoiding the padding required to align to a char/u32/u16. This can be pretty significant, as it ends up causing overhead on each entry. For (u16, u16, u8) this is ~16% of the table spent on padding, and for (u32, u32, u8), it's 25%.

    However, the padding can also avoided by changing u16 to [u8; 2] and u32 to [u8; 3] (well, [u8; 4] but if you're going to do that, you might as well take advantage of the fact that Unicode scalars are 21 bits while you're at it). Then, ([u8; 3], [u8; 3], u8) and ([u8; 2], [u8; 2], u8) have no padding (in practice, obviously the compiler is free to do whatever it wants to repr(Rust) types, but it's reasonable to assume it won't decide to waste bytes for no reason).

    That said, that would be something I'd add in another patch, as it's starting to get to a point where #39 is useful.

So, it's hard to say. Overall, my hunch is that while we can avoid the padding in this case, there are probably other cases where we'd like to split things out like that (for one reason or another). So, if the complexity would help out other cases that want to separate keys from values, it's probably worth it. If it's just for this case, and probably couldn't be reused, it's a lot more dubious. I haven't looked at the code in over two years either, so I don't remember the details.

Note: I haven't looked at this patch in over 2 years, it's also possible that I can make it simpler, I'll look when I get a chance.