PeytonT / serde_ion

Rust implementation of serialization and deserialization of the Ion data format

Choose data structure(s) to use for symbol tables #13

Open siler opened 4 years ago

siler commented 4 years ago

I ran into some rough edges when using a Vec<> symbol table while running some of the tests. The tests in question deal with large values for max_id (a subset of good/subfield*.ion). As an example:

$ion_symbol_table::{
    imports:[ { name: "com.amazon.blah.blah.blah",
                version: 1,
                max_id: 268435444 // 268435454 - 1 - 9 (system symbols)
              } ],
    symbols:[ "boundary-1", "boundary", "boundary+1" ]
}

This currently causes the tests to hang on my system (though I am not particularly patient with what I consider unit tests).

The Rules

The rules for symbol lookup are outlined in http://amzn.github.io/ion-docs/guides/symbols-guide.html

In addition to that, the rules for symbol table construction are of interest: http://amzn.github.io/ion-docs/docs/symbols.html#local-symbol-tables

The spec describes:

SymbolTokens can be looked up via their text or their symbol ID. The symbol ID is a monotonically increasing key. When looked up by text, the key with the lowest value should be returned. That is, the symbol that was registered first.

Usage

This means our insertion pattern is essentially append-only (as a bonus in most cases we should know the size of large tables before we need to allocate them, not counting symbols encountered in the token stream).

For reads, it's more complicated since we essentially need a bi-directional index which "summarizes" in one direction.

If we focus on the high-performance use cases then we should favor the index lookup use case, which is what we currently do. The reverse lookup is unfortunately O(n).

In addition to this, the aforementioned tests essentially require a sparse data structure. I'm not sure if this is a realistic requirement for other use cases as any situation where this could arise would surely be an error (right?).

A simple solution is a pair of data structures (list, map). Is there a use case that requires locking? Text parsing has to be synchronous as encountered symbols must be added to the symbol table in order of occurrence. The binary format supplies all symbols up front and relies on the symbol IDs for lookup, so the symbol table would not be modified if skip-searched from multiple threads.
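For concreteness, here is a rough sketch of what that (list, map) pair could look like; the names and layout are just illustrative, not anything already in serde_ion:

use std::collections::HashMap;

// Sketch of a symbol table backed by a Vec for ID -> text lookup
// and a HashMap for text -> ID lookup.
struct SymbolTable {
    by_id: Vec<Option<String>>,      // index is the symbol ID; None = unknown text
    by_text: HashMap<String, usize>, // stores the lowest ID for each text
}

impl SymbolTable {
    fn intern(&mut self, text: String) -> usize {
        let id = self.by_id.len();
        // Per the spec, only the first (lowest) ID for a given text is kept.
        self.by_text.entry(text.clone()).or_insert(id);
        self.by_id.push(Some(text));
        id
    }

    fn text(&self, id: usize) -> Option<&str> {
        self.by_id.get(id).and_then(|t| t.as_deref())
    }

    fn id(&self, text: &str) -> Option<usize> {
        self.by_text.get(text).copied()
    }
}

Both directions are O(1), and because a text always maps to its lowest ID, the map entry never needs to change once set, so the append-only insertion pattern is preserved.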

PeytonT commented 4 years ago

This means our insertion pattern is essentially append-only (as a bonus in most cases we should know the size of large tables before we need to allocate them, not counting symbols encountered in the token stream). For reads, it's more complicated since we essentially need a bi-directional index which "summarizes" in one direction.

My lack of attention to the text format is showing here. I had the impression that symbol tables in the text format were just weird-looking values, but I suppose I'm missing something.

I've been thinking about this topic for the binary encoding, since I'm getting around to working on serialization. A first pass over the Ion stream to encode can accumulate the symbol values into a counting map. Sorting the keys by their counts greatest to least gives the order that they should appear in the symbol table. The local symbol table can then be serialized, and the counting map can be reused by updating the count values to now store the index of each key. Then a second pass over the Ion stream can serialize the values, looking up each symbol's text in the map to find the symbol ID.
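Roughly, the first pass could look something like this (just a sketch of the counting-map idea, with made-up names):

use std::collections::HashMap;

// First pass: count how often each symbol text occurs in the values to encode,
// then assign the smallest symbol IDs to the most frequent texts.
fn assign_symbol_ids<'a>(symbols: impl Iterator<Item = &'a str>) -> HashMap<&'a str, u64> {
    let mut counts: HashMap<&str, u64> = HashMap::new();
    for text in symbols {
        *counts.entry(text).or_insert(0) += 1;
    }

    // Sort texts by descending count so frequent symbols get compact IDs.
    let mut ordered: Vec<(&str, u64)> = counts.into_iter().collect();
    ordered.sort_by(|a, b| b.1.cmp(&a.1));

    // Replace the counts with the assigned IDs. Local symbols start at ID 10,
    // after $0 and the nine system symbols.
    ordered
        .into_iter()
        .enumerate()
        .map(|(i, (text, _count))| (text, 10 + i as u64))
        .collect()
}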

In addition to this, the aforementioned tests essentially require a sparse data structure. I'm not sure if this is a realistic requirement for other use cases as any situation where this could arise would surely be an error (right?).

I agree that this is unrealistic. The test's max_id is probably larger than the combined number of words in every language. Moreover, at a generous 4 bytes per symbol text (otherwise some overlap) it's more than a gigabyte. Hardly seems likely to result in compression, given that most of the symbol IDs would need to be 5 bytes in the binary encoding.

This is actually something irritating about Ion, the spec lays out support for all sorts of arbitrary-sized things, and then implementations frequently don't actually support them due to implementation overhead.

I don't like to artificially limit sizes that the spec says should be unbounded, but I think we should for things that are clearly not useful. As an example, throughout the binary parser there's an assumption that length values can't be greater than max usize, since otherwise, how could the whole thing be addressable? Moreover, paying significant overhead during parses of reasonable data so as to be able to support unreasonable data doesn't sound like a good idea.

Side note: I've been pondering a macro for generating binary parsing combinators for a sequence of maximum byte lengths of length values. At some point I might propose an optional behavior, where local symbol tables could contain optional max_varint and max_varuint fields, giving a hint to the parser that it could switch until the next IVM to a codepath using only Sized buffers.

A simple solution is a pair of data structures (list, map).

I don't know what the text encoding needs (perhaps you have a reference), and I think the binary encoding should use different data structures for serializing and deserializing. To support serializing, a (list, map) is needed, but for deserializing we only need (list, map) if we want to support absurdly large symbol IDs. It might make sense to split the difference and have a (list, Option<map>, max_defined_symbol_id), with a reasonable limit on the maximum size of the list before symbols would begin getting sparsely populated in the map. Knowing the maximum defined symbol ID would let the parser determine if a symbol ID not found in the sparse map should be assigned a default value or should error.
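Something like this is what I have in mind (a sketch only; the field names and limits are placeholders):

use std::collections::HashMap;

// Read-side table: a dense Vec for the common case, an optional sparse map for
// absurdly large symbol IDs, and the maximum defined ID so out-of-range lookups
// can be turned into errors.
struct ReadTable {
    dense: Vec<Option<String>>,           // IDs 0..dense.len()
    sparse: Option<HashMap<u64, String>>, // IDs past the dense limit
    max_defined_symbol_id: u64,
}

impl ReadTable {
    fn lookup(&self, id: u64) -> Result<Option<&str>, &'static str> {
        if id > self.max_defined_symbol_id {
            // Never defined by any table in scope: an error.
            return Err("symbol ID out of range");
        }
        if let Some(text) = self.dense.get(id as usize).and_then(|t| t.as_deref()) {
            return Ok(Some(text));
        }
        // Defined but not in the dense Vec: try the sparse map. Ok(None) means
        // "defined, but text unknown", which the caller can map to a default.
        Ok(self.sparse.as_ref().and_then(|m| m.get(&id)).map(|s| s.as_str()))
    }
}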

Is there a use case that requires locking? Text parsing has to be synchronous as encountered symbols must be added to the symbol table in order of occurrence. The binary format supplies all symbols up front and relies on the symbol IDs for lookup, so the symbol table would not be modified if skip-searched from multiple threads.

The binary format shouldn't ever require locking. Between one IVM and the next the local symbol table doesn't update, and the symbol table is replaced after the next IVM is encountered.

siler commented 4 years ago

My lack of attention to the text format is showing here. I had the impression that symbol tables in the text format were just weird-looking values, but I suppose I'm missing something.

In text they can show up as strings like symbol, 'spaced symbol', or $23. That last one is special, and is the text representation of a symbol ID. So in theory you can do the same thing you would in the binary format and stash all your symbols up front in a local symbol table.

I've been thinking about this topic for the binary encoding, since I'm getting around to working on serialization. A first pass over the Ion stream to encode can accumulate the symbol values into a counting map. Sorting the keys by their counts greatest to least gives the order that they should appear in the symbol table. The local symbol table can then be serialized, and the counting map can be reused by updating the count values to now store the index of each key. Then a second pass over the Ion stream can serialize the values, looking up each symbol's text in the map to find the symbol ID.

Out of curiosity, why is the order of items in the encoded symbols list important?

This is an interesting part of Ion. Depending on what your transmission patterns are, or whether you want to be able to work on chunks, the requirements for when to write a symbol table are different. It feels like a latency vs. size problem. The more symbol tables you write, the more space you're wasting over the wire. The less often you write a table, the longer you have to buffer data before you send it.

That coupled with the capability for appending via the $ion_symbol_table import means keeping an encoder around can be useful, but maybe only in certain circumstances (i.e. when the set of shared symbol tables used in a stream doesn't change - otherwise you'd thrash rebuilding them).

I don't like to artificially limit sizes that the spec says should be unbounded, but I think we should for things that are clearly not useful. As an example, throughout the binary parser there's an assumption that length values can't be greater than max usize, since otherwise, how could the whole thing be addressable? Moreover, paying significant overhead during parses of reasonable data so as to be able to support unreasonable data doesn't sound like a good idea.

Agreed, I had that same thought w/r/t addressability when working through part of the text parser.

Side note: I've been pondering a macro for generating binary parsing combinators for a sequence of maximum byte lengths of length values. At some point I might propose an optional behavior, where local symbol tables could contain optional max_varint and max_varuint fields, giving a hint to the parser that it could switch until the next IVM to a codepath using only Sized buffers.

I'm getting a compile-time functional lens vibe from this and it's great, though I don't understand it well enough to know why that optional behavior is necessary.

A simple solution is a pair of data structures (list, map).

I don't know what the text encoding needs (perhaps you have a reference), and I think the binary encoding should use different data structures for serializing and deserializing. To support serializing, a (list, map) is needed, but for deserializing we only need (list, map) if we want to support absurdly large symbol IDs. It might make sense to split the difference and have a (list, Option<map>, max_defined_symbol_id), with a reasonable limit on the maximum size of the list before symbols would begin getting sparsely populated in the map. Knowing the maximum defined symbol ID would let the parser determine if a symbol ID not found in the sparse map should be assigned a default value or should error.

I've also been thinking about some kind of interval-based data structure that would allow delegation to symbol tables. The current symbol table, for example, would always have a first interval of 0-9 mapping with a 0 offset to the system symbol table. The next interval would map to a shared table, or to the current table if there were none.

That adds a layer of indirection and is potentially a little slower, but avoids many allocations that would otherwise occur. It would also probably be more complicated with Rust's memory model, though slightly simplified by the fact that shared symbol tables can be read-only. I'm currently leaning towards "too complicated" for this, but depending on shared symbol table usage it might be advantageous.
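Roughly what I'm picturing (a sketch only, ignoring the ownership details for shared tables):

// Each interval covers a contiguous range of symbol IDs and delegates to an
// underlying table (system, shared, or local) with an offset.
struct Interval<'a> {
    first_id: u64,               // first symbol ID covered by this interval
    last_id: u64,                // last symbol ID covered (inclusive)
    table: &'a [Option<String>], // the delegated-to symbols
}

struct DelegatingTable<'a> {
    intervals: Vec<Interval<'a>>, // sorted by first_id, non-overlapping
}

impl<'a> DelegatingTable<'a> {
    fn lookup(&self, id: u64) -> Option<&str> {
        // Binary search for the interval containing `id`, then index into the
        // delegated table using the interval's offset.
        let idx = self.intervals.partition_point(|iv| iv.last_id < id);
        let iv = self.intervals.get(idx)?;
        if id < iv.first_id {
            return None;
        }
        iv.table.get((id - iv.first_id) as usize).and_then(|t| t.as_deref())
    }
}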

PeytonT commented 4 years ago

In text they can show up as strings like symbol, 'spaced symbol', or $23. That last one is special, and is the text representation of a symbol ID. So in theory you can do the same thing you would in the binary format and stash all your symbols up front in a local symbol table.

Thanks for clearing that up for me.

Out of curiosity, why is the order of items in the encoded symbols list important?

Compactness of the binary encoding. Depending on where they show up, symbols can be encoded as VarUInts. The first 2^7 symbol IDs can be represented with only 1 byte, but the next (2^14 - 2^7) need 2 bytes, etc. You lose the first 10 slots to $0 and the system symbols, and having more than 2^14 symbols seems very unlikely, so it's basically a matter of ensuring that if you have more than 118 symbols, the most frequently used ones are first. There are also plausible cache performance benefits from keeping all of the most used symbols together at the beginning of the Vec.

Side note: This is why if we do end up going with a max size for the symbol table Vec I would suggest 2^14.
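For reference, those thresholds fall out of the VarUInt length math; a quick sketch (not the actual parser code):

// Number of bytes needed to encode `value` as an Ion VarUInt:
// 7 payload bits per byte, with the final byte carrying the stop bit.
fn var_uint_len(value: u64) -> u32 {
    let bits = 64 - value.leading_zeros();
    ((bits + 6) / 7).max(1)
}

fn main() {
    assert_eq!(var_uint_len(127), 1);    // IDs up to 2^7 - 1 fit in one byte
    assert_eq!(var_uint_len(128), 2);
    assert_eq!(var_uint_len(16_383), 2); // IDs up to 2^14 - 1 fit in two bytes
    assert_eq!(var_uint_len(16_384), 3);
}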

That coupled with the capability for appending via the $ion_symbol_table import means keeping an encoder around can be useful, but maybe only in certain circumstances (i.e. when the set of shared symbol tables used in a stream doesn't change - otherwise you'd thrash rebuilding them).

I think a persistent encoder is definitely desirable. At some point I plan to model an Ion Stream as something other than a Vec<Value>, mostly so that appending can be done efficiently.

I'm getting a compile-time functional lens vibe from this and it's great, though I don't understand it well enough to know why that optional behavior is necessary.

Not necessary, but likely a significant performance boost. Binary encoded Ion frequently employs a "keep reading until you find a stop signal" pattern, which, because variably sized numerics can be greater than the maximum size of any primitive, requires allocating ?Sized containers any time you want to parse one. Usually many times per Value. Knowing that, for example, all variably sized numerics in a segment of an Ion stream will fit in an int64/uint64 means you can parse them with non-allocating combinators that return int64/uint64 instead of BigInt/BigUint. (Though I believe BigUint has an internal optimization to avoid allocations for small uints.)

This is something I'm probably going to hold off on doing at least until const generics are stabilized. Moreover, it would require buy-in from the Ion maintainers, which is hard to get.
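To make the fast path concrete, a combinator for it might look something like this (a sketch, not anything that exists in the crate yet):

// Fast path: parse an Ion VarUInt into a u64 without allocating. Returns None
// if the input ends early or the value wouldn't fit in 64 bits, in which case
// the caller could fall back to a BigUint-based parser.
fn parse_var_uint_u64(input: &[u8]) -> Option<(u64, usize)> {
    let mut value: u64 = 0;
    for (i, &byte) in input.iter().enumerate() {
        // Shifting by 7 would drop bits: too big for the fast path.
        if value.leading_zeros() < 7 {
            return None;
        }
        value = (value << 7) | u64::from(byte & 0x7F);
        if byte & 0x80 != 0 {
            // The high bit marks the final byte of the VarUInt.
            return Some((value, i + 1));
        }
    }
    None // ran out of input before the stop bit
}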

I've also been thinking about some kind of interval-based data structure that would allow delegation to symbol tables.

An interval map that maps to offset dyn trait objects that impl Index seems worthwhile to consider. In the normal case each interval would just be a Vec, but in the pathological case it could fall back to some kind of sparse default-map for very large intervals. This seems particularly valuable when shared symbol tables are involved, but I think we would want to benchmark if we went down this road.

Leaving aside shared symbol tables, another option would be to have separate Vec-only and Map-only code paths, and have the parser optimistically use the Vec codepath unless it encountered a local symbol table with greater than, say, 2^14 slots, at which point it would bail out and restart the segment using the slow Map codepath. This would save us from having to implement complex data structures, and would ensure ~no overhead in what I imagine is the usual scenario of no shared symbol tables and less than 2^14 total symbols.
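A sketch of what that could look like (the threshold and names are placeholders):

use std::collections::HashMap;

const DENSE_LIMIT: usize = 1 << 14; // bail-out threshold suggested above

// Two codepaths: a dense Vec by default, a sparse map as the fallback.
enum LocalSymbols {
    Dense(Vec<Option<String>>),
    Sparse(HashMap<u64, String>),
}

impl LocalSymbols {
    // Returns None if the declared max_id is too large for the dense path,
    // signalling the parser to restart the segment on the sparse codepath.
    fn try_dense(max_id: usize) -> Option<LocalSymbols> {
        if max_id <= DENSE_LIMIT {
            Some(LocalSymbols::Dense(vec![None; max_id + 1]))
        } else {
            None
        }
    }

    fn lookup(&self, id: u64) -> Option<&str> {
        match self {
            LocalSymbols::Dense(v) => v.get(id as usize).and_then(|t| t.as_deref()),
            LocalSymbols::Sparse(m) => m.get(&id).map(|s| s.as_str()),
        }
    }
}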