GrammaTech / gtirb

Intermediate Representation for Binary analysis and transformation
https://grammatech.github.io/gtirb/
Other
305 stars 36 forks source link

New table API #18

Closed nweston closed 5 years ago

nweston commented 5 years ago

There are a few specific things I'd like feedback on:

  1. Naming We talked about changing to AuxData but it doesn't exactly roll off the tongue. I've stuck with Table for now. I think it describes the primary intended use pretty well, even if it doesn't capture everything the class can do.

  2. Interface I've provided a variant-like interface with an assignment operator, get, and get_if, which minimized the changes to existing code. I couldn't find any indication of whether one is supposed to overload std::get or implement it in your own namespace. Any opinions? I've gone with the latter for now but it should be an easy change.

  3. Error handling get() does its type check with Expects. If you try to read the wrong type it causes and assertion failure/crash. This doesn't seem great but I'm not sure what other option we have without exceptions. I also provided get_if() which returns nullptr on a type mismatch, but it's a little more cumbersome to use.

  4. Serialization format Last week we talked about adding more structure to the protobuf message for Tables, with the idea that this would simplify cross-language access. But I don't think this is actually possible because protobuf doesn't have templates/type variables. So basically it can't express the fact that sequence can contain many different types, but any given sequence has to be homogeneous. Instead you end up with every element of a sequence possibly having a different type. Not only is this inefficient, it would allow clients in other languages to create tables that the C++ API can't read.

    Given that limitation, I decided to stick with the current approach of serializing the contents to a byte array and having protobuf store that. Clients in other languages will need some code to unpack the byte arrays, but the format is pretty simple and this shouldn't be a huge burden. We could provide example code in one or two common languages when we get to that point.

Overview

The interface here is pretty similar to the old tables:

   table = std::vector<int64_t>(...);    // Store something
   auto &vec = get<int64_t>(table);  // Retrieve something

The big difference is with nested containers. Previously if you wanted, say, a map of ints to strings, you had to use std::map<int64_t, table::ValueType>, where ValueType is itself a variant. Iterating over that map involved multiple levels of get calls:

for (const auto &elt : get<std::map<int64_t, table::ValueType>>(table)) {
  auto &str = get<std::string>(elt.second);
  ...
}

Now you can store a map<int, string> (or any other arbitrary nesting of containers) directly, and you only need a single get call to access it.

The other notable improvements are support for all integral types, and tuples.

Serialization

There are two pieces to this. First, we generate a portable type ID. Containers are generalized (e.g. std::vector becomes sequence), and integers are given a name which reflects their exact size (e.g. int becomes int32_t or whatever is appropriate). So you end up with a string like "sequence<map<int32_t, string>>".

Second, the structure is flattened into a byte array. For primitive types, this means swapping to little-endian order and copying the bytes directly. For containers, this means writing out the number of elements (as a uint64_t), then writing out all the elements in a contiguous block.

When deserializing, we initially just load the byte array. Unpacking is deferred until someone calls get. At that point, we check that the type-id of the requested type matches what we serialized. For example, if you serialize a 32-bit int on one platform and try to get it as a 64-bit int on another platform, you'll get an error at runtime.

If the types match, we then unpack the byte array, which is a straightforward reversal of the packing process.

AaronBallman commented 5 years ago

(Initial thoughts, better review to come.)

  1. I still think Table is a bad name (tables are things with rows and columns, this has neither). I don't think AuxData is the best name, but it at least doesn't come with the baggage that "table" does. I'd be fine with AuxiliaryData, AncillaryData, while DataOnTheDownLow might be a bad choice. I'm betting we can come up with more names if we try. :-)
  2. We can't overload std::get() (it's UB to try that, see [namespace.std]p1. I think we will be okay adding a template specialization, so long as the implementation adheres to the specification of std::get() and at least one of the types involved in the specialization is a type we define. Not certain if that's our circumstance though (I've not gotten that far).
  3. Assertion on failure is not okay, but based on the problem, it sounds like a good use of std::optional. Have the call to get<T>() return a std::optional<T> and the user can know whether it succeeded or not. If we want some sort of further information (such as errors), we could implement something akin to ErrorOr<> from LLVM.
  4. I think this approach is reasonable, but we should probably have a list of "things to keep in mind" for users in the nearby documentation, or else they're liable to run into problems. Things like: structure layout may be different, endianness is a thing, etc.