IreneKnapp / modern-data

A self-describing binary data format for dependently-typed object graphs.
MIT License

Integer endianness #3

Open IreneKnapp opened 9 years ago

IreneKnapp commented 9 years ago

The design as it stands does not consider the question of integer endianness at all. This is definitely an oversight, since it's a design goal to be able to describe existing data formats. There's a bit of tension because it's also a goal that the same schemas should work for at-rest, in-memory, and in-transit usage. Trying to work with non-native formats in memory is always going to be slower than working with native ones...

So there's a notion that I believe the system needs but which is not yet expressed: a special node type that says something to the effect of "here's the low-level type for how this data is represented; here's the high-level type for how it's used; and here's an isomorphism between them", where by isomorphism I mean a pair of functions and a proof that they form a 1:1 mapping that round-trips losslessly.
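
(For intuition, here's a rough C sketch, with hypothetical names, of what the payload of such a node could amount to at runtime; the "proof" part obviously can't be captured in C beyond a dynamic round-trip check.)

    #include <stdint.h>
    #include <stdbool.h>

    /* Hypothetical sketch only: a "representation" node would pair the
     * low-level (as-stored) type with the high-level (as-used) type and
     * carry the two conversion functions between them. */
    typedef struct {
        uint32_t (*decode)(uint32_t stored);  /* low-level -> high-level */
        uint32_t (*encode)(uint32_t used);    /* high-level -> low-level */
    } modern_iso32;

    /* The law the schema-level proof would establish, here reduced to a
     * check on one value: the pair round-trips losslessly. */
    static bool iso32_round_trips(const modern_iso32 *iso, uint32_t stored)
    {
        return iso->encode(iso->decode(stored)) == stored;
    }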

This is kind-of overlapping with the functionality of "satisfies", except that it's more than a predicate: it lets you dig into the internals. It would be pretty much required if you wanted to write a schema for a .jar that described both the ZIP container format and the formats of the files inside it. (That would be a rather enormous schema, of course...) It's "required" in the sense that, while anything like this could be expressed as a predicate, doing that loses the ability for tools with no special knowledge of the schema to traverse inside such a container.

I've earlier outlined the type-class "dictionary of accessors" approach, which doesn't require new primitives at all, and there's overlap with that as well. I do think each of these three options serves a separate purpose, but I'm running out of words right now... I'd like to go over this all at some point and figure out whether all three are really useful, or whether they can be consolidated into two, or what. That's a larger-scoped question than integers, though.

Anyway, the isomorphism/container idea is not in yet, because I feel like "proof of 1:1-ness" is a complicated notion, and I need to be able to play with the system before I can figure out a convenient way to actually write such a proof. But byte order is a trivial mapping, and one easy solution would be to add type nodes for each byte order and size.

Also, it's tempting to specify a particular order and let higher-level code describe the translation... but that would create performance hassles someday, since many architectures have fast ways to byte-swap integers and it's ideal not to have to make the optimizer detect when a translation is a byte swap. That's several stretch goals away, but it's still a clear use case that should be supported.
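
(For concreteness, a sketch of the swap itself; the shift-and-mask pattern below is one that current compilers typically recognize and collapse into the single byte-swap instruction mentioned above, e.g. bswap on x86 or rev on ARM.)

    #include <stdint.h>

    /* Portable 32-bit byte swap, "3 2 1 0" <-> "0 1 2 3". */
    static uint32_t swap32(uint32_t x)
    {
        return  (x >> 24)
             | ((x >>  8) & 0x0000FF00u)
             | ((x <<  8) & 0x00FF0000u)
             |  (x << 24);
    }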

The existing binary-number types are:

    modern_node_type_int8_value = 8,
    modern_node_type_int16_value = 9,
    modern_node_type_int32_value = 10,
    modern_node_type_int64_value = 11,
    modern_node_type_nat8_value = 12,
    modern_node_type_nat16_value = 13,
    modern_node_type_nat32_value = 14,
    modern_node_type_nat64_value = 15,
    modern_node_type_int8_type = 26,
    modern_node_type_int16_type = 27,
    modern_node_type_int32_type = 28,
    modern_node_type_int64_type = 29,
    modern_node_type_nat8_type = 30,
    modern_node_type_nat16_type = 31,
    modern_node_type_nat32_type = 32,
    modern_node_type_nat64_type = 33,

There's also:

    modern_node_type_float32_value = 16,
    modern_node_type_float64_value = 17,
    modern_node_type_float32_type = 34,
    modern_node_type_float64_type = 35,

But I believe I looked into this at one point and determined that the IEEE float standard specifies one particular byte order, although it describes quite a few lengths. (There's verbiage somewhere in the docs about why Modern Data doesn't have types for 80-bit or 128-bit floats; briefly, because they're too much of a portability nuisance.)

So the 8-bit types don't need to worry at all, of course, and I propose to leave those as they are. There have been architectures in the past that use orders such as "2 3 0 1", but I'm inclined to let those be left to a more-general mechanism and only deal with "0 1 2 3" and "3 2 1 0" (little-endian and big-endian respectively). There's certainly no optimization possible for the strange orders, anyway. So that's two orders for each size.
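
(As an illustration of the two orders in question, here's how a reader might decode a 32-bit value from raw bytes in each of them; the function names are mine, not from the library.)

    #include <stdint.h>

    /* "0 1 2 3": little-endian, least-significant byte first. */
    static uint32_t read32_little_endian(const uint8_t *p)
    {
        return  (uint32_t)p[0]
             | ((uint32_t)p[1] << 8)
             | ((uint32_t)p[2] << 16)
             | ((uint32_t)p[3] << 24);
    }

    /* "3 2 1 0": big-endian, most-significant byte first. */
    static uint32_t read32_big_endian(const uint8_t *p)
    {
        return ((uint32_t)p[0] << 24)
             | ((uint32_t)p[1] << 16)
             | ((uint32_t)p[2] << 8)
             |  (uint32_t)p[3];
    }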

In general I've tried to be extremely explicit with symbol names throughout Modern Data, and people have actually thanked me for it. There are already a few concepts that use multiple words, like divide_towards_zero vs divide_towards_negative_infinity. So I'm thinking of the following:

    modern_node_type_int16_little_endian_value = ...,
    modern_node_type_int16_big_endian_value = ...,

And similarly for all the other sizes > 8, for both the int and nat variants.
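
(Spelled out, with the number assignments left open and the corresponding _type nodes forked the same way, that would presumably give:)

    modern_node_type_int16_little_endian_value = ...,
    modern_node_type_int16_big_endian_value = ...,
    modern_node_type_int32_little_endian_value = ...,
    modern_node_type_int32_big_endian_value = ...,
    modern_node_type_int64_little_endian_value = ...,
    modern_node_type_int64_big_endian_value = ...,
    modern_node_type_nat16_little_endian_value = ...,
    modern_node_type_nat16_big_endian_value = ...,
    modern_node_type_nat32_little_endian_value = ...,
    modern_node_type_nat32_big_endian_value = ...,
    modern_node_type_nat64_little_endian_value = ...,
    modern_node_type_nat64_big_endian_value = ...,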

That does bring up the topic of the builtins; I suppose these will all also have to be forked. I'm a little distressed that the number assignments are going to wind up in weird orders, but I guess trying to keep them "logical" was always rather quixotic.

There should be new builtins which cast between the little- and big-endian variants of the same sizes. The blob casts could do that, too, but the design already has direct coercions everywhere they make sense, i.e. between same-sized ints and nats. And, actually, between every pair of different-sized ones, too. :) In most of these cases that's because sign extension is not the same operation as a blob cast... but really it only makes sense to have these; they map directly to hardware capabilities.
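
(A tiny illustration of that distinction, with hypothetical names: widening an int8 replicates the sign bit, while widening a nat8 zero-extends, and both are single instructions on common hardware, e.g. movsx vs. movzx on x86.)

    #include <stdint.h>

    int16_t widen_int8(int8_t x)  { return (int16_t)x; }  /* byte 0xFF -> -1 (sign-extend)  */
    int16_t widen_nat8(uint8_t x) { return (int16_t)x; }  /* byte 0xFF -> 255 (zero-extend) */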

There's one last question that comes up. Some languages, such as C, have type synonyms for the "native" endianness. I'm reasonably clear that this would be counterproductive to have in Modern Data, since it would introduce a context that makes a schema's meaning vary depending on where it's used. I'm against adding that right now.

tinyplasticgreyknight commented 9 years ago

The casting builtins are already extremely numerous; is there a way to maybe parameterise these? Perhaps something like:

apply:
  builtin cast_numeric:
    nat16
    big_endian
    int32
    little_endian
  $value

or

apply:
  builtin cast_numeric:
    nat16_big_endian
    int32_little_endian
  $value

I agree that platform-sized and platform-endian integers would be counterproductive. That sort of thing never seems to cause anything but trouble in my experience.

IreneKnapp commented 9 years ago

Well, yes, they could be parametrized, but I'm not sure that's desirable; it adds a layer of complexity to the implementation, regardless of whether the encoding does something clever with bits or adds additional bytes after. Granted, the implementation is going to be complicated no matter what, and I'm not sure my hope that people would feel able to make independent implementations is realistic...

tinyplasticgreyknight commented 9 years ago

I think it is likely that implementations would want to have some sort of internal abstraction for dealing with all the different cast combinations, even if it's not parameterised in the actual specification. :-)

EDIT: hmm, did you mean that only e.g. cast_int8_le_be would be supported, and not the completely general approach of e.g. cast_nat8_le_int16_be? I guess that cuts down on the number of new casts quite a bit. There are still a very large number of casts even as things stand, though.

IreneKnapp commented 9 years ago

Oh - hmm. I hadn't thought of doing completely general casts.

Yeah, there's a large number of them. It's not great. Builtin identifiers are pulled from a sixteen-bit space, which is mostly empty, so it's doable, but...

IreneKnapp commented 9 years ago

Hmm... So, what might a parametrized approach look like? Exposing each of those sub-parts as its own byte is not great, but bitfields pulled out of the identifier would work. I'm still a little averse to it because it means there's not a 1:1 mapping between lexical words in the text format and stream events... but there already isn't, now that I think about it, because any builtin is already one stream event but two lexemes, and there are a few similar cases like the numerical parameters to let and type_family.

So nat vs. int is one bit, and 8/16/32/64 is two bits. Endianness is one bit. A fully general cast would have two sets of those three specifiers, so eight bits in all, leaving eight bits to be the prefix that means "this is a cast".

Also - wow - that means there are 16*15 = 240 possible integer casts (the identity ones can be excluded). Even in the somewhat arbitrarily chosen less-general case of allowing a cast to convert either (nat/int, size) or endianness, but not both, there'd be 8*7*2 + 2*1*8 = 128. Thank you for getting me to pay attention to this; I agree that an abstraction is appropriate.

Additionally, there are non-integer types that have to be accounted for: name, blob, utf8, float32, float64. Not all pairs are possible, but many are, so let's allocate five bits for specifying each operand type in a cast.

It might be nice if there were consistency with the node-type codes; hm. ... Without renumbering all the nodes, I don't see an easy way to do that, so never mind.

The numbers currently allocated for casts are already up in the stratosphere, starting at 10240, which is 0x2800. Let casts be the builtins where (id & 0xFC00) == 0x2800, thereby leaving ten bits for use. Just in case we have to reuse this pattern elsewhere, let the low five bits be the input-operand type, since all the builtins have an input type but only casts have a fully-general output-operand type, so the decoding might look cleaner that way. Then that leaves the five bits near the middle to be the output-operand type.

That means the whole thing looks like:

0010 10oo oooi iiii
  i -> input operand type, per below
  o -> output operand type, per below

And let the operand types be specified as:

0esxx
  e  0 -> little-endian
     1 -> big-endian
  s  0 -> nat
     1 -> int
  xx 00 -> 8-bit
     01 -> 16-bit
     10 -> 32-bit
     11 -> 64-bit
10000 -> name
10001 -> blob
10010 -> utf8
10011 -> float32
10100 -> float64

And it's disallowed for e to be 1 while xx is 00.
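
(To sanity-check the layout, here's a hypothetical decoder for it in C; the names are illustrative, not from the actual library.)

    #include <stdint.h>
    #include <stdbool.h>

    #define CAST_PREFIX_MASK 0xFC00u   /* top six bits        */
    #define CAST_PREFIX      0x2800u   /* 0010 10.. .... .... */

    typedef struct {
        bool     is_integer;   /* true for the 0esxx encodings            */
        bool     big_endian;   /* e bit (integers only)                   */
        bool     is_signed;    /* s bit: 0 = nat, 1 = int (integers only) */
        unsigned size_bits;    /* 8, 16, 32, or 64 (integers only)        */
        unsigned other;        /* 0x10..0x14: name, blob, utf8, float32,
                                  float64 (non-integers only)             */
    } operand_type;

    static operand_type decode_operand(unsigned bits5)
    {
        operand_type t = {0};
        if (bits5 & 0x10) {                  /* high bit set: non-integer */
            t.other = bits5;
        } else {                             /* 0esxx integer encoding    */
            t.is_integer = true;
            t.big_endian = (bits5 >> 3) & 1;
            t.is_signed  = (bits5 >> 2) & 1;
            t.size_bits  = 8u << (bits5 & 3);
        }
        return t;
    }

    static bool decode_cast(uint16_t id, operand_type *in, operand_type *out)
    {
        if ((id & CAST_PREFIX_MASK) != CAST_PREFIX)
            return false;                    /* not a cast builtin        */
        *in  = decode_operand(id & 0x1F);          /* low five bits       */
        *out = decode_operand((id >> 5) & 0x1F);   /* middle five bits    */
        /* e = 1 with xx = 00 (a big-endian 8-bit type) is disallowed. */
        if (in->is_integer && in->big_endian && in->size_bits == 8)
            return false;
        if (out->is_integer && out->big_endian && out->size_bits == 8)
            return false;
        return true;
    }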

Feedback? :)

IreneKnapp commented 9 years ago

This looks like a good way to do it. Marking this "pending implementation". :)