Closed Lakedaemon closed 3 years ago
The best documentation at the moment is mm0-c
. I'm not really sure how to present binary formats except using structs in the C style. types.c has all the main structures in the file.
I think it'd be a lot cleaner, simpler, and future proof to use msgpack or something similar to avoid depending on (non-standard? compiler-dependent?) C struct layout.
Parse, don't validate, especially in C. Msgpack is compact, portable, endianess-agnostic, and has implementations in most languages.
I'm not using C struct semantics for the definition, but I don't know a better way to present binary formats. I really just want to say something like "32 bits of data for this, then 8 bits of that, then 8 bits of this other thing, etc". I'm using "packed" structs to try to control the C compiler so it does what I want, and assert to make sure it's working, but the actual definition is just a bunch of fields of defined sizes. The definition of the format itself is not platform or compiler dependent.
I am not using msgpack for two reasons: (1) I can get better performance with a bespoke format (2) I don't have to verify a format that does more than necessary.
But then you still have to specify endianess. And are you sure your encoding is actually more efficient? :-)
Yep. Little endian.
It may not be more efficient, but it is not difficult to incorporate the best parts of the format if there are things that can be done differently. Really, the format has more in common with an instruction set architecture than a binary serialization format. In particular, you can't seek in a serialization format.
I know I won't convince you, but I think it's cleaner to parse everything into memory and then seek all you want. As it is I think it'll be very hard to write a checker for mm0 in anything but {C,C++,rust}.
For other languages you probably won't be able to do the in place thing, sure, but you can still load large byte arrays in most languages, and if you don't care about efficiency you can just do streaming binary parsing like you would for msgpack.
There is already an MMB writer in haskell, and a reader is planned. Yes it's probably going to be imperative looking, but that's how computers work.
Bit fiddling is tricky and a big source of bugs, in my opinion. That's why I'd avoid requiring each checker to redo a custom binary parser. Do you validate input at all in C? If you do, it should not be much slower to parse, as you have traverse the data once anyway.
Note that the format itself is optimizing not for {C,C++,rust} exactly, but rather the underlying assembly program. Anything that makes that program faster or easier to write. So of course there is going to be a lot of fixed length integers and direct indexing, because that's what computers do best.
The entire verifier is nothing but a big validation routine. It doesn't "do" anything, it just checks that a complicated predicate holds of the input byte array.
The reason I want seekability is because besides the main streaming part there is a big table of theorems that needs to be accessible at all times. The verifier has very little state, but it's validating the theorem table at the same time as the proof stream. You can copy the data into your own tables if you want, but that's more complicated for the assembly program because you have to do memory allocation.
The MMB format itself is optimized for the mm0-c verifier (or its assembly counterpart). It doesn't have to be the interchange format, although it's the one that is going to have theorems proven about it.
Bit fiddling is tricky and a big source of bugs, in my opinion. That's why I'd avoid requiring each checker to redo a custom binary parser.
Indeed, that's why I'm not leaving things to chance - everything about the format and the program dealing with it will be verified.
The MMB format itself is optimized for the mm0-c verifier (or its assembly counterpart). It doesn't have to be the interchange format, although it's the one that is going to have theorems proven about it.
I guess that's the part I was missing. It makes more sense to me then.
In the mm0.md specification, it says that while the general semantics, the inductive specification of theorem proofs, is defined, the specific format (the bit layout) is implementation defined. MMB is the format that mm0-c
uses.
what if you make mmu into the interchange format ?
Then, you can just say : mmu is a big utf8 string in network order and then, you are done for porting/interoperability (which is not the main concern of the mm0 project but a very nice benefit)
verifier should implement a mmb-like binary format for performance reasons (but if you implement exacly mmb, you get more verification goodness) and should implement mmu->binary format and binary format-> mmu conversion
I'll look at the structs in types.c, that should give me a good enough understanding of mmb. Thanks :)
The mmu format is too inefficient for this purpose. It should be a binary format, and it should have most or all of the optimizations in the mmb format. Conceivably it could be cast into a standardized binary protocol like msgpack, which would at least make it easier for people to get over the parsing hurdle.
JSon is a text based solution and is pretty inefficient too. Still, it is one of the most ubiquitous solution for serialization/deserialization/exchange/glue format.
What are the best features of an exchange format ? I would put at the top of my list : 1) Adoption/popularity/ubiquity 2) future proof (binary sucks when someone decides to design hardware differently) 3) portability and simplicity (if it is easy to implement, it drives 1) 4) short files (with compression) because it is intended to be sent over the internet 5) expressivity capabilities (rosetta stone effect, needed for 1)
Msgpack is, in essence, a binary json. It checks all the items on your list :)
So, it'll be fine by me :)
Fixed in 59054ae
Hello. Is there a feature complete description of the mmb binary format available somewhere ?