digama0 / mm0

Metamath Zero specification language
Creative Commons Zero v1.0 Universal
315 stars 40 forks source link

mmb ? #17

Closed Lakedaemon closed 3 years ago

Lakedaemon commented 4 years ago

Hello. Is there a feature complete description of the mmb binary format available somewhere ?

digama0 commented 4 years ago

The best documentation at the moment is mm0-c. I'm not really sure how to present binary formats except using structs in the C style. types.c has all the main structures in the file.

c-cube commented 4 years ago

I think it'd be a lot cleaner, simpler, and future proof to use msgpack or something similar to avoid depending on (non-standard? compiler-dependent?) C struct layout.

Parse, don't validate, especially in C. Msgpack is compact, portable, endianess-agnostic, and has implementations in most languages.

digama0 commented 4 years ago

I'm not using C struct semantics for the definition, but I don't know a better way to present binary formats. I really just want to say something like "32 bits of data for this, then 8 bits of that, then 8 bits of this other thing, etc". I'm using "packed" structs to try to control the C compiler so it does what I want, and assert to make sure it's working, but the actual definition is just a bunch of fields of defined sizes. The definition of the format itself is not platform or compiler dependent.

digama0 commented 4 years ago

I am not using msgpack for two reasons: (1) I can get better performance with a bespoke format (2) I don't have to verify a format that does more than necessary.

c-cube commented 4 years ago

But then you still have to specify endianess. And are you sure your encoding is actually more efficient? :-)

digama0 commented 4 years ago

Yep. Little endian.

digama0 commented 4 years ago

It may not be more efficient, but it is not difficult to incorporate the best parts of the format if there are things that can be done differently. Really, the format has more in common with an instruction set architecture than a binary serialization format. In particular, you can't seek in a serialization format.

c-cube commented 4 years ago

I know I won't convince you, but I think it's cleaner to parse everything into memory and then seek all you want. As it is I think it'll be very hard to write a checker for mm0 in anything but {C,C++,rust}.

digama0 commented 4 years ago

For other languages you probably won't be able to do the in place thing, sure, but you can still load large byte arrays in most languages, and if you don't care about efficiency you can just do streaming binary parsing like you would for msgpack.

digama0 commented 4 years ago

There is already an MMB writer in haskell, and a reader is planned. Yes it's probably going to be imperative looking, but that's how computers work.

c-cube commented 4 years ago

Bit fiddling is tricky and a big source of bugs, in my opinion. That's why I'd avoid requiring each checker to redo a custom binary parser. Do you validate input at all in C? If you do, it should not be much slower to parse, as you have traverse the data once anyway.

digama0 commented 4 years ago

Note that the format itself is optimizing not for {C,C++,rust} exactly, but rather the underlying assembly program. Anything that makes that program faster or easier to write. So of course there is going to be a lot of fixed length integers and direct indexing, because that's what computers do best.

The entire verifier is nothing but a big validation routine. It doesn't "do" anything, it just checks that a complicated predicate holds of the input byte array.

The reason I want seekability is because besides the main streaming part there is a big table of theorems that needs to be accessible at all times. The verifier has very little state, but it's validating the theorem table at the same time as the proof stream. You can copy the data into your own tables if you want, but that's more complicated for the assembly program because you have to do memory allocation.

The MMB format itself is optimized for the mm0-c verifier (or its assembly counterpart). It doesn't have to be the interchange format, although it's the one that is going to have theorems proven about it.

digama0 commented 4 years ago

Bit fiddling is tricky and a big source of bugs, in my opinion. That's why I'd avoid requiring each checker to redo a custom binary parser.

Indeed, that's why I'm not leaving things to chance - everything about the format and the program dealing with it will be verified.

c-cube commented 4 years ago

The MMB format itself is optimized for the mm0-c verifier (or its assembly counterpart). It doesn't have to be the interchange format, although it's the one that is going to have theorems proven about it.

I guess that's the part I was missing. It makes more sense to me then.

digama0 commented 4 years ago

In the mm0.md specification, it says that while the general semantics, the inductive specification of theorem proofs, is defined, the specific format (the bit layout) is implementation defined. MMB is the format that mm0-c uses.

Lakedaemon commented 4 years ago

what if you make mmu into the interchange format ?

Then, you can just say : mmu is a big utf8 string in network order and then, you are done for porting/interoperability (which is not the main concern of the mm0 project but a very nice benefit)

verifier should implement a mmb-like binary format for performance reasons (but if you implement exacly mmb, you get more verification goodness) and should implement mmu->binary format and binary format-> mmu conversion

Lakedaemon commented 4 years ago

I'll look at the structs in types.c, that should give me a good enough understanding of mmb. Thanks :)

digama0 commented 4 years ago

The mmu format is too inefficient for this purpose. It should be a binary format, and it should have most or all of the optimizations in the mmb format. Conceivably it could be cast into a standardized binary protocol like msgpack, which would at least make it easier for people to get over the parsing hurdle.

Lakedaemon commented 4 years ago

JSon is a text based solution and is pretty inefficient too. Still, it is one of the most ubiquitous solution for serialization/deserialization/exchange/glue format.

What are the best features of an exchange format ? I would put at the top of my list : 1) Adoption/popularity/ubiquity 2) future proof (binary sucks when someone decides to design hardware differently) 3) portability and simplicity (if it is easy to implement, it drives 1) 4) short files (with compression) because it is intended to be sent over the internet 5) expressivity capabilities (rosetta stone effect, needed for 1)

c-cube commented 4 years ago

Msgpack is, in essence, a binary json. It checks all the items on your list :)

Lakedaemon commented 4 years ago

So, it'll be fine by me :)

digama0 commented 3 years ago

Fixed in 59054ae