PMunch / binaryparse

Binary parser for Nim
MIT License
71 stars 6 forks source link

Endian control #8

Closed sealmove closed 3 years ago

sealmove commented 3 years ago

With this feature the user can control endianness with l/ b.

PMunch commented 3 years ago

Looks good, but we should check and verify that only fields with a full set of bytes are possible to attach the be/le rule to. For example a leu4 doesn't make any sense as little-endian refers to the byte-endianess and u4 is only 4 bits long. So be and le should only be available when size mod 8 == 0.

sealmove commented 3 years ago

Yes, it's an issue to address. What you are saying is not entirely true though. You can have the concept of endianness within a byte. In fact tools like KS allow you to specify both endianness and bit-endianness separately. For example you could parse 14 bits in either of the following orders:

1  2  3  4  5  6  7  8     9 10 11 12 13 14  -  -
or
8  7  6  5  4  3  2  1     -  - 14 13 12 11 10  9

The remaining bits (for either case) are discarded in this case (for either senario) to keep alignment, unless the next (consecutive) field is also a bit-field (which means not a multiple of 8).

PMunch commented 3 years ago

Yes, you can, but this doesn't do bit-endianess it does byte-endianess and should therefore not work when it doesn't apply. If someone wants to add that in the future then an additional thing can be added, but for now it should verify that size mod 8 == 0.

sealmove commented 3 years ago

Agreed. Just wanted bit-endianness to be mentioned here for future reference. I'll implement the checking, keep the PR open.

sealmove commented 3 years ago

Before this makes it into the syntax, we should discuss the alternative of postfixing instead of prefixing. Ex. u8le instead of leu8. I think postfix is the more natural syntax. The reason I chose prefix initially is because it was easier to implement(le/be are expected always at slice 0..1 of the identifier string). Anyhow since we are extending the syntax it's worth the debate. For example how would the syntax look like if bit endianness is also implemented? Have to be smart now so that the syntax is kept extensible and not overly complicated.

PMunch commented 3 years ago

I disagree with it being more natural. I read leu8 as "little-endian unsigned 8-bit word" while u8le reads "unsigned 8-bit word, little-endian". Since it defaults to big endian it would also be possible to use lu8 and u8 for the two versions (and using only one letter, as all the other symbols are single-letter). Unless of course we want to (and I think we might) set the endianess as the default for an entire block, in that case we would need both b and l modifiers. Initially I was also thinking of using < and >, but those might be confusing, and possibly worse to implement. Definitely a discussion to be had, these are just the ideas I've had, curious what you think.

sealmove commented 3 years ago

I like your idea of keeping all symbols single-char. This way in the future we could even allow them to be ordered arbitrarily (ex. ul and lu could both be valid and have the same effect), just like unix short options on the command line. Now I am thinking of such a syntax where we have many different single-char specifiers before the size.

Using prefix operators like </> is equally easy to implement since we can check if node is nnkPrefix, unwrap it, specify endianness (or whatever information we want to convey with the operator identifier), and then continue processing the inner node. It's a matter of whether we want to have 2 separate interfaces, this is: first process operator identifier (which can also have many single-char symbols in arbitrary order), and then the process the operant identifier (which is single-letters followed by a number).

Thoughts/Questions:

  1. What could the letters for little/big bit endianness be?
  2. How would you go about specifying default endianness for an entire block?
mhessler97 commented 3 years ago

Are there instances where the endianess will change within a single binary file? I was thinking it would just need to be an optional input parameter to the to the createParser macro

createParser(simple, littleEndian):
    u8: unsignedeightbitinteger

with littleEndian being the enum from the systems module

sealmove commented 3 years ago

Are there instances where the endianess will change within a single binary file?

Sure.

I was thinking it would just need to be an optional input parameter to the to the createParser macro

This mixes up things though because these extra parameters are currently only used to provide user with fields defined in an outer parser. It might work (or might not) if it is a named parameter

createParser(simple, endian = littleEndian):
    u8: unsignedeightbitinteger

EDIT: It does work as mhessler97 shows because a single identifier is used for endianness, while other fields are colon expressions. But what if we keep adding options? I think the option should have a name and a value. Since colon expression is used for fields, I think we should use assignment syntax.

PMunch commented 3 years ago

I agree that you should be able to set the default endianes for a parser. Not quite sure what you mean @sealmove.

sealmove commented 3 years ago

Shall we merge this?

Not quite sure what you mean @sealmove.

Options should have names and values rather than being identifiers (endian = littleEndian instead of littleEndian).

createParser(simple, endian: littleEndian)

Or

createParser(simple, endian = littleEndian)

Second one is preferable because it differs from passed-down-field argument syntax (field: type)

PMunch commented 3 years ago

I'll have to look over your most recent changes, but this looks good so far. We can add the default later. Expect this to be merged very soon.