Arnau478 commented 4 months ago

This is a long-term idea that won't be implemented until v2.0.0. The idea is to ship parsers as separate files that the user can download and manage, rather than building them in. It's a nice way to offer a large variety of parsers without making the binary huge. There are a few ways to do it:

Plain text files which define a format/parser in a custom domain-specific language (DSL). Similar to ImHex's approach.
Shared libraries that are dynamically loaded
WASM modules

Comparison

Approach	Speed	Ease of use	Portability	Impact on the main executable
DSL	The slowest	Extremely easy to create new parsers	Architecture and OS-agnostic	The executable would have to include a whole interpreter for the language
Shared library	The fastest	Very hard to create or distribute parsers	The compatibility matrix is huge	Almost no overhead on the main executable
WASM	Pretty slow, but not as slow as the DSL	Very hard to create or distribute parsers	Architecture and OS-agnostic	The executable would have to include an entire WASM VM

Back-compatibility

Either way, this would break how parsers work. That could be an argument to put it in the v1.0.0 release. However, this would take a lot of time and would postpone the release even more (I've wanted to release it for a while now).

Arnau478 commented 1 month ago

I would appreciate help on where to put dynamically-loaded files on other OSes. For Linux I'm pretty sure it would be ~/.local/share, but I have no idea where those would go on other platforms.

GasInfinity commented 1 month ago

To add my two cents:

IMO, the DSL would be the best pick because if done correctly, it shouldn't have a major impact on the speed of the whole program. Maybe the DSL bytecode could be cached?

About the WASM interpreter: it could work, but the binary size would grow quite a lot, and if it's not JIT-compiled or well optimized, I don't think it would be as fast as a custom DSL.

PS: Shared libraries as parsers would be very fast but making them would be a hassle. Everything would need to be in the C ABI and if you have to make a change to the ABI, all parsers would need to be recompiled.

2nd PS: Maybe you could store parsers on Windows in a subdirectory inside AppData\Roaming?

Arnau478 commented 1 month ago

> IMO, the DSL would be the best pick because if done correctly, it shouldn't have a major impact on the speed of the whole program. Maybe the DSL bytecode could be cached?

I have to admit I lean towards the DSL. It does seem like the best option.

Right now my main concern is whether to make it an imperative language (i.e. you write code that describes how to parse a file format, similar to the current approach) or a descriptive one (i.e. defining the structure of the file format, and letting hevi parse it following it as a "guide").

About the caching thing: caching the bytecode would be pretty important in case the imperative route is taken. And, if it ends up being a descriptive one, it's not really an issue, as there would be no need for a bytecode IR.
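To make the caching idea concrete, here is a minimal sketch in Python (used purely for illustration). It assumes a content-addressed cache: the key is a hash of the parser source, so editing a parser automatically invalidates its entry. `compile_dsl` is a hypothetical stand-in for whatever the real compiler would be, and the on-disk format (JSON) is likewise just an assumption.

```python
import hashlib
import json
from pathlib import Path


def compile_dsl(source):
    # Hypothetical stand-in "compiler": it only splits the source into tokens.
    # A real implementation would emit bytecode here.
    return source.split()


def load_compiled(source, cache_dir):
    """Return (compiled, cache_hit) for `source`, reusing a cached copy if present."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Keying on the content hash means an edited parser never hits a stale entry.
    key = hashlib.sha256(source.encode("utf-8")).hexdigest()
    entry = cache_dir / (key + ".json")
    if entry.exists():
        return json.loads(entry.read_text(encoding="utf-8")), True
    compiled = compile_dsl(source)
    entry.write_text(json.dumps(compiled), encoding="utf-8")
    return compiled, False
```

The same scheme would work for caching a tokenized descriptive DSL, since only the payload changes.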

> About the WASM interpreter: it could work, but the binary size would grow quite a lot, and if it's not JIT-compiled or well optimized, I don't think it would be as fast as a custom DSL.

100% right. Plus, WASM APIs are usually pretty cumbersome. Basically, it has the disadvantages of all other options, while making the executable significantly bigger.

> PS: Shared libraries as parsers would be very fast but making them would be a hassle. Everything would need to be in the C ABI and if you have to make a change to the ABI, all parsers would need to be recompiled.

I also don't like the fact that they cannot be easily sandboxed...

> 2nd PS: Maybe you could store parsers on Windows in a subdirectory inside AppData\Roaming?

That seems pretty reasonable. Thanks!
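Putting the two suggestions together, a lookup could be sketched like this (Python, for illustration only). The `hevi/parsers` suffix is a hypothetical layout, and the macOS branch is an addition not discussed in this thread; it uses the conventional `~/Library/Application Support` location. On Linux the sketch honours `$XDG_DATA_HOME` and falls back to `~/.local/share`, as the XDG Base Directory spec describes.

```python
import os
import sys
from pathlib import Path


def parser_dir(platform=sys.platform, env=os.environ, home=None):
    """Return the per-user directory where parser files might live.

    The "hevi/parsers" suffix is a hypothetical layout, not something hevi defines.
    """
    home = Path(home) if home is not None else Path.home()
    if platform.startswith("win"):
        # %APPDATA% normally points at AppData\Roaming.
        base = Path(env.get("APPDATA", home / "AppData" / "Roaming"))
    elif platform == "darwin":
        # Conventional per-user data location on macOS (not from this thread).
        base = home / "Library" / "Application Support"
    else:
        # XDG Base Directory default for Linux and other Unix-likes.
        base = Path(env.get("XDG_DATA_HOME") or home / ".local" / "share")
    return base / "hevi" / "parsers"
```

The `platform`, `env`, and `home` parameters exist only so that the function can be exercised without touching the real environment.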

GasInfinity commented 1 month ago

A descriptive DSL would be the perfect solution; at least, that is what I want to say.

However, it depends greatly on how much of the file you want to highlight. Should it be simple, where hevi only parses the headers and some checksums (and even something as simple as that could still be a challenge), or very specific, like highlighting the chunks of a .png image or the sections and pools of a .class file?

Unfortunately, file formats are very different and complex on their own and developing a description that works for all of them would be almost impossible.

If just simple highlighting is needed, then maybe it could work.

> And, if it ends up being a descriptive one, it's not really an issue, as there would be no need for a bytecode IR.

I think that even caching the tokenized DSL could be useful if speed is important.

Arnau478 commented 1 month ago

I was thinking of something similar to the following:

@root = struct {
    magic: 16 = 0xFF00,
    8,
    version: 8,
}

where you assign to an identifier to create a construct (with the special identifier @root being the "entry point"). The most basic construct would be a "struct" (not to be confused with C, Zig, etc. struct) which is a contiguous non-padded region of memory. It would then contain "fields" in the form (<name>:)? <construct> (= <value>)?. Thus, magic: 16 = 0xFF00 would mean "16 bits of memory containing 0xFF00" (8, 16, etc. would be elementary constructs). Similarly, 8 would mean "8 bits of memory that can contain anything". The fact that it doesn't have a name could be used by the highlighter to infer the "importance" of a field (i.e. padding wouldn't have a name and thus would be dimmer).
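As a rough sketch of how such a description could drive the highlighter (Python, for illustration only), the struct above can be written as a list of `(name, bits, expected)` tuples and matched against the file bytes. The tuple encoding, the byte-aligned field sizes, and the big-endian reading of `0xFF00` are all assumptions made for this example.

```python
def parse_struct(fields, data):
    """Match `data` against `fields`, a list of (name, bits, expected) tuples.

    `name` and `expected` may be None. Returns a list of
    (name, offset, size, value) spans for a highlighter, or raises ValueError.
    Assumes byte-aligned fields and big-endian values (both simplifications).
    """
    spans, offset = [], 0
    for name, bits, expected in fields:
        size = bits // 8
        chunk = data[offset:offset + size]
        if len(chunk) < size:
            raise ValueError("unexpected end of file at offset %d" % offset)
        value = int.from_bytes(chunk, "big")
        if expected is not None and value != expected:
            raise ValueError("field %r: expected %#x, got %#x" % (name, expected, value))
        spans.append((name, offset, size, value))
        offset += size
    return spans


# The example description from above, written as data.
ROOT = [("magic", 16, 0xFF00), (None, 8, None), ("version", 8, None)]
```

Spans whose name is `None` are the anonymous fields, so a highlighter could render them dimmer.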

My main concerns are:

Non-tree formats

Most file formats form a tree-like structure of slices of memory, with every child slice contained in its parent slice. Formats like ELF don't. There's a header that is tree-shaped, but the actual contents are located via an offset and size stored in the header. I don't know how to express this properly. One idea: when foo: *Foo is found, read an address from that field and parse a construct Foo at that address.
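A toy version of the pointer idea, sketched in Python for illustration. It assumes the pointer is a 32-bit little-endian absolute file offset and that `Foo` is a fixed 4-byte blob; neither is decided anywhere in this thread.

```python
import struct


def parse_with_pointer(data):
    """Parse a toy format: a u32 little-endian pointer to a 4-byte Foo.

    Returns the spans (name, offset, size) visited, which need not be contiguous.
    """
    if len(data) < 4:
        raise ValueError("file too short for the header")
    (foo_offset,) = struct.unpack_from("<I", data, 0)
    # The pointed-to construct must lie entirely inside the file.
    if foo_offset + 4 > len(data):
        raise ValueError("pointer %#x is out of bounds" % foo_offset)
    return [("foo_ptr", 0, 4), ("Foo", foo_offset, 4)]
```

Note that the bytes between the header and `Foo` are never visited, which is exactly what makes a whole-file EOF check harder once pointers exist.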

EOF assertion

If using the pointer approach, we can no longer assert for EOF (at least not as easily). When you do @root = 8 and pass a 2-byte file, it probably isn't supposed to succeed. But once pointers are in place, that second byte could be a legitimate part of the file, reached through a pointer. Also, some file formats could allow padding at the end. I don't know how to integrate this into the syntax in a pleasant way, but since it is just "added strictness", I don't think it's essential to have in the initial implementation. Fun fact: iirc none of the parsers check that right now.

Variable-size constructs

What about a file with a size field and some data structure that spans "size bytes"? Should it be something like [size]Foo? But then, the construct [size]Foo would only be valid directly inside another construct that has the field size.

Data interpretation

When using variable-sized constructs that depend on data of the file, or when conditionally parsing depending on some data, data interpretation becomes relevant. Maybe we should use u8 and i8 instead of simply 8 for elementary constructs? What about endianness? I'll have to think quite a bit about this...

GasInfinity commented 4 weeks ago

Looks great! The things I was most concerned about were variable-sized constructs and non-tree formats.

For example, numerous image formats like BMP and PNG define their headers in an unconventional way, so something like a conditional type would be needed.

> When you do @root = 8 and pass a 2-byte file, it probably isn't supposed to succeed.

Even with the pointer approach, EOF assertions could still be enabled for descriptions that don't contain any pointers.
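A minimal sketch of that rule (Python, for illustration). The `strict` flag is hypothetical: it would be set by the loader only when the description contains no pointers.

```python
def check_eof(spans, file_size, strict):
    """Return True if the parsed spans account for the whole file.

    `spans` is a list of (offset, size) pairs. When `strict` is False
    (for example because the description uses pointers) the check is skipped.
    """
    if not strict:
        return True
    # Without pointers the spans are contiguous, so the furthest end is the total.
    end = max((offset + size for offset, size in spans), default=0)
    return end == file_size
```

With `@root = 8` the only span is `(0, 1)`, so a 2-byte file fails the strict check while a 1-byte file passes.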

Variable-size constructs

This is a must. What you described should be what it does, and it would be even better if simple expression evaluation were implemented (e.g. [width*height*channels]b8). Something that means "spans until EOF" could also be useful (like ...).
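To illustrate what "simple expression evaluation" could mean, here is a sketch in Python that evaluates a size expression against fields that have already been parsed. It borrows Python's own expression parser via the `ast` module purely for convenience; a real implementation would have its own grammar. The restriction to integer literals, field names, and `+ - *` is an assumption.

```python
import ast
import operator

_OPS = {ast.Add: operator.add, ast.Sub: operator.sub, ast.Mult: operator.mul}


def eval_size(expr, fields):
    """Evaluate a size expression such as "width*height*channels".

    Only integer literals, names of already-parsed fields, and + - * are allowed.
    """
    def walk(node):
        if isinstance(node, ast.Constant) and type(node.value) is int:
            return node.value
        if isinstance(node, ast.Name):
            if node.id not in fields:
                raise ValueError("unknown field %r" % node.id)
            return fields[node.id]
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        raise ValueError("unsupported expression")

    return walk(ast.parse(expr, mode="eval").body)
```

Rejecting unknown names also addresses the scoping concern raised earlier: `[size]Foo` only evaluates where a `size` field is actually in scope.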

Data interpretation

IMHO, it's better to have a prefix on elementary constructs that conveys their meaning, like b8 (binary blob of 8 bits), u8 (unsigned 8-bit number) and i8 (signed 8-bit number).

About endianness, having suffixes for them like b or l could be considered. If not using any kind of suffix, host endianness is assumed.

So, for example, based on what you've said and what I've mentioned, this could be a description for a QOI image parser (seeing that one is already implemented in Zig):

@root = struct {
    magic: b32 = "qoif",
    width: u32b, // Big endian
    height: u32b, // Big endian
    channels: u8,
    colorspace: u8,
    encoded_image: ... // Anything else to say "spans until EOF"?
}

Am I correct?
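For comparison, this is roughly what that description would have to do when interpreted, written out by hand in Python (for illustration only). The 14-byte header layout follows the description above; the dictionary keys simply mirror its field names.

```python
import struct


def parse_qoi_header(data):
    """Split a QOI file into header fields and the remaining encoded bytes."""
    if len(data) < 14:
        raise ValueError("file too short for a QOI header")
    # ">" = big-endian: 4-byte magic, u32 width, u32 height, u8 channels, u8 colorspace.
    magic, width, height, channels, colorspace = struct.unpack_from(">4sIIBB", data, 0)
    if magic != b"qoif":
        raise ValueError("bad magic")
    return {
        "width": width,
        "height": height,
        "channels": channels,
        "colorspace": colorspace,
        "encoded_image": data[14:],  # the "..." field: everything up to EOF
    }
```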

Arnau478 commented 4 weeks ago

I really like your suggestions, especially the ... thing :+1:

> About endianness, having suffixes for them like b or l could be considered. If not using any kind of suffix, host endianness is assumed.

Instead of assuming host endianness (which is not something that really appears in file formats afaik, and would just make parsers misbehave on some architectures), there could be a global "property" (e.g. #endian: little or something like that). It also plays nicely with the "everything is X-endian unless otherwise specified" rule that appears in many specifications.
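A sketch of how the two proposals could combine (Python, for illustration): a per-field `b`/`l` suffix overrides a global default, and the host's byte order is never consulted. The token grammar (`b`/`u`/`i` prefix, bit width, optional suffix) and the `default_endian` parameter standing in for `#endian` are assumptions based on the suggestions above.

```python
import re

_TOKEN = re.compile(r"^([bui])(\d+)([bl]?)$")


def parse_type(token, default_endian):
    """Split a token such as "u32b" into (kind, bits, endianness).

    A trailing "b"/"l" overrides `default_endian` ("big" or "little"),
    which stands in for a global "#endian" property.
    """
    m = _TOKEN.match(token)
    if m is None:
        raise ValueError("bad type token %r" % token)
    kind, bits, suffix = m.groups()
    endian = {"b": "big", "l": "little", "": default_endian}[suffix]
    return kind, int(bits), endian


def decode(token, raw, default_endian="little"):
    """Interpret the bytes `raw` according to `token`."""
    kind, bits, endian = parse_type(token, default_endian)
    if len(raw) * 8 != bits:
        raise ValueError("expected %d bits" % bits)
    if kind == "b":
        return raw  # blobs are left uninterpreted
    return int.from_bytes(raw, endian, signed=(kind == "i"))
```

Because the default is an explicit argument, the same description decodes identically on every architecture.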

Arnau478 / hevi

Dynamically load parsers #44
