Open Arnau478 opened 4 months ago
I would appreciate help on where to put dynamically-loaded files on other OSes. For Linux I'm pretty sure it would be `local/share`, but I have no idea where those would go on other platforms.
To add my two cents:
IMO, the DSL would be the best pick because if done correctly, it shouldn't have a major impact on the speed of the whole program. Maybe the DSL bytecode could be cached?
About the WASM interpreter: it could work, but the binary size would grow quite a lot, and if it's not jitted or well optimized, I don't think it would be as fast as a custom DSL.
PS: Shared libraries as parsers would be very fast but making them would be a hassle. Everything would need to be in the C ABI and if you have to make a change to the ABI, all parsers would need to be recompiled.
2nd PS: Maybe you could store parsers on Windows in a subdirectory inside `AppData\Roaming`?
> IMO, the DSL would be the best pick because if done correctly, it shouldn't have a major impact on the speed of the whole program. Maybe the DSL bytecode could be cached?
I have to admit I lean towards the DSL. It does seem like the best option.
Right now my main concern is whether to make it an imperative language (i.e. you write code that describes how to parse a file format, similar to the current approach) or a descriptive one (i.e. you define the structure of the file format, and hevi parses the file using that definition as a "guide").
About the caching thing: caching the bytecode would be pretty important in case the imperative route is taken. And, if it ends up being a descriptive one, it's not really an issue, as there would be no need for a bytecode IR.
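To make the caching idea concrete, here is a minimal sketch that keys a compiled parser on the hash of its DSL source. Everything in it is an assumption for illustration: `compile_dsl` is a stand-in that only splits the source into tokens, and the on-disk layout (one JSON file per source hash) is hypothetical, not something hevi does.

```python
import hashlib
import json
from pathlib import Path


def compile_dsl(source):
    """Stand-in for a real compiler: just splits the source into tokens."""
    return source.split()


def load_compiled(source, cache_dir, compiler=compile_dsl):
    """Return the compiled form of `source`, reusing a cached copy if present."""
    # The cache key is the SHA-256 of the source, so any edit invalidates it.
    key = hashlib.sha256(source.encode("utf-8")).hexdigest()
    cache_file = Path(cache_dir) / f"{key}.json"
    if cache_file.exists():
        return json.loads(cache_file.read_text(encoding="utf-8"))
    compiled = compiler(source)
    cache_file.parent.mkdir(parents=True, exist_ok=True)
    cache_file.write_text(json.dumps(compiled), encoding="utf-8")
    return compiled
```

Hashing the source rather than comparing timestamps means a parser file that is re-downloaded unchanged still hits the cache.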
> About the WASM interpreter: it could work, but the binary size would grow quite a lot, and if it's not jitted or well optimized, I don't think it would be as fast as a custom DSL.
100% right. Plus, WASM APIs are usually pretty cumbersome. Basically, it has the disadvantages of all other options, while making the executable significantly bigger.
> PS: Shared libraries as parsers would be very fast but making them would be a hassle. Everything would need to be in the C ABI and if you have to make a change to the ABI, all parsers would need to be recompiled.
I also don't like the fact that they cannot be easily sandboxed...
> 2nd PS: Maybe you could store parsers on Windows in a subdirectory inside `AppData\Roaming`?
That seems pretty reasonable. Thanks!
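For reference, a minimal sketch of how the per-platform location could be resolved. The Linux branch follows the XDG Base Directory convention (`$XDG_DATA_HOME`, defaulting to `~/.local/share`), the Windows branch uses `%APPDATA%` (which normally points at `AppData\Roaming`), and the macOS branch uses `~/Library/Application Support`. The function name, the `hevi` directory name and the `parsers` subdirectory are assumptions made for the example.

```python
import os
import sys
from pathlib import Path


def parser_dir(app_name="hevi", platform=None, env=None, home=None):
    """Return a per-user directory where downloaded parsers could live."""
    platform = sys.platform if platform is None else platform
    env = os.environ if env is None else env
    home = Path.home() if home is None else Path(home)

    if platform.startswith("win"):
        # %APPDATA% normally points at ...\AppData\Roaming
        base = Path(env.get("APPDATA") or home / "AppData" / "Roaming")
    elif platform == "darwin":
        base = home / "Library" / "Application Support"
    else:
        # XDG Base Directory: $XDG_DATA_HOME, defaulting to ~/.local/share
        base = Path(env.get("XDG_DATA_HOME") or home / ".local" / "share")
    # Hypothetical layout: <data dir>/<app name>/parsers
    return base / app_name / "parsers"
```

The `platform`, `env` and `home` parameters only exist so the lookup can be exercised without touching the real environment.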
A descriptive DSL would be the perfect solution, at least that is what I want to say.
However, it depends greatly on how much of the file you want to highlight. Should it be simple, where hevi only parses the headers and some checksums (and even something as simple as that could still be a challenge), or very specific (like highlighting the chunks of a `.png` image or the sections and pools of a `.class` file)?
Unfortunately, file formats are very different and complex on their own and developing a description that works for all of them would be almost impossible.
If just simple highlighting is needed, then maybe it could work.
> And, if it ends up being a descriptive one, it's not really an issue, as there would be no need for a bytecode IR.
AFAIK, I think that even caching the tokenized DSL could be useful if speed is important.
I was thinking of something similar to the following:
    @root = struct {
        magic: 16 = 0xFF00,
        8,
        version: 8,
    }
where you assign to an identifier to create a construct (with the special identifier `@root` being the "entry point"). The most basic construct would be a "struct" (not to be confused with a C, Zig, etc. `struct`), which is a contiguous non-padded region of memory. It would then contain "fields" in the form `(<name>:)? <construct> (= <value>)?`. Thus, `magic: 16 = 0xFF00` would mean "16 bits of memory containing 0xFF00" (`8`, `16`, etc. would be elementary constructs). Similarly, `8` would mean "8 bits of memory that can contain anything". The fact that it doesn't have a name could be used by the highlighter to infer the "importance" of a field (i.e. padding wouldn't have a name and thus would be dimmer).
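As a rough illustration of how such a description could drive the highlighter, here is a minimal sketch that walks a list of fields over a byte buffer. It is not hevi's implementation. The field representation `(name, bits, expected)` is an assumption, only byte-aligned widths are handled, and values are compared as big-endian because the thread has not settled on a byte order.

```python
def parse_struct(fields, data):
    """Walk `fields` over `data`; each field is (name, bits, expected_or_None)."""
    regions = []
    offset = 0
    for name, bits, expected in fields:
        if bits % 8 != 0:
            raise ValueError("this sketch only handles byte-aligned fields")
        size = bits // 8
        chunk = data[offset:offset + size]
        if len(chunk) < size:
            raise ValueError(f"unexpected end of data at offset {offset}")
        value = int.from_bytes(chunk, "big")  # assumption: big-endian
        if expected is not None and value != expected:
            raise ValueError(
                f"field {name!r}: expected {expected:#x}, got {value:#x}"
            )
        # Unnamed fields (e.g. padding) are flagged so they can be drawn dimmer.
        regions.append({"name": name, "offset": offset, "size": size,
                        "value": value, "dim": name is None})
        offset += size
    return regions


# The @root example from above, written as plain data.
ROOT = [("magic", 16, 0xFF00), (None, 8, None), ("version", 8, None)]
```

Calling `parse_struct(ROOT, data)` yields one region per field, which is the information a highlighter needs: where the field is, how long it is, and whether it has a name.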
My main concerns are:
- Most file formats form a tree-like structure of slices of memory, with every child slice being contained in its respective parent slice. Things like ELF don't. There's a header that is of tree form, but the actual contents are found at an offset and size given in the header. I don't know how to express this properly. One idea is to, when `foo: *Foo` is found, parse a construct `Foo` at the address stored in that memory location.
- If using the pointer approach, we can no longer assert for EOF (at least not as easily). When you do `@root = 8` and pass a 2-byte file, it probably isn't supposed to succeed. But when pointers are in place, that second byte could be part of the file. Also, some file formats could allow padding at the end. I don't know how to integrate this in the syntax in a pleasant way, but since this is "added strictness", I don't think it's essential to have in the initial implementation. Fun fact: iirc none of the parsers check that right now.
- What about a file with a `size` field and some data structure that spans "`size` bytes"? Should it be something like `[size]Foo`? But then, the construct `[size]Foo` would only be valid directly inside another construct that has the field `size`.
- When using variable-sized constructs that depend on data of the file, or when conditionally parsing depending on some data, data interpretation becomes relevant. Maybe we should use `u8` and `i8` instead of simply `8` for elementary constructs? What about endianness? I'll have to think quite a bit about this...
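The pointer and EOF concerns can be seen in a toy example. The sketch below assumes a made-up layout (a header of two little-endian `u32` values, `offset` and `size`, pointing at a payload elsewhere in the file) and records which byte ranges were visited. Tracking visited ranges is just one possible way to recover an EOF-style strictness check once pointers exist; it is not something proposed in this thread.

```python
import struct


def parse_with_pointer(data):
    """Parse a toy layout: a header holding (offset, size) of a payload.

    Returns the payload plus the list of byte ranges that were visited.
    """
    header = struct.Struct("<II")  # assumption: two little-endian u32 fields
    if len(data) < header.size:
        raise ValueError("file too short for header")
    offset, size = header.unpack_from(data, 0)
    if offset + size > len(data):
        raise ValueError("pointer target lies outside the file")
    visited = [(0, header.size), (offset, offset + size)]
    return data[offset:offset + size], visited


def unvisited_bytes(visited, total):
    """Count bytes not covered by any visited (start, end) range."""
    covered = set()
    for start, end in visited:
        covered.update(range(start, end))
    return total - len(covered)
```

A strict mode could reject files where `unvisited_bytes` is non-zero, while a lenient mode could simply highlight the uncovered bytes as unknown.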
Looks great! The things I was most concerned about were variable-sized constructs and non-tree formats.
For example, numerous image formats like BMP and PNG define their headers in a non-conventional way, so something like a conditional type would be needed.
> When you do `@root = 8` and pass a 2-byte file, it probably isn't supposed to succeed.
With the pointer approach, EOF assertions could still be enabled whenever the description doesn't contain any pointers.
Variable-size constructs
This is a must. What you described should be what it does, and it would be even better if simple expression evaluation were implemented (e.g. `[width*height*channels]b8`). Maybe adding something that means "spans until EOF" could be useful (like `...`).
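A minimal sketch of what "simple expression evaluation" could look like, assuming the values of previously parsed fields are available in a dictionary. It borrows Python's own expression grammar through the `ast` module as a stand-in for whatever syntax the DSL ends up with, and only allows integer literals, field names and a few arithmetic operators.

```python
import ast
import operator

# Operators allowed in size expressions; everything else is rejected.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.FloorDiv: operator.floordiv,
}


def eval_size(expr, fields):
    """Evaluate a size expression such as "width*height*channels"."""
    def walk(node):
        if isinstance(node, ast.Constant) and type(node.value) is int:
            return node.value
        if isinstance(node, ast.Name):
            return fields[node.id]  # KeyError if the field was never parsed
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        raise ValueError("unsupported expression")

    return walk(ast.parse(expr, mode="eval").body)
```

Restricting the evaluator to a whitelist of node types keeps the "parsers are data, not code" property that makes a descriptive DSL easy to sandbox.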
Data interpretation
IMHO, it's better to have a prefix on elementary constructs that conveys their meaning, like `b8` (binary blob of 8 bits), `u8` (unsigned 8-bit number) and `i8` (signed 8-bit number).
About endianness, having suffixes for them like `b` or `l` could be considered. If not using any kind of suffix, host endianness is assumed.
So, for example, based on what you've said and what I mentioned, this could be a description for a QOI image parser (seeing that one is already implemented in Zig):
    @root = struct {
        magic: b32 = "qoif",
        width: u32b, // Big endian
        height: u32b, // Big endian
        channels: u8,
        colorspace: u8,
        encoded_image: ... // Anything else to say "spans until EOF"?
    }
Am I correct?
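For comparison, here is what that description would have to do when applied to a file, written as a short Python sketch with the standard `struct` module. The field names mirror the description above; the function name and the returned dictionary are assumptions made for the example.

```python
import struct

# ">" = big-endian, no padding: 4-byte magic, u32 width, u32 height,
# u8 channels, u8 colorspace.
QOI_HEADER = struct.Struct(">4sIIBB")


def parse_qoi_header(data):
    """Split a QOI file into its header fields and the encoded image bytes."""
    if len(data) < QOI_HEADER.size:
        raise ValueError("file too short for a QOI header")
    magic, width, height, channels, colorspace = QOI_HEADER.unpack_from(data, 0)
    if magic != b"qoif":
        raise ValueError("bad magic")
    return {
        "width": width,
        "height": height,
        "channels": channels,
        "colorspace": colorspace,
        # The equivalent of `...`: everything after the header, up to EOF.
        "encoded_image": data[QOI_HEADER.size:],
    }
```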
I really like your suggestions, especially the `...` thing :+1:
> About endianness, having suffixes for them like b or l could be considered. If not using any kind of suffix, host endianness is assumed.
Instead of assuming host endianness (which is not something that really appears in file formats afaik, and would just make parsers misbehave on some architectures), there could be a global "property" (e.g. `#endian: little` or something like that). It also plays nicely with the "everything is X-endian unless otherwise specified" thing that appears in many specifications.
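A minimal sketch of how a global default and per-field suffixes could interact, assuming type names of the form `u<bits>` with an optional `b` or `l` suffix as discussed above. The function name and the `default_endian` parameter (standing in for the `#endian` property) are hypothetical.

```python
def decode_uint(chunk, type_name, default_endian="little"):
    """Decode `chunk` as an unsigned integer described by e.g. "u32" or "u32b"."""
    if not type_name.startswith("u"):
        raise ValueError("this sketch only handles unsigned types")
    if type_name.endswith("b"):
        endian = "big"          # explicit suffix wins over the default
    elif type_name.endswith("l"):
        endian = "little"
    else:
        endian = default_endian  # the file-wide "#endian" property
    bits = int(type_name[1:].rstrip("bl"))
    if len(chunk) * 8 != bits:
        raise ValueError(f"{type_name} needs {bits // 8} bytes, got {len(chunk)}")
    return int.from_bytes(chunk, endian)
```

With this rule, a description never depends on the machine running the parser: every field is either explicitly suffixed or follows the single default declared by the file.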
This is a long-term idea that won't be implemented until version `v2.0.0`. The idea is to have the parsers be separate files that the user can download and manage, so the parsers wouldn't be built in. It's a very nice way to offer a large variety of parsers without making the binary huge. There are a few ways to do it:
Comparison
Back-compatibility
Either way, this would break how parsers work. That could be an argument to put it in the `v1.0.0` release. However, this would take a lot of time and would postpone the release even more (I've wanted to release it for a while now).