Containers, Modules, and wasm

m4b commented 9 years ago

Preliminaries

So this issue will touch on a few topics, and it's fairly long (apologies!), but essentially I'm hoping to tease out what I see as some tensions between container formats, wasm modules, and wasm itself.

Before I begin, I'm going to give a cursory definition/example of those three things, so it's (hopefully) clear what I mean:

a container is a binary, on-disk file format commonly used for storing and representing executable files, object files, core dumps, etc. Common examples of this are: ELF (used by Linux, and modern BSD variants); Mach-o (used by OSX); and PE (used by Microsoft)
a wasm module is a sequence of bytes (on disk or over the wire) which
1. describes the meta-data associated with a group of wasm functions/routines/expressions
2. contains the wasm code itself;
since it seems that what exactly this looks like and its exact specifications are still being determined, I'm primarily referencing the structure described in the Binary format global structure. So when I say "wasm module", or just "module", I am referring to this structure.
wasm is the Turing-complete programming language, whose semantics are described here. So when I say "wasm", or "wasm itself", I am referring to this language, and not the module or container in which it is embedded.
Module = Container or Module != Container ?

Now the first tension, and this was highlighted in a PR comment by @jfbastien for the global structure description, was that the wasm module spec looks like a container format specification; but this seems to be in direct contrast to #74, which discusses pros/cons of container formats for wasm itself.

In particular, when I read the wasm module description, and combine this with discussions of wasm container formats, I see this (rough) diagram:

[Figure: elf-wasm diagram]

There are a number of immediate problems with this:

Large redundant swaths of meta-data, offset tables, etc.
ELF specific concerns:
- What will _DYNAMIC do? Does it export the wasm modules exports, or does it just point to a monolithic block of bytes (the module)? Similarly, what are the dynsym offsets pointing to?
- Do the string tables contain wasm function names?
- What is the point of the program headers? If wasm isn't compiled to the native CPU's ISA, I think they will have to point to a wasm dynamic loader/interpreter using INTERP; or, if it is compiled to native x86, the pre-compiler or linker will have to relocate all of the symbols, map the module exports to ELF symbols pointed to in the _DYNAMIC array, etc., with the final result being a destructive one-way trip to ELF land
- Similarly with the section headers; what do they point to?
Where is the wasm module embedded? Is it a giant, monolithic section as in the diagram, which is then loaded by the dynamic loader or interpreter, or is the structure teased apart and planted into the ELF format (again, doing so is a one-way trip, you almost certainly won't be able to recover the wasm module once it's been encoded into ELF)?

For these reasons, in the wasm design spec it is still unclear where export symbol information, opcode tables, and other meta data concerns are going to be placed, and importantly whether it is:

at the wasm module level && module = container
at the wasm module level && module != container (what is the point of the container then?)
at the container level && module != container (what is the point of a module then?)

(I've ignored cases at the wasm source language level, similar to a C extern, because I don't think this granularity of linkage is being considered for wasm)

Is the wasm design spec on github a wasm spec, or a wasm and wasm module spec?

Currently, it seems like the latter; but then the problem above needs to be clarified (asap, I should think).

If it is the former however, then discussions revolving around export details, debugging symbol format/details, etc., are red herrings. These are all strictly container/module/linkage concerns; there is no such thing as an exported symbol in x86, just as there is no such thing in the semantics of wasm itself. This concept (of exported symbols, etc.) only occurs at the meta-data or linkage level (c's extern, etc.), which then gets translated specifically to whatever container format with whatever instruction set is being used.

So if the wasm design spec is a wasm language spec, but not also a wasm module spec, then strictly speaking, there shouldn't be any discussions concerning exported symbols, debugging symbols, etc., or at most they're cursory and provide minimal implementation commitments because their existence/definition is entirely orthogonal to a pure language spec.

For my own part, I hope this isn't the case; one of the things that first attracted me to wasm was not so much the language itself (which is awesome), but what I thought was the potential for a new container spec to correct deficiencies in previous formats. In my opinion, a strong, consistent, logical, and easily analyzable container format/module definition, where module = container, would do more for portability than mandating ELF as the container format because llvm has a backend for ELF, or because it's commonly used, etc.

There are many problems with the ELF format imho (and many great things), that I won't go into for length considerations; but the structure sketched currently for wasm modules already seems more promising to me; if the module forces certain byte patterns, this makes it easier to generate modules, and to reason about those modules. E.g., if the module's on-disk byte representation is a linear byte layout consisting of something (roughly) like:

fixed size header +
fixed size sections header +
(num section headers * fixed size section header) +
required definitions size +
required code size +
required exports size +
required imports size +
optional additional headers size

and deviations from this are invalid modules, then the wasm module becomes a very easy target for backend writers.
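To make the "very easy target" claim concrete, here is a minimal sketch of such a strict linear layout. Everything specific in it is an assumption made for illustration: the `wmod` magic, the 12-byte header and section-header encodings, and the section kind numbers are invented and are not taken from the wasm binary format.

```python
import struct

# Hypothetical layout (NOT the real wasm binary format):
#   header:          magic (4 bytes), version (u32), section count (u32)
#   section headers: kind (u32), offset (u32), size (u32), one per section
#   payloads:        back to back, in header order, with no gaps
HEADER = struct.Struct("<4sII")
SECTION_HEADER = struct.Struct("<III")
MAGIC = b"wmod"                # made-up magic number
REQUIRED_KINDS = [0, 1, 2, 3]  # definitions, code, exports, imports


def build_module(payloads):
    """Serialize a list of (kind, bytes) pairs into the linear layout."""
    offset = HEADER.size + SECTION_HEADER.size * len(payloads)
    parts = [HEADER.pack(MAGIC, 1, len(payloads))]
    for kind, data in payloads:
        parts.append(SECTION_HEADER.pack(kind, offset, len(data)))
        offset += len(data)
    parts.extend(data for _, data in payloads)
    return b"".join(parts)


def parse_module(blob):
    """Return [(kind, payload)], raising ValueError on any layout deviation."""
    if len(blob) < HEADER.size:
        raise ValueError("truncated header")
    magic, _version, count = HEADER.unpack_from(blob, 0)
    if magic != MAGIC:
        raise ValueError("bad magic")
    table_end = HEADER.size + SECTION_HEADER.size * count
    if len(blob) < table_end:
        raise ValueError("truncated section table")
    expected, sections = table_end, []
    for i in range(count):
        kind, offset, size = SECTION_HEADER.unpack_from(
            blob, HEADER.size + i * SECTION_HEADER.size)
        if offset != expected:
            raise ValueError("payloads must be contiguous and in header order")
        if offset + size > len(blob):
            raise ValueError("section runs past the end of the module")
        sections.append((kind, blob[offset:offset + size]))
        expected = offset + size
    if expected != len(blob):
        raise ValueError("trailing bytes after the last section")
    if [kind for kind, _ in sections[:4]] != REQUIRED_KINDS:
        raise ValueError("required sections missing or out of order")
    return sections
```

Because every offset is determined by the sizes before it, the writer is a single pass and the reader can reject anything that deviates, which is the property argued for above.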

If someone wants to write a wasm module -> ELF or wasm module -> Mach-o translator (and they will), then this seems to do more for portability than, again, mandating that ELF is the container format for wasm modules and specifying how the module maps to ELF (and then someone would have to write an ELF wasm binary -> Mach-o translator if they wanted to "port" it to run natively on OSX, which would be much harder than wasm module -> Mach-o).

Much of this is touched upon in #74, so I won't go into detail, but I would recommend revisiting that topic because, as it stands, I think this is a design decision that needs to be resolved.

Conflating wasm with its container

Lastly, sometimes there seems to be unintentional conflation between wasm modules and wasm itself, to varying degrees (I've been guilty of this myself, because I assumed wasm had its own container format, and that it was a part of the wasm spec). I hope to have clarified my usage with wasm/wasm module. But for example in #249, many answers mentioned that wasm could be targeted outside of JS. But that is an answer about compiling to wasm the language, not an answer about the use of wasm's module/container as a universal container; a universal binary format container does not dictate (in principle) the languages it can contain, just like ELF doesn't require only x86 instruction set byte sequences. Now, this might be the intent of this question, essentially whether it could be used as a universal bytecode; but the bytecode still needs a container, especially when exports and imports are involved; and choosing ELF or PE or mach-o is far from a universal binary format (since something has to parse those containers to get the bytecode).

This point can be further pressed on how we answer the following question:

1. Is the wasm module spec only compatible with wasm itself? (the current version, which includes opcode tables, does seem to suggest this)
2. Or, true to a binary container format like ELF, can it contain arbitrary instruction sets/languages (like ELF can), and specify which symbols are exported, imported, what their offsets are, where the code is, etc.?

I'm not sure either of these has an easy answer; if you choose 1., then all of the problems detailed above arise (what the module is/does, whether it is a container, etc.); if you choose 2., then there's (a lot) more work to be done, and other non-wasm considerations.

Conclusion

There are lots of ways it could go, but here is one scenario I'll propose to argue about/discuss:

1. the wasm module spec will only be compatible with itself (does not support other assemblies)
2. the wasm module is a container
3. until wasm modules become the sine qua non of universal and free as in freedom computation, backends (if they desire), can implement a wasmc functionality, which takes a wasm module and outputs the os's preferred native binary format (this will be a one-way, destructive trip)
4. Because 1. and 2., interpreters built for a system can load wasm modules and run them cross-platform.

That situation might look something like the following diagram:

[Figure: wmodule diagram]

lukewagner commented 9 years ago

I think our shared goals are what you described: the wasm spec defines exactly how you represent a module as a binary (code and container) and focuses only on wasm. I think #249 was a little too general to answer effectively (which is why it was closed after the concrete suggestions were incorporated); I think the important thing is not to expand our scope to more than we've set out to accomplish in HighLevelGoals.md (an easy way to kill a project).

Independently of this, the question (asked by #74) is whether we should define our wasm binary format in such a way that some ELF tools might Just Work (or at least require less work to make work). You're right that it would be awkward; so we need to do a careful cost/benefit analysis (happy to have your input on that). But whatever the outcome, the wasm spec would fully determine the binary representation of a module.

sunfishcode commented 9 years ago

WRT the first diagram above, if we use ELF, what we'd do is define a mapping from the wasm spec to ELF, rather than just embed wasm's headers inside of ELF. A wasm section would just be an ELF section, a wasm section header would just be an ELF section header, and so on. If there are fields that don't map well, we'd probably make a new custom section for those fields. We wouldn't want to duplicate information.

m4b commented 9 years ago

You're describing something like this scenario:

[Figure: wasm-elf diagram]

As I mentioned, this will almost certainly be a one-way, destructive trip, as you could lose information.

Another thing to be aware of is that the ELF specification allows:

regular symbol stripping provided by strip, and what nm and other binutils display;
section stripping (all section headers are removed). A program on some Linux distros called sstrip provides this functionality; many vendors will run this on ELF binaries to protect their IP (it's common in Android system binaries like the radio drivers, etc.).

Really, the only thing you need to run an ELF binary are program headers with an interpreter name (dynamic loader), an entry point, and a _DYNAMIC array (and I believe this is even optional). This doesn't seem to map well to a wasm module...
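As a rough illustration of that point, the sketch below reads only the ELF64 file header and program headers, which is the part a loader depends on, and never looks at section headers. The header layouts and the `PT_INTERP`/`PT_DYNAMIC` values follow the ELF64 definitions; the helper names and the synthetic test image are made up for this example, and the image is not a runnable binary.

```python
import struct

EHDR = struct.Struct("<16sHHIQQQIHHHHHH")  # ELF64 file header (64 bytes)
PHDR = struct.Struct("<IIQQQQQQ")          # ELF64 program header (56 bytes)
PT_DYNAMIC, PT_INTERP = 2, 3


def loader_view(blob):
    """Collect the entry point, interpreter path and _DYNAMIC offset."""
    fields = EHDR.unpack_from(blob, 0)
    ident, e_entry, e_phoff = fields[0], fields[4], fields[5]
    e_phentsize, e_phnum, e_shnum = fields[9], fields[10], fields[12]
    if ident[:4] != b"\x7fELF" or ident[4] != 2 or ident[5] != 1:
        raise ValueError("expected a little-endian ELF64 file")
    view = {"entry": e_entry, "interp": None, "dynamic": None,
            "section_headers": e_shnum}
    for i in range(e_phnum):
        p_type, _flags, p_offset, _vaddr, _paddr, p_filesz, _memsz, _align = \
            PHDR.unpack_from(blob, e_phoff + i * e_phentsize)
        if p_type == PT_INTERP:
            path = blob[p_offset:p_offset + p_filesz]
            view["interp"] = path.rstrip(b"\0").decode()
        elif p_type == PT_DYNAMIC:
            view["dynamic"] = p_offset
    return view


def synthetic_stripped_elf():
    """Build a tiny section-less image for illustration (not executable)."""
    interp = b"/lib64/ld-linux-x86-64.so.2\0"
    ident = b"\x7fELF" + bytes([2, 1, 1]) + bytes(9)  # ELF64, LE, version 1
    interp_off = EHDR.size + PHDR.size
    ehdr = EHDR.pack(ident, 3, 62, 1, 0x1000, EHDR.size, 0, 0,
                     EHDR.size, PHDR.size, 1, 0, 0, 0)  # e_shnum = 0
    phdr = PHDR.pack(PT_INTERP, 4, interp_off, interp_off, interp_off,
                     len(interp), len(interp), 1)
    return ehdr + phdr + interp
```

Calling `loader_view(synthetic_stripped_elf())` still recovers the interpreter path and entry point even though the image has zero section headers, which is the sstrip situation described above.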

sunfishcode commented 9 years ago

Yes, in the fullness of the spirit of ELF, we'd replicate all the essential information from Shdrs in the Phdrs too. I've recently heard ELF described as "smart format, dumb tools", reflecting that the format has specialized data structures like Phdrs intended to keep individual use cases like executing the program simple, even though it ends up duplicating information to do so.

I personally am increasingly of the opinion that ELF brings more baggage than benefit, but it's still an open question.

m4b commented 9 years ago

Of course, if the wasm module is given a clean definition, it shouldn't be too difficult to create a wasm -> ELF, if desired. This is sort of the situation I was hinting at in the second diagram in the original post: if the wasm module is considered a container in its own right, and wasm interpreters are built to parse and execute them, then you've actually got more portability options. There is the interpreted route, but also a backend, which I called wasmc, which takes the wasm module and outputs the appropriately formatted native container (with the caveat that you probably can't go back from that container to the wasm module, etc.).

sunfishcode commented 9 years ago

A wasm file will obviously need to hold enough information to be correctly interpreted, and one of my assumptions is that this level of information is sufficient to correctly translate it into a native binary program in a native binary container too. Under this assumption, the wasmc idea you describe will be possible no matter what format we use.

m4b commented 9 years ago

@sunfishcode

Under this assumption, the wasmc idea you describe will be possible no matter what format we use.

Well, yes and no. If, for example, ELF was chosen as the container format for wasm, then if you wanted that to run natively on osx, you'd have to create an ELF -> mach-o.

As far as I know, there does not currently exist a program for doing this. From what I understand, objconv only converts object files, and not executable files, from ELF to mach.

If, however, the container format is a spec'd out wasm module (i.e., a new container format), and is carefully crafted, going from wasm -> elf, and wasm -> mach-o should be easier than the alternative (ELF is chosen as the container format).

This is the situation I meant to describe, and which can be seen in the last diagram in the original issue.

sunfishcode commented 9 years ago

@m4b I still don't quite see what you're driving at. But from your mention of objconv, I can perhaps perceive a difference in perspective and underlying assumptions here. objconv seems to be about minimally translating between different container formats while preserving, and remaining ignorant of, as much of the contents as possible. ELF->ELF transforms are likely simple, because most information can just be left in place.

My assumption about wasm modules is that translating to a native executable format would require fully understanding the entire wasm semantics and completely rewriting the contents to reproduce the semantics. Consequently, wasm ELF -> native ELF isn't going to be much different than wasm ELF -> native mach-o or wasm custom -> native ELF or other things.

As one example, wasm can't use ELF's PT_LOAD for text segments in the usual way, because the code is a virtual ISA that has to be compiled before it can be loaded into memory, and the code isn't allocated inside the application's address space (so we can't put it at p_vaddr). If we use ELF, I imagine we'd end up defining CPU-specific segment types (PT_LOPROC+x) to use for text instead. And if we do that, a translation tool won't be able to just copy or even transliterate the existing segment headers. It'll have to read them, understand what they're doing, and produce something new which achieves the desired behavior.
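A small sketch of that last point, under stated assumptions: `PT_LOPROC` and `PT_HIPROC` are the standard bounds of the processor-specific segment range, but `PT_WASM_TEXT` and both function names are hypothetical, invented here only to show why a generic tool cannot simply copy such a segment.

```python
PT_LOAD = 1
PT_LOPROC, PT_HIPROC = 0x70000000, 0x7FFFFFFF  # processor-specific range
PT_WASM_TEXT = PT_LOPROC + 1                   # hypothetical segment type


def generic_tool_action(p_type):
    """What a format-only tool can do with a segment it was not taught about."""
    if PT_LOPROC <= p_type <= PT_HIPROC:
        return "opaque"  # meaning depends on the target, cannot be transliterated
    return "copy"


def wasm_translator_action(p_type):
    """A wasm-aware translator has to compile the virtual-ISA text instead."""
    if p_type == PT_WASM_TEXT:
        return "compile"
    return generic_tool_action(p_type)
```

A `PT_LOAD` segment can be copied by either tool, while the hypothetical wasm text segment is opaque to the generic one and needs real compilation in the wasm-aware one.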

m4b commented 9 years ago

@sunfishcode sorry for the delay, and sorry for being unclear.

My position:

1. going from a custom wasm-module -> native formats will be easier than "ELF wasm" -> native formats
2. and easier is not only easier from a back-end author's perspective, but also from a portability perspective, since it requires a custom wasm-module, which makes the module more portable on another system imho. I believe the last diagram in the original post explains this last point better than I.

So I think your major point is in contradiction to 1.:

wasm ELF -> native ELF isn't going to be much different than wasm ELF -> native mach-o or wasm custom -> native ELF or other things.

Here is a detailed synopsis of the mach-o binary format. If you are familiar with ELF, which I believe you are, then you'll note that they're substantially different. There is invariably loss of information when translating between the two; e.g., there is no notion of size in mach exports, as there is in ELF dynamic symbols; there is no notion of exporting library per dynamic symbol in ELF, as there is in mach; etc. All of these awkward differences will have to be accommodated in a custom translator which is intimately familiar with how wasm works (as you said), how ELF works, and wasm specific ELF additions, and how mach-o and other native formats work.
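The information loss can be sketched with three toy record types. These are simplified stand-ins, not the real ELF or Mach-O structures: `NeutralSymbol` plays the role of a custom module's symbol entry, and all class and function names are invented for this illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NeutralSymbol:   # what a custom module format could record
    name: str
    size: int
    library: str


@dataclass(frozen=True)
class ElfLikeSymbol:   # has a size, but no exporting library per symbol
    name: str
    size: int


@dataclass(frozen=True)
class MachLikeSymbol:  # has a library ordinal, but no size
    name: str
    library_ordinal: int


def to_elf_like(sym):
    return ElfLikeSymbol(sym.name, sym.size)


def to_mach_like(sym, libraries):
    # Ordinals are 1-based positions in the declared library list in this sketch.
    return MachLikeSymbol(sym.name, libraries.index(sym.library) + 1)


def elf_like_to_mach_like(sym, libraries):
    # The exporting library was dropped on the way into the ELF-like form,
    # so there is nothing to compute the ordinal from.
    raise LookupError(f"no exporting library recorded for {sym.name!r}")
```

Lowering from the neutral record works in both directions, while going from the ELF-like record to the Mach-O-like one has to guess or fail, which is the asymmetry being argued here.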

So, what I'm suggesting/driving at is that it will be harder to go from:

ELF wasm -> Mach-o

than it will be to go from:

wasm-module -> ELF and wasm-module -> Mach-o

given a "nice" wasm-module container format (what I believe you're calling "wasm-custom").

The reason I think going from wasm-module -> Mach-o is better is probably best served as another discussion, and as I mentioned in the original post, applicable to the original issue of "what container format are we using", but cursorily:

you mention you'll have to add CPU specific segment types for the virtual ISA; this requires more work for someone parsing the ELF format now, so some of the arguments for using ELF for the "auto-tools" seem less applicable
no matter what format is used, any translator/compiler will "have to fully understand the entire wasm semantics" when generating modules or native container formats, so if this is an argument for ELF over a custom wasm module container format, I fail to see the force of it
given a well thought out "custom" wasm-module container format, many of the awkward differences between native binary formats can be avoided (some of which I mentioned above), or can be easily accommodated when translating/compiling from the custom wasm module format to the native format.

The last point is admittedly somewhat circular, and is in need of more explanation, but I'm not sure this is the right issue for it.

I hope this clears up my position.

sunfishcode commented 9 years ago

As your post suggests, one of the big questions here is: What will the semantics of dynamic libraries be? Is there a flat namespace or a two-level namespace? ELF and Mach-O are file formats at one level, but they also are traditionally associated with a lot of high-level semantics as well. And I agree that if we chose semantics that are close to a native ELF implementation's semantics, it may be easier to translate wasm into that native ELF format.

I've not done a full survey, but I've heard ELF's traditionally flat namespace described as being an attempt to remain compatible with static archive-based libraries which didn't actually provide consistent compatibility and caused trouble besides. And meanwhile, Mach-O folks continue to believe two-level namespaces are worthwhile. Unless other considerations arise, I'll likely suggest we follow the two-level approach, which wouldn't absolutely rule out ELF, but would be a significant point against it, and probably enough to close the question.

m4b commented 9 years ago

:+1: two-level namespaces. As you mentioned, this could be a pretty significant problem if ELF is chosen as the container format. Really the PE format uses it as well, since as I understand it, each import declares its exporting library (in mach it's an index into the libraries in the order they're declared) - whereas in ELF, and other flat namespace formats, the dynamic linker first searches the executing binary for the symbol, then the libraries in the order they're declared, etc.
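The difference between the two lookup rules can be sketched as follows. This is a simplified model under stated assumptions: real dynamic linkers have more rules (weak symbols, versioning, interposition), and the function names and the `exports` table are invented for the example.

```python
def resolve_flat(symbol, load_order, exports):
    """Flat namespace: the first image in load order exporting the name wins."""
    for image in load_order:
        if symbol in exports.get(image, ()):
            return image
    raise LookupError(symbol)


def resolve_two_level(symbol, library, exports):
    """Two-level namespace: the import names its exporting library."""
    if symbol in exports.get(library, ()):
        return library
    raise LookupError(f"{library} does not export {symbol}")


# Hypothetical images: both libraries export a symbol called "log".
exports = {
    "app": {"main"},
    "liba": {"log"},
    "libb": {"log", "draw"},
}
load_order = ["app", "liba", "libb"]
```

With these tables, `resolve_flat("log", load_order, exports)` binds to `liba` purely because of declaration order, while `resolve_two_level("log", "libb", exports)` binds to `libb` as the import requested.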

For me it boils down to the pros and cons of this: when you choose a format, you choose that format's idiosyncrasies, and furthermore, force those idiosyncrasies into whatever format it's being translated to. In many cases, again as you point out, flat vs. two-level, this will become awkward at best.

As such, there exists the opportunity to choose our own idiosyncrasies by creating our own - and in the process, hopefully make them good ones (or at least reasonable).

Lastly, as you mentioned, much of this relates to the semantics of dynamic libraries, which admittedly aren't part of the MVP. But for me, container formats are interesting primarily in how they define and implement imports/exports for dynamic libraries. While this isn't a concern of the MVP, the decisions for the container format will directly impact the concrete implementation of dynamic libraries in the future, so it's something to be very cautious about, and to consider all options for, which I think we're doing :)

sunfishcode commented 9 years ago

In #74, I've now taken the discussion here to the next logical step and am proposing we use a custom format instead of ELF. Assuming this proposal is accepted, I believe we'll be in the scenario outlined in the Conclusion section and diagram in the initial post above, though the specifics of the custom format aren't designed yet. @m4b do you agree?

m4b commented 9 years ago

@sunfishcode agreed. I think you pretty much summed up what needed to be summed up here.

And ditto on a new issue for the specifics of the custom container format, etc.

sunfishcode commented 9 years ago

Great. I'll close this issue then, and look forward to new issues once #74 is decided (either way -- if we do a custom format, to design it, and if we use an existing format, to figure out the mapping and any custom parts we'll need to add).

WebAssembly / design