Document expected import/export behavior for tools

Today, my compiler successfully emitted a working wasm file with the help of the LLVM back-end and the wasm-ld linker. Figuring out how to do so proved to be a frustrating and time-consuming challenge in large part due to breaking LLVM changes and a lot of undocumented behavior around how the tools handle import and export sections. Finding helpful answers required gathering together stray data cast around in obscure websites (and running across a lot of advice that is now wrong) and having to read in detail the LLVM source code used to generate and link wasm modules.

So ... I recommend that this repo contain a document on expected import/export behavior across tools. Some of this information should also be selectively disseminated elsewhere, where appropriate. Here is some information I believe would be useful to cover:

How LLVM (both the backend & linker) can decide by default what names to identify as exported and what to be imported. At some point between LLVM v5 and v7, the LLVM backend was neutered to be unable to generate a usable wasm file on its own: it generates no export section and hardcodes only two imports (for memory and table) with loader-unusable names. Not only is this generated wasm unusable, it forces the developer to manually enumerate what to export/import as part of the linker step, a mistake-prone process not required when creating object files and executables for native OS. In addition, the linker strips off the memory/table import and assumes they are to be exported instead.

Why would it not make more sense for wasm files generated by a compiler to be workable as is (without requiring a linker step) and let the compiled programs specify what to import/export based on the language's visibility attribute (a concept baked into LLVM) or alternatively based on the DLLimport/export (a concept also baked into LLVM)? If the linker is given wasm files with existing import/export sections, it need not strip or override them, but rather merge them. If no import/export info exists, or it needs to be overridden, the linker's many options currently supported can be used to add or change these settings. But it feels like the linker has less information available to make default decisions than the compiler does, so let the compiler lead if it chooses to.

Clearly document the linker's import/export manipulation options somewhere and describe explicitly what they do: --export, --export-dynamic, --export-table, --import-table, --export-memory, --import-memory, --allow-undefined-file (!).
Establish a consistent default behavior re: --import-memory vs. --export-memory. Personally, I would select --import-memory as the default because it offers more flexibility in sharing and sizing memory across wasm modules. But whichever is chosen, there should be an agreement on the module name ('env'?) and the name of memory ('memory' vs. '__linear_memory'). Ditto for the table, which will play a larger role evidently in dynamic loading of wasm modules.

Note as well that the LLVM backend generates text-based wat files whose import/export sections don't match what is generated in the binary wasm file, making diagnosing problems harder because you don't expect to have to use the wabt tools to see what was really generated.

The JavaScript documentation describing instantiation should also be beefed up to describe clearly how instantiation differs when memory is imported vs. exported (and what that means), and to state that when memory is imported, the import module should be named 'env' and the memory is expected to be called 'memory'. Ditto for the table's conventions.

I have no experience with the emscripten toolchain and backward compatibility issues, which no doubt complicate these decisions. I suspect I have gotten some stuff wrong (sorry). My intent here is to help make it easier for those that will come along afterwards. Perhaps other compilers (e.g., Rust or Zig, e.g., https://github.com/ziglang/zig/issues/1570) might also have valuable feedback on these standards before what people do solidifies too much more, making it impossible to corral in.

Thanks for feedback, and apologies for not having better documentation in place.

There is a lot of stuff to respond to here so let me know if I missed something.

Linker ABI documentation

The linker ABI/object file format documentation is intended to live in this repo under Linking.md and DynamicLinking.md. We do try to keep it up-to-date with what we implement in llvm but it's not yet complete and can sometimes get out of sync.

Executable Object Files

I'm afraid that at this point this is a non-goal. lld or some compliant linker is required to produce the final output. We do hope to continue to make the object files valid wasm, but when we moved from imports/exports to an explicit symbol table we abandoned any possibility of executable object files. Note that no other llvm formats produce executable object files.

Linker documentation

This is still a work in progress but in addition to the normal --help docs the plan is to document this stuff at https://lld.llvm.org/WebAssembly.html. Unfortunately my last attempt to update this page had to be reverted: https://reviews.llvm.org/D52048. This needs to be re-landed ASAP.

Exported symbols

We do want to honor the symbol's visibility settings when choosing which symbols to export. When building a DLL (still a work in progress) lld will currently export any symbols with visibility default. To get the same behaviour for an executable you need to add --export-dynamic, which matches the behaviour of native platforms. I agree this behaviour should be better documented and I will add it to https://lld.llvm.org/WebAssembly.html.

In addition I am hoping to add a separate attribute to allow explicit exporting of symbols from executables (other than --export-dynamic which can often export too much and prevent GC). See: https://github.com/WebAssembly/tool-conventions/issues/64.

LLVM text-based output

I'm not sure what you are referring to here. llvm can generate .s assembly files, but it can't generate or consume the .wat format. Note that the .s format is still undergoing some changes.

Import module names

For now we use 'env' for all imports, mostly because this is what emscripten did. There are plans to make this more flexible. See https://reviews.llvm.org/D45796. Eventually for dynamic linking we may want to start using the module name to form a two level namespace for DLLs so each import would include both the module name and the symbol name.

Builtin symbol names

There is a proposed change to make this more consistent: https://reviews.llvm.org/D43675. These names should be documented in https://github.com/WebAssembly/tool-conventions/blob/master/Linking.md but currently are not. I've opened a separate bug for that: https://github.com/WebAssembly/tool-conventions/issues/82

Thank you for your helpful, detailed reply. I will respond below as per your sections, adding back anything missing. My intent here is to be helpful, and to that end, I am prepared to invest time (where my work permits) to write up words/documents or code changes that help further WebAssembly's success. I am also amenable to feedback that improves this process for you and others - e.g., breaking up my suggestions into discrete issues, posting them in other places, etc.

Linker Documentation

https://github.com/WebAssembly/tool-conventions/blob/master/Linking.md contains a lot of helpful information, but says almost nothing about how the linker handles import and export sections. I think it would be helpful if it indicated that the linker section effectively replaces (and ignores) import/export data coming in. I think it should also briefly explain how the linker section information translates to the creation of import/export sections.

The wasm-ld changes you are working on with https://reviews.llvm.org/D52048 look really helpful. Thank you for doing that. I have a few questions based on reviewing the draft there:

Is --export-memory missing from the list?
Are memory and table exported or imported by default? (should import be the default?)
I thought I remembered that --export=symbol used to be an option, is it now gone?
Is --allow-undefined the same as --allow-undefined-file? I think it would be helpful to indicate this generates imported symbols and that the data exists in a file where each symbol is on a separate line.
Is the linker able to auto-generate all necessary 'imports' (e.g., JS functions the module requires and has declared) without having to explicitly name them via --allow-undefined-file?
Where should it be documented that __heap_base and __data_end are always exported (and what they mean)?
The way a program specifies symbol visibility (visibility=default) is likely compiler/language-specific
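To make the --allow-undefined-file question concrete, here is a minimal sketch in Python. It assumes the file format described above (one symbol per line) and uses only the option spellings that appear in this thread. The symbol names (js_log, js_now, run) and file names are hypothetical, and the command is only assembled, not executed.

```python
import os
import tempfile


def write_undefined_file(path, symbols):
    """Write one symbol name per line, the format assumed in this thread."""
    with open(path, "w", encoding="utf-8") as f:
        for name in symbols:
            f.write(name + "\n")


def build_link_command(objects, output, undefined_file, exports=(), import_memory=False):
    """Assemble (but do not run) a wasm-ld command line as a list of arguments."""
    cmd = ["wasm-ld", "-o", output]
    if import_memory:
        cmd.append("--import-memory")
    cmd.append("--allow-undefined-file=" + undefined_file)
    cmd += ["--export=" + name for name in exports]
    cmd += list(objects)
    return cmd


# Hypothetical example: two functions supplied by the host, one exported entry point.
with tempfile.TemporaryDirectory() as tmp:
    syms = os.path.join(tmp, "imports.syms")
    write_undefined_file(syms, ["js_log", "js_now"])
    cmd = build_link_command(["main.o"], "main.wasm", syms,
                             exports=["run"], import_memory=True)
    print(" ".join(cmd))
```

Whether the linker could derive this list on its own, without the file, is exactly the open question above.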

In addition to this list of options, I think it would be extremely valuable to provide an educational section (as I think you mention intending to do in the 'exported symbols' response) that offers helpful context about the export/import implications of choosing these options. Let me know if you want me to put together a draft demonstrating what I think it should cover.

Executable Object Files

I understand no other LLVM format provides executable object files. That said, I don't see any obvious restriction in the wasm format that impedes this and it used to be possible. If possible, I would like to understand why this is a non-goal. Is there a public document I can review to understand the rationale? If not, is there a forum, private or public, where the technical obstacles can be explored?

"when we moved from imports/exports to an explicit symbol table" Ok! It finally clicked for me what you mean! If this radical change was documented and publicized somewhere, I failed to find it. I can guess why this change was necessary. But I still do not understand why this means the backend cannot also generate valid import/export sections, so that the generated wasm can be used as-is.

Importantly, where would a user of the LLVM backend (any writer of a language compiler) go to learn that a generated wasm file is now unusable as-is and that use of the wasm-ld linker is now a required part of their toolchain, thereby adding to the LLVM build requirements imposed on every user of the compiler? Where would they go to understand exactly how wasm-codegen translates LLVM visibility and other attributes into the linker section's information (especially the symbol flags)?

LLVM text-based output

"llvm can generate .s assembly files" That is what I am talking about. The information generated in the WebAssembly assembly file does not match what is generated in the .wasm binary file, and nowhere are these discrepancies documented. It's not a big deal except if you don't know that then you can't understand why the browser is complaining about a problem with the wasm file that literally is not visible in the generated assembly file (e.g., it imports __linear_memory from env). If the LLVM user knows they won't be exactly the same, they can use the wabt tools or Firefox to more quickly diagnose the problem.
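For anyone who wants to see what a .wasm binary really imports and exports without installing wabt, the following is a rough Python sketch that walks the binary's section list and decodes only the import and export sections. It follows the core WebAssembly binary layout (section id 2 for imports, 7 for exports) and is not taken from any LLVM or wabt code. The sample module is built by hand and imports '__linear_memory' from 'env', as in the case mentioned above.

```python
KINDS = {0: "func", 1: "table", 2: "memory", 3: "global"}


def read_uleb(buf, pos):
    """Decode an unsigned LEB128 integer; return (value, next_pos)."""
    result = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if byte & 0x80 == 0:
            return result, pos
        shift += 7


def read_name(buf, pos):
    """Decode a length-prefixed UTF-8 name; return (name, next_pos)."""
    length, pos = read_uleb(buf, pos)
    return buf[pos:pos + length].decode("utf-8"), pos + length


def skip_limits(buf, pos):
    """Skip a limits record: flags, minimum, and an optional maximum."""
    flags, pos = read_uleb(buf, pos)
    _, pos = read_uleb(buf, pos)
    if flags & 1:
        _, pos = read_uleb(buf, pos)
    return pos


def list_imports_exports(wasm):
    """Return (imports, exports) found in a wasm binary given as bytes."""
    if wasm[:4] != b"\x00asm":
        raise ValueError("not a wasm binary")
    imports, exports = [], []
    pos = 8  # skip the magic number and the version field
    while pos < len(wasm):
        section_id = wasm[pos]
        size, pos = read_uleb(wasm, pos + 1)
        end = pos + size
        if section_id == 2:  # import section
            count, p = read_uleb(wasm, pos)
            for _ in range(count):
                module, p = read_name(wasm, p)
                field, p = read_name(wasm, p)
                kind = wasm[p]
                p += 1
                if kind == 0:    # function: type index
                    _, p = read_uleb(wasm, p)
                elif kind == 1:  # table: element type byte, then limits
                    p = skip_limits(wasm, p + 1)
                elif kind == 2:  # memory: limits
                    p = skip_limits(wasm, p)
                elif kind == 3:  # global: value type byte, mutability byte
                    p += 2
                else:
                    raise ValueError("unknown import kind %d" % kind)
                imports.append((module, field, KINDS[kind]))
        elif section_id == 7:  # export section
            count, p = read_uleb(wasm, pos)
            for _ in range(count):
                name, p = read_name(wasm, p)
                kind = wasm[p]
                _, p = read_uleb(wasm, p + 1)  # index of the exported item
                exports.append((name, KINDS[kind]))
        pos = end  # every other section is skipped unread
    return imports, exports


# Hand-built sample: one memory import, env.__linear_memory, minimum one page.
_field = b"__linear_memory"
_payload = b"\x01" + b"\x03env" + bytes([len(_field)]) + _field + b"\x02\x00\x01"
SAMPLE = b"\x00asm\x01\x00\x00\x00" + b"\x02" + bytes([len(_payload)]) + _payload

print(list_imports_exports(SAMPLE))
```

Running the same function over a compiler's object file and over the linker's output shows the difference between the two directly.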

Import module names

"For now we use 'env' for all imports". Where is this standard (and "memory" and "table") documented? For example, consider https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/WebAssembly/instantiateStreaming. There is no mention of this, and the example uses "imports". Even better, look at the example on this page: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/WebAssembly/Memory . It uses 'js' instead of 'env' to import memory and 'mem' instead of memory for the memory object. This example will fail and will fail silently if the wasm was built incorrectly by wasm-ld via the default "--export-memory" option. Two completely different memory objects will now exist, neither seeing the other's data. It took me some time to figure out why my JS functions could not see data passed to them via the wasm module. I personally believe this information should be broadly published across all relevant documents and easy to find, with the differences between exported vs. imported memory/table (and the implications/dangers of getting it wrong) clearly spelled out, especially when creators and JS users of wasm modules are not likely to be wizards at all these technical internals.
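The mismatch described above can be checked mechanically. The sketch below, in Python, compares the (module, name) pairs a wasm module declares as imports against the pairs the host's import object supplies. The function name compare_imports and the data shapes are illustrative assumptions, not part of any existing tool. An entry in the "unused" list corresponds to the silent case: the host passes a memory the module never imports, so the module's own exported memory and the host's memory are two different objects.

```python
def compare_imports(required, provided):
    """Compare a module's declared imports with a host import object.

    required: iterable of (module, name) pairs the wasm module imports.
    provided: dict mapping module name -> dict of name -> host object.
    Returns (missing, unused), both as sorted lists of (module, name) pairs.
    """
    required = set(required)
    supplied = {(module, name)
                for module, fields in provided.items()
                for name in fields}
    missing = sorted(required - supplied)  # the module needs these, host lacks them
    unused = sorted(supplied - required)   # the host passes these, module ignores them
    return missing, unused


# Host code modelled on the MDN example: memory passed as js.mem.
host_imports = {"js": {"mem": object()}}

# Case 1: module linked with exported memory, so it declares no imports.
print(compare_imports([], host_imports))

# Case 2: module linked with --import-memory, expecting env.memory.
print(compare_imports([("env", "memory")], host_imports))
```

In the second case instantiation would be expected to fail outright because the import is missing; the first case is the one that passes without complaint.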

When additional flexibility is provided for alternate module names and two-level namespaces, I assume that appropriate documentation will also need to be updated.

Wrt https://reviews.llvm.org/D43675, renaming 'memory' to '__linear_memory', I understand the value for that change. I don't object if you do so, but please clearly publicize it as a breaking change, as people will need to change their Javascript programs to match the new default emitted by wasm-ld. It should also be reflected on the examples of WebAssembly pages that I cited above. Let me know if you want me to add this commentary to issue #82 .

I hope this feedback is helpful.

@sbc100 It appears that your changes to https://lld.llvm.org/WebAssembly.html have successfully landed. Here are my edit suggestions for this document that I believe would be extremely valuable to anyone who wants to understand how the linker determines what symbols to import and export:

Object file format

The object files that lld expects as input should be well-formed WebAssembly binary files that also include the required relocation and linker custom sections as specified in the WebAssembly tool conventions: https://github.com/WebAssembly/tool-conventions/blob/master/Linking.md. Any import and export sections in these object files will be ignored and stripped, as the linker uses information in the relocation and linker sections, along with linker options, to generate the export and import sections in the produced executable or shared library.

llvm automatically generates object files in the correct format when run with the wasm32-unknown-unknown target. To build llvm with WebAssembly support currently requires enabling the experimental backend using -DLLVM_EXPERIMENTAL_TARGETS_TO_BUILD=WebAssembly.

The llvm backend generates a symbol's linker flags as follows:

WASM_SYM_BINDING_WEAK, based on symbol having any 'weak' or 'linkonce' llvm linkage type (e.g., when C++ generates the same template in multiple compilation units).
WASM_SYM_BINDING_LOCAL, based on the symbol being defined and having the 'internal' or 'private' llvm linkage type (e.g., when the C 'static' is used).
WASM_SYM_VISIBILITY_HIDDEN, based on the symbol having the llvm visibility style 'hidden' (vs. default).
WASM_SYM_UNDEFINED, based on the symbol not being defined (e.g., a function that specifies a signature but not implementation code).
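The four rules above can be written down as a small Python function. This is only a sketch of the mapping as described in this post, not a transcription of the LLVM backend. The numeric flag values are the ones given in Linking.md as far as I can tell and should be checked against that document. The set of linkage names treated as weak is an assumption about what "any 'weak' or 'linkonce' linkage type" covers.

```python
# Flag values as listed in the tool-conventions Linking.md (verify before relying on them).
WASM_SYM_BINDING_WEAK = 0x01
WASM_SYM_BINDING_LOCAL = 0x02
WASM_SYM_VISIBILITY_HIDDEN = 0x04
WASM_SYM_UNDEFINED = 0x10

# Assumed reading of "any 'weak' or 'linkonce' llvm linkage type".
WEAK_LINKAGES = {"weak", "weak_odr", "linkonce", "linkonce_odr"}
LOCAL_LINKAGES = {"internal", "private"}


def symbol_flags(linkage, visibility, defined):
    """Return the linker symbol flags implied by the rules listed above."""
    flags = 0
    if linkage in WEAK_LINKAGES:
        flags |= WASM_SYM_BINDING_WEAK
    if linkage in LOCAL_LINKAGES and defined:
        flags |= WASM_SYM_BINDING_LOCAL
    if visibility == "hidden":
        flags |= WASM_SYM_VISIBILITY_HIDDEN
    if not defined:
        flags |= WASM_SYM_UNDEFINED
    return flags


# A C 'static' function: internal linkage, default visibility, defined.
print(hex(symbol_flags("internal", "default", True)))
# A declared-but-not-defined external function.
print(hex(symbol_flags("external", "default", False)))
```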

[The following two sections would appear after the description of the linker's options]

Import Section Generation

All symbols placed in the import section (e.g., from Javascript) are assumed to come from the 'env' module by default:

Any live, used, relocatable (if weak) function symbol that is WASM_SYM_UNDEFINED
'memory' which references imported linear memory (if --import-memory is specified).
'__indirect_function_table' which references an imported table (if --import-table is specified)

Note: when importing functions, --allow-undefined must be specified to avoid linker error messages.
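As a rough illustration of the import rules listed above, the Python sketch below builds the list of import entries from a set of symbols and two options. The symbol records (dicts with name, kind, live and flags keys) are a made-up representation for this example, and the condition on weak symbols is left out because its exact meaning is not spelled out here. It reflects this post's reading of the linker's behaviour, not the linker's actual code.

```python
WASM_SYM_UNDEFINED = 0x10  # value as listed in Linking.md (verify before relying on it)


def import_entries(symbols, import_memory=False, import_table=False, module="env"):
    """Return (module, name, kind) import entries under the rules listed above."""
    entries = []
    for sym in symbols:
        is_undefined = bool(sym["flags"] & WASM_SYM_UNDEFINED)
        if sym["kind"] == "function" and sym["live"] and is_undefined:
            entries.append((module, sym["name"], "func"))
    if import_memory:
        entries.append((module, "memory", "memory"))
    if import_table:
        entries.append((module, "__indirect_function_table", "table"))
    return entries


# Hypothetical symbols: one defined function, one undefined function used by it.
symbols = [
    {"name": "run", "kind": "function", "live": True, "flags": 0},
    {"name": "js_log", "kind": "function", "live": True, "flags": WASM_SYM_UNDEFINED},
]
print(import_entries(symbols, import_memory=True))
```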

Export Section Generation

These symbols are placed in the export section:

All live symbols, if --export-all is specified
Every live symbol specified by --export
All live symbols that are not WASM_SYM_VISIBILITY_HIDDEN and not WASM_SYM_BINDING_LOCAL
All live defined WASM_SYM_BINDING_LOCAL symbols that are not WASM_SYM_VISIBILITY_HIDDEN
'memory' which references the linear memory created by the module (the default when --import-memory is not specified)
'__indirect_function_table' which references a table created by the module (only if --export-table is specified)
'__heap_base' indicating the start of the heap
'__data_end' indicating the end of the static data

Notes on the above

Hopefully, I correctly interpreted and described the way the backend and linker code handle this decision logic. I did not include any additional information about --export-dynamic. I also did not include any information about the custom attribute that can be used to specify a module other than 'env', as I was not clear whether the current approach is stable.
