Rename argc/argv `main` to `__main_argc_argv`

sunfishcode commented 4 years ago

Change the ABI for the user entry point to rename main to __main_argc_argv when it has argc/argv. This is needed because wasm requires caller and callee signatures to match, so the traditional trick of passing main arguments even if it doesn't expect them doesn't work.

LLVM and related tools have been using the name __original_main for a similar purpose, but that's a confusing name.

There's also a change here in that this is renaming the argc/argv form rather than renaming the nullary form. The choice is somewhat arbitrary, but I think it's slightly nicer to bias the aesthetics toward the no-argument form, because that's the smaller and simpler form.

sunfishcode commented 4 years ago

As a heads-up, as I was writing this up, it occurred to me that renaming the argc/argv form would be a little nicer than renaming the no-arg form, but I forgot to adjust the title before submitting the pull request. I've now corrected the title.

sunfishcode commented 4 years ago

Yes, that's right. It simplifies things to say that we just rename main, rather than that we keep main around and add a wrapper.

I also plan to submit a clang patch to do this rename in the frontend, instead of in the LLVM backend, to eliminate the wrapper, which has turned out to be a nuisance in practice.

kripken commented 4 years ago

In that case I'm not that happy about the part where we export a different name from the wasm module. That's more complicated than the current single export, and is user-visible and perhaps confusing.

sunfishcode commented 4 years ago

I've now posted a clang patch which implements the __main_argc_argv proposal here.

It's a name mangling rule, which means:

If users call main from other translation units, those callsites are also renamed to match.
When debug info is emitted, it uses the user's name "main".

It's very simple. There's no generated wrapper function, linking works the same way for LTO and non-LTO, and libc code can easily detect whether to link in the command-line argument initialization code by defining its own main function in libc.a:

__attribute__((weak))
int main(void) {
    int argc = ...;
    char **argv = ...;
    return __main_argc_argv(argc, argv);
}

It also isn't that different from the __original_main situation we have right now, where Binaryen is special-case inlining it. If you inline __main_argc_argv into main here, you'll end up with a function called main containing the user main code, so backtraces etc. will look like you want them to.

kripken commented 4 years ago

Sorry @sunfishcode , I'm not sure whether your answer addresses or does not the topic of a different export name?

sunfishcode commented 4 years ago

It doesn't change the _start export. I'd ideally like to rename _start also, but that's a bigger change we can talk about separately.

kripken commented 4 years ago

I apologize, maybe I wasn't clear before. I wasn't talking about _start which I agree is a separate issue. What I'm asking about is what I asked about earlier too, that this proposal makes the wasm module export either main or __main_argc_argv. I wasn't sure if your last comment refers to this (as you mention other callsites etc. which I think may involve exports?).

sunfishcode commented 4 years ago

Ah; in wasi-sdk, main isn't exported from the final wasm binary. I guess in Emscripten it is exported, or it can be?

It is possible for a symbol name to differ from the wasm export name -- @sbc100's export attribute patch allows this, for example. So one possible option is to export __main_argc_argv as main at the wasm module level.

kripken commented 4 years ago

Yes, in non-standalone wasm, emscripten exports main from the wasm binary. I wouldn't want to have a convention that where the export name can differ, so yeah, keeping the same export name as always sounds best.

I'm not sure what "entrypoint" means, in the text of this PR, if it isn't an export? Did we define it somewhere?

sunfishcode commented 4 years ago

I've now rewritten the wording to give a definition of "entrypoint" and mention that export names may differ from ABI symbol names, and to hopefully be more clear overall. Let me know what you think!

sbc100 commented 4 years ago

Right now emscripten has two entry-point modes. The primary mode is where "main" is exported and called directly from JS after the constructors have been called. In this mode there is no _start, only main.

The second mode we what we call "standalone wasm mode" where _start is exported and main is not.

I'm not sure how this change will effect the first/primary mode that emscripten uses, but in general it looks like a good change. We currently export main by passing --export=main which is a linker flag (and which will work with wask-sdk too btw). I think this change as proposed might require some emscripten-side changes. Since we assume we can pass args to main from JS and due to the JS calling conventions that works for 0-arg and 2-arg main both.

If we can move emscripten over to always use _start then this problem would go away. Do you know of any reasons why that might be a bad idea @kripken ?

kripken commented 4 years ago

@sbc100 Emscripten changing from main to _start for the export to be called might be disruptive for existing users, that call main themselves directly. (Also, I do think main is the better name!) So I'd prefer to not do that, but it looks like this PR isn't?

The current wording looks mostly reasonable to me, but I don't know the linking process well enough to say if there is any downside to this over the current __original_main. @sbc100 do you think it would have problems with our standalone support in emscripten?

sbc100 commented 4 years ago

To support the status quo in emscripten we would need way to export a symbol called main which takes args. I sounds like under this proposal --export=main would result in the zero arg main being exported.

sbc100 commented 4 years ago

I'm not a fan of the current __original_main mangling or the way its found its way into our crt1.c code in both wasi and emscripten, so if we can move away from that I would be happy.

sunfishcode commented 4 years ago

As a possible path forward, I've now updated the clang patch which implements this: https://reviews.llvm.org/D70700 so that it checks the target, and doesn't do the mangling on Emscripten. That way, Emscripten is unaffected by this change.

sunfishcode commented 4 years ago

@sbc100 @kripken Any update here? I need some solution to this problem in order to enable LTO builds of libc, and right now, compatibility with Emscripten appears to be the only thing holding this back.

kripken commented 4 years ago

I don't quite understand what that patch does (not familiar with that part of clang), but if it doesn't affect emscripten then I have no objection to it myself. (That seems separate from the discussion in this PR, but I don't think we have a clear answer here?)

sunfishcode commented 4 years ago

In the LLVM patch, this change is disabled when the target triple OS is Emscripten. Also, @sbc100's comments above suggest there might be a reasonable path for implementing this in Emscripten when it's ready as well. With that, are there any remaining concerns about landing this and proceeding with the changes to LLVM, wasi-libc?

sbc100 commented 4 years ago

I support this change, but I'd be sad if can't find a way to make this work in emscripten too. I fairly certain that we can but I've not invested any time in it yet. I can you give me a little more time to prototype. Maybe by EOD Wednesday? If don't find any fundamental issues by then we can go ahead and land?

sunfishcode commented 4 years ago

@sbc100 Sounds good to me.

sunfishcode commented 4 years ago

It's now EOD on Wednesday; is there an update?

sbc100 commented 4 years ago

Did you verify that the debug name in the dwarf section is correct? I couldn't see the functions names in the output of llvm-dwarfdump --all which means I'm probably doing something wrong.

I did confirm that the name section does have the mangled name in it, despite then name section supposedly containing the demangled names. This I believe is because the name section is generated by the linker by demanging the mangled names and I guess doesn't (yet) know how to demangle __main_argv_argv back into main.

sunfishcode commented 4 years ago

The llvm-dwarfdump --all output contains the name "main" as the DW_AT_name field for me (and "__main_argc_argv" in the DW_AT_linkage_name, as expected).

Confirmed about the name section; I propose the following patch, which fixes it:

diff --git a/lld/wasm/Symbols.cpp b/lld/wasm/Symbols.cpp
index 88e6f217c0d..0d4f7f648d5 100644
--- a/lld/wasm/Symbols.cpp
+++ b/lld/wasm/Symbols.cpp
@@ -29,6 +29,8 @@ std::string toString(const wasm::Symbol &sym) {
 }

 std::string maybeDemangleSymbol(StringRef name) {
+  if (name == "__main_argc_argv")
+    return "main";
   if (wasm::config->demangle)
     return demangleItanium(name);
   return std::string(name);

sbc100 commented 4 years ago

OK, I've played with the patch in emscripten and I think we have a way forward. I'm going ahead with this and the llvm patch.

I have provisional patch for binaryen and emscripten so we should be able remove the emscripten exclusion from the llvm patch (either before or after landing it).

WebAssembly / tool-conventions

Rename argc/argv `main` to `__main_argc_argv` #134