[LLVM backend regression] RuntimeError: unreachable

buu700 commented 5 years ago

I just had to rebuild sidh.js with latest-fastcomp to work around an error caused by latest-upstream.

With the following test case:

git clone https://github.com/cyph/sidh.js
cd sidh.js
make
node -e "require('./dist/sidh').keyPair().then(kp => console.log(kp))"

The make succeeds, but then I get the following unexpected run-time error:

RuntimeError: unreachable
RuntimeError: unreachable

/cyph/dist/sidh.js:1
var sidh=function(){var A,I,g,B,C={};function Q(A,I){if(0===A)return I;throw new Error("SIDH error: "+A)}function i(A,I){return new Uint8Array(new Uint8Array(C.HEAPU8.buffer,A,I))}function E(A){try{C._free(A)}catch(A){setTimeout((function(){throw A}),0)}}C.ready=new Promise((function(A,I){(B={}).onAbort=I,B.onRuntimeInitialized=function(){try{B._sidhjs_public_key_bytes(),A(B)}catch(A){I(A)}};var g,B=void 0!==B?B:{},C={};for(g in B)B.hasOwnProperty(g)&&(C[g]=B[g]);var Q=[],i=!1,E=!1,o=!1,a=!1,G=!1;i="object"==typeof window,E="function"==typeof importScripts,a="object"==typeof process&&"object"==typeof process.versions&&"string"==typeof process.versions.node,o=a&&!i&&!E,G=!i&&!o&&!E;var n,r,c,t,D="";function e(A){return B.locateFile?B.locateFile(A,D):D+A}o?(D=__dirname+"/",n=function(A,I){var g;return(g=YA(A))||(c||(c=eval("require")("fs")),t||(t=eval("require")("path")),A=t.normalize(A),g=c.readFileSync(A)),I?g:g.toString()},r=function(A){var I=n(A,!0);return I.buffer||(I=new Uint8Array(
abort(RuntimeError: unreachable). Build with -s ASSERTIONS=1 for more info.

This problem goes away after running emsdk install latest-fastcomp ; emsdk activate latest-fastcomp ; make.

kripken commented 5 years ago

A full stack trace (in a build with names, so e.g. --profiling) might help. Also good would be to bisect to find where this broke.

TrevorSundberg commented 4 years ago

This is potentially unrelated, however I also received the RuntimeError: unreachable and it turned out it was because a missing return statement when the function returned a bool. Hope this helps some poor soul in the future. Also don't ignore warnings!

davlhd commented 4 years ago

I am experiencing the same problem. I wanted to switch to the new backend but it tells me

011245ce-331:8 Uncaught (in promise) RuntimeError: unreachable. at wasm-function[331]:0x13538 at wasm-function[11511]:0x2356e3 at n.dynCall_iiiiiii (https://xxx35:156437) at invoke_iiiiiii (https://xxs:35:102325) at wasm-function[14939]:0x3ae4f3 at wasm-function[11508]:0x235501 at n.dynCall_iiiiiiiii (https://xxxs:35:156969) at invoke_iiiiiiiii (https://xxx:35:102612) at wasm-function[14865]:0x3ab14e

kripken commented 4 years ago

@davlhd building with --profiling will make the stack trace have function names.

Once you find the function this happens in, it might be the same issue @TrevorSundberg said - LLVM will emit unreachables in places it thinks control flow can't reach, so if you forget a return, stuff like that.

davlhd commented 4 years ago

That did no really help:

00ceb43a:formatted:403 Uncaught (in promise) RuntimeError: unreachable
    at <anonymous>:wasm-function[285]:0xf851
    at dynCall_iiiiiii (<anonymous>:wasm-function[7519]:0x170aa2)
    at i.dynCall_iiiiiii (https://xxxx:35:140187)
    at invoke_iiiiiii (https://xxxx35:116645)
    at <anonymous>:wasm-function[9936]:0x2be4fd
    at dynCall_iiiiiiiii (<anonymous>:wasm-function[7517]:0x170a78)
    at i.dynCall_iiiiiiiii (https://xxxx:35:140719)
    at invoke_iiiiiiiii (https://xxxx:35:116932)
    at <anonymous>:wasm-function[9087]:0x24ae07
    at dynCall_iiiiii (<anonymous>:wasm-function[7520]:0x170ab4)

kripken commented 4 years ago

@davlhd is there a chance the --profiling flag was not applied at the link stage? Or that that's an old build? (It would be a very serious unknown bug if that didn't work.)

davlhd commented 4 years ago

With "latest" this error disappeared. It now compiles and runs fine

DoDoENT commented 3 years ago

It appears that this has reappeared with emscripten 2.0.15. With adding --profiling to the link flags, I got the following stack trace in my project (the stack trace points to the Normalize function within google test:

SweatShopTest.wasm:0x187f9 Uncaught RuntimeError: unreachable
    at testing::internal::FilePath::Normalize() (http://localhost:6931/SweatShopTest.wasm:wasm-function[683]:0x187f9)
    at testing::internal::FilePath::FilePath(std::__2::basic_string<char, std::__2::char_traits<char>, std::__2::allocator<char> > const&) (http://localhost:6931/SweatShopTest.wasm:wasm-function[86]:0x25b3)
    at testing::internal::FilePath::GetCurrentDir() (http://localhost:6931/SweatShopTest.wasm:wasm-function[684]:0x18843)
    at testing::internal::UnitTestImpl::AddTestInfo(void (*)(), void (*)(), testing::TestInfo*) (http://localhost:6931/SweatShopTest.wasm:wasm-function[619]:0x1653f)
    at testing::internal::MakeAndRegisterTestInfo(char const*, char const*, char const*, char const*, testing::internal::CodeLocation, void const*, void (*)(), void (*)(), testing::internal::TestFactoryBase*) (http://localhost:6931/SweatShopTest.wasm:wasm-function[103]:0x2cad)
    at __cxx_global_var_init (http://localhost:6931/SweatShopTest.wasm:wasm-function[1158]:0x2b58f)
    at __wasm_call_ctors (http://localhost:6931/SweatShopTest.wasm:wasm-function[1117]:0x29973)
    at Module.___wasm_call_ctors (http://localhost:6931/SweatShopTest.js:3868:82)
    at func (http://localhost:6931/SweatShopTest.js:378:3)
    at callRuntimeCallbacks (http://localhost:6931/SweatShopTest.js:622:4)

Downgrading to emscripten 2.0.14 resolves the issue. Also, if I compile with -g (to emit debug symbols) or with threads enabled, the problem also goes away 🤔.

Any clue what could possibly go wrong?

kripken commented 3 years ago

Hard to say. Quickest diagnosis is probably bisection, https://emscripten.org/docs/contributing/developers_guide.html#bisecting

DoDoENT commented 3 years ago

I can't do the bisection as my machines are too slow for compiling the LLVM, however, I've managed to create a reproducible case with only open-source components - it's available in this repository.

To reproduce the issue:

clone this repository
initialize all its submodules with git submodule update --init
initialize your shell with emscripten 2.0.15
run emcmake cmake -GNinja /path/to/the/repository
build with ninja
run emrun --browser chrome ./SweatShop.html
application crashes with Uncaught RuntimeError: unreachable

If you do the same with emscripten 2.0.14, the application will work correctly.

Also, if you replace -flto with -flto=thin in the CMakeLists.txt, the application will also work with 2.0.15 (but I need the full LTO for my project).

Unlike in my project, adding -g here does not make the issue go away. Instead, it prints a nice stack trace:

Uncaught RuntimeError: unreachable
    at operator delete(void*) (http://localhost:6931/SweatShopTest.wasm:wasm-function[83]:0x46d1)
    at std::__2::_DeallocateCaller::__do_call(void*) (http://localhost:6931/SweatShopTest.wasm:wasm-function[140]:0x5565)
    at std::__2::_DeallocateCaller::__do_deallocate_handle_size(void*, unsigned long) (http://localhost:6931/SweatShopTest.wasm:wasm-function[138]:0x5516)
    at std::__2::_DeallocateCaller::__do_deallocate_handle_size_align(void*, unsigned long, unsigned long) (http://localhost:6931/SweatShopTest.wasm:wasm-function[135]:0x5497)
    at std::__2::__libcpp_deallocate(void*, unsigned long, unsigned long) (http://localhost:6931/SweatShopTest.wasm:wasm-function[134]:0x542f)
    at std::__2::allocator<char>::deallocate(char*, unsigned long) (http://localhost:6931/SweatShopTest.wasm:wasm-function[1167]:0x15b56)
    at std::__2::allocator_traits<std::__2::allocator<char> >::deallocate(std::__2::allocator<char>&, char*, unsigned long) (http://localhost:6931/SweatShopTest.wasm:wasm-function[1165]:0x15aea)
    at std::__2::basic_string<char, std::__2::char_traits<char>, std::__2::allocator<char> >::~basic_string() (http://localhost:6931/SweatShopTest.wasm:wasm-function[72]:0x44c0)
    at testing::internal::CodeLocation::~CodeLocation() (http://localhost:6931/SweatShopTest.wasm:wasm-function[71]:0x4498)
    at testing::internal::MakeAndRegisterTestInfo(char const*, char const*, char const*, char const*, testing::internal::CodeLocation, void const*, void (*)(), void (*)(), testing::internal::TestFactoryBase*) (http://localhost:6931/SweatShopTest.wasm:wasm-function[70]:0x4442)

I hope this helps you with tracking down this bug.

kripken commented 3 years ago

@DoDoENT The instructions in that link do not require compiling. It downloads precompiled builds using the emsdk.

Sorry if they are not clear enough about that - we should improve them. @sbc100 also has had some recent improvements to the process which we meant to document (the emsdk now has a flag to install a build by the hash, which is slightly easier than editing the text file) but I don't think we have yet.

sbc100 commented 3 years ago

You might also want to try running with asan and/or ubsan.

DirtyHairy commented 3 years ago

Count me in. If I compile https://github.com/cloudpilot-emu/cloudpilot with 2.0.15 I get the follow trace at runtime:

    at std::__2::__tree<std::__2::__value_type<ChunkType, Chunk>, std::__2::__map_value_compare<ChunkType, std::__2::__value_type<ChunkType, Chunk>, std::__2::less<ChunkType>, true>, std::__2::allocator<std::__2::__value_type<ChunkType, Chunk> > >::destroy(std::__2::__tree_node<std::__2::__value_type<ChunkType, Chunk>, void*>*) (:4200/cloudpilot_web.wasm:wasm-function[2176]:0x611d1)
    at std::__2::__tree<std::__2::__value_type<ChunkType, Chunk>, std::__2::__map_value_compare<ChunkType, std::__2::__value_type<ChunkType, Chunk>, std::__2::less<ChunkType>, true>, std::__2::allocator<std::__2::__value_type<ChunkType, Chunk> > >::destroy(std::__2::__tree_node<std::__2::__value_type<ChunkType, Chunk>, void*>*) (:4200/cloudpilot_web.wasm:wasm-function[2176]:0x611c6)
    at emscripten_bind_Cloudpilot_LoadState_2 (:4200/cloudpilot_web.wasm:wasm-function[2365]:0x75cbc)
    at push.O5fm.Module._emscripten_bind_Cloudpilot_LoadState_2 (cloudpilot_web.js:4654)
    at Cloudpilot.push.O5fm.Cloudpilot.LoadState.Cloudpilot.LoadState (cloudpilot_web.js:5394)
    at Cloudpilot.loadState (Cloudpilot.ts:265)

The same code runs fine with 2.0.14 and also natively with ASAN and UBSAN enabled. It is worth mentioning that issue only appears if I build with -flto, without LTO the code runs fine. Seems like LTO erroneously removes code generated for std::map and replaces it with unreachable. Unfortunately the readme is not up to date, but I can provide instructions for reproducing the issue.

Coincidentally (?) I also get link time errors on 2.0.15 with LLD_REPORT_UNDEFINED:

wasm-ld: error: symbol exported via --export not found: __start_em_asm
wasm-ld: error: symbol exported via --export not found: __stop_em_asm

I don't have time for bisecting now, but I can do so later this week.

sbc100 commented 3 years ago

Thanks for the report @DirtyHairy. The LTO issue does look concerning, and we will probably need to bisect it down. I imagine its an llvm-side change.

The symbol exported via --export not found: __start_em_asm issue with LLD_REPORT_UNDEFINED I am aware of and thinking about how to work around that.

sbc100 commented 3 years ago

opened https://github.com/emscripten-core/emscripten/issues/13736 for the LLD_REPORT_UNDEFINED issue.

DirtyHairy commented 3 years ago

Thanks alot @sbc100 ! I'll try to find time to bisect the issue over the next few days

DirtyHairy commented 3 years ago

I have bisected the issue; the problematic commit to emscripten-releases is 7882b2e11c66da69b832a6ad7baa57c8ccbf9656

Indeed this is a LLVM change; unfortunately, the commit pulls in sizable number of changesets. I hope this helps nevertheless.

kripken commented 3 years ago

Thanks! Yes, looks like there are 94 LLVM commits there:

https://chromium.googlesource.com/emscripten-releases/+/7882b2e11c66da69b832a6ad7baa57c8ccbf9656

That will require someone to bisect locally on LLVM I think, unless one of the commits there looks relevant to the LLVM devs here.

DirtyHairy commented 3 years ago

Well, this might take a few days, but I am not using my Macbook much during the day atm, so I think I can take the ring to mount doom 😏

sbc100 commented 3 years ago

As well as bisecting to find the culprit commit, it might also be useful to reduce the repro case to something smaller that might be give us more clues.

DirtyHairy commented 3 years ago

As well as bisecting to find the culprit commit, it might also be useful to reduce the repro case to something smaller that might be give us more clues.

The affected code is pretty much isolated, so I think this might not be too hard. I'll take a stab at it after bisecting.

DirtyHairy commented 3 years ago

Well, that went down a lot faster than I thought. The faulty commit is:

https://github.com/llvm/llvm-project/commit/86664638898e9c3756ad17d612de1873fead6813

Seems like we indeed need a reduced testcase. I'll see what I can do.

DirtyHairy commented 3 years ago

I've cooked up a reduced testcase. You can find the code here: https://github.com/DirtyHairy/emcc-lto-testcase

Simply build with make (with 2.0.15) and run node testcase.js in order to trigger the bug:

pestix@bladehenge-2:~/git/emcc-lto-testcase$ node testcase.js
exception thrown: RuntimeError: unreachable,RuntimeError: unreachable
    at std::__2::__tree<std::__2::__value_type<ChunkType, ChunkProbe>, std::__2::__map_value_compare<ChunkType, std::__2::__value_type<ChunkType, ChunkProbe>, std::__2::less<ChunkType>, true>, std::__2::allocator<std::__2::__value_type<ChunkType, ChunkProbe> > >::destroy(std::__2::__tree_node<std::__2::__value_type<ChunkType, ChunkProbe>, void*>*) (<anonymous>:wasm-function[10]:0x8df)
    at __original_main (<anonymous>:wasm-function[8]:0x864)
    at main (<anonymous>:wasm-function[20]:0x278b)
    at Module._main (/Users/pestix/git/emcc-lto-testcase/testcase.js:1555:60)
    at callMain (/Users/pestix/git/emcc-lto-testcase/testcase.js:1615:15)
    at doRun (/Users/pestix/git/emcc-lto-testcase/testcase.js:1675:23)
    at run (/Users/pestix/git/emcc-lto-testcase/testcase.js:1690:5)
    at runCaller (/Users/pestix/git/emcc-lto-testcase/testcase.js:1602:19)
    at removeRunDependency (/Users/pestix/git/emcc-lto-testcase/testcase.js:1225:7)
    at receiveInstance (/Users/pestix/git/emcc-lto-testcase/testcase.js:1365:5)

The code runs fine with 2.0.14 or if -flto is omitted during link.

From the trace I suspect that the deleted code is in the destructor of the std::map<ChunkType, ChunkProbe> allocated in SavestateProbe.cxx (an instance of which is allocated on the stack in Savestate::AllocateBuffer.

DirtyHairy commented 3 years ago

Is there anything else that I can do to help addressing this issue?

kripken commented 3 years ago

@DirtyHairy Please file an LLVM bug (and link to it here). As the commit you reduced to says, it sounds like they expected fallout from it. Attaching your testcase will hopefully be enough for them to investigate and/or decide whether to revert it.

kripken commented 3 years ago

(Actually maybe it's faster to cc the author of that commit directly?)

@preames , the last few comments have a reduced testcase and some investigation of a regression that was bisected down to your commit llvm/llvm-project@8666463

preames commented 3 years ago

Happy to try and help here, but I really don't have much context on emscripten or node. So working from a test case which involves those pieces would be a bit challenging. (For instance, it's not clear to me whether your reduced test case was using ToT clang for the make step, or some other compiler?)

If you want to help narrow this down, what would be really helpful is knowing what the IR looks like at the point the nofree transform actually happens. A simple way to get that would be to do a local build of clang w/the transform replaced with a bit of code which simply printed the current module and then crashed. (If it triggers multiple times, you might need to log and continue.)

Once we had that, we could manually inspect the IR to see if the transform is legal. If it isn't, we revert my patch. If it is, we then have something to trace back through the print-after-all output to figure out where the wrong inference fact came from.

As an aside, I just ran across an issue today (see the llvm-dev thread "Ambiguity in the nofree function attribute") around the semantics of the nofree function attribute. If you were using the allocsize attribute, then maybe that might explain a miscompile. That's really a long shot guess though, nothing more.

kripken commented 3 years ago

Thanks for the quick response @preames !

Maybe another possibility is to convert the testcase to one that runs in ToT clang? It might be as simple as replacing emcc with ToT clang, and removing the .js suffix from the output.

I tried to do that quickly now, and I'm hitting an unrelated LLVM configuration issue I can't immediately figure out,

/usr/bin/ld: ../lib/LLVMgold.so: error loading plugin: ../lib/LLVMgold.so: cannot open shared object file: No such file or directory

But if you have a working local build of LLVM ToT then it might just work for you.

DirtyHairy commented 3 years ago

Thanks for the response, @preames / @kripken !

@preames I am not sure whether I understand top-of-tree correctly. Fwiw, I can reproduce the issue with the latest revision of LLVM main, including clang.

I have just checked and cannot reproduce the bug if I build native binaries. However, as even with emscripten the bug only appears if I enable LTO at link I think that the bug may be tied to a transformation that happens during LTO when linking the emscripten libc.

It seems that emscripten links to a libc that has been built with LTO enabled if I enable LTO during link? If this is the case, then this might be the decisive difference between emscripten and the native build (I don't think this happens with libc from the XCode SDK). Is there a way to disable this (use -flto objects, but not -flto libc) for testing purposes?

@preames If you can give me a few pointers of where to instrument which file I can try to produce the IR dumps you described.

DirtyHairy commented 3 years ago

As this is unsurprisingly still failing in 2.0.16 I have been playing a bit more with this in search of a workaround. It turns out that an example as simple as the following is enough to trigger the crash:

#include <map>

int main() {
    std::map<int, int> foo;

    foo[0] = 2;
}

pestix@bladehenge-2:~/git/emcc-lto-testcase$ emcc -v
emcc (Emscripten gcc/clang-like replacement + linker emulating GNU ld) 2.0.16
clang version 13.0.0 (/opt/s/w/ir/cache/git/chromium.googlesource.com-external-github.com-llvm-llvm--project ad8010e598d9aa3747c34ce28aa2ba6de1650bd4)
Target: wasm32-unknown-emscripten
Thread model: posix
InstalledDir: /Users/pestix/bin/emsdk/upstream/bin

pestix@bladehenge-2:~/git/emcc-lto-testcase$ emcc -o test.js -O3 -flto -g4 test.cpp
emcc:WARNING: Wasm source map won't be usable in a browser without --source-map-base

pestix@bladehenge-2:~/git/emcc-lto-testcase$ node test.js
exception thrown: RuntimeError: unreachable,RuntimeError: unreachable
    at std::__2::__tree<std::__2::__value_type<int, int>, std::__2::__map_value_compare<int, std::__2::__value_type<int, int>, std::__2::less<int>, true>, std::__2::allocator<std::__2::__value_type<int, int> > >::destroy(std::__2::__tree_node<std::__2::__value_type<int, int>, void*>*) (<anonymous>:wasm-function[8]:0x1c3)
    at __original_main (<anonymous>:wasm-function[6]:0x17d)
    at main (<anonymous>:wasm-function[13]:0x1a27)
    at Module._main (/Users/pestix/git/emcc-lto-testcase/test.js:1558:60)
    at callMain (/Users/pestix/git/emcc-lto-testcase/test.js:1618:15)
    at doRun (/Users/pestix/git/emcc-lto-testcase/test.js:1677:23)
    at run (/Users/pestix/git/emcc-lto-testcase/test.js:1692:5)
    at runCaller (/Users/pestix/git/emcc-lto-testcase/test.js:1605:19)
    at removeRunDependency (/Users/pestix/git/emcc-lto-testcase/test.js:1223:7)
    at receiveInstance (/Users/pestix/git/emcc-lto-testcase/test.js:1365:5)
pestix@bladehenge-2:~/git/emcc-lto-testcase$

The crash goes away if I switch to unordered_map which is acceptable as a workaround for my project. It also goes away if I go to -O0 or disable -flto.

DirtyHairy commented 3 years ago

Seems I was too optimistic. I still keep getting similar runtime errors in other places even after removing all instances of map.

DirtyHairy commented 3 years ago

I have played a bit more and have narrowed my second crash to the following testcase:

#include <iostream>

using namespace std;

struct Foo {
    virtual ~Foo() { cout << "destructor called" << endl << flush; }
};

int main() { delete (new Foo()); }

The corresponding output is

pestix@bladehenge:~/git/emcc-lto-testcase$ node test2.js
destructor called
exception thrown: RuntimeError: unreachable,RuntimeError: unreachable
    at Foo::~Foo() (<anonymous>:wasm-function[49]:0x3648)
    at __original_main (<anonymous>:wasm-function[47]:0x3608)
    at main (<anonymous>:wasm-function[1259]:0x1f40c)
    at Module._main (/Users/pestix/git/emcc-lto-testcase/test2.js:4523:60)
    at callMain (/Users/pestix/git/emcc-lto-testcase/test2.js:4613:15)
    at doRun (/Users/pestix/git/emcc-lto-testcase/test2.js:4672:23)
    at run (/Users/pestix/git/emcc-lto-testcase/test2.js:4687:5)
    at runCaller (/Users/pestix/git/emcc-lto-testcase/test2.js:4600:19)
    at removeRunDependency (/Users/pestix/git/emcc-lto-testcase/test2.js:1224:7)
    at receiveInstance (/Users/pestix/git/emcc-lto-testcase/test2.js:1366:5)

As above, the crash only happens with optimization level > 0 and LTO enabled. Furthermore, it only happens if the destructor is virtual, and if the instance is allocated on the heap. If I allocate the instance of Foo on the stack or remove the virtual the sample runs fine. Also, note that the string is written to STDOUT, so the deleted code must be called after the destructor itself has executed.

Again, I cannot reproduce the issue if I do a native build.

sbc100 commented 3 years ago

That certainly looks like some kind of codegen or optimization bug. Thanks for reducing the test case so much, I'm sure it will make debugging and fixing much easier.

sbc100 commented 3 years ago

Hmm.. I can't seem to repro:

$ emcc --version
emcc (Emscripten gcc/clang-like replacement + linker emulating GNU ld) 2.0.16 (52ed04f424ff0db7d8085c35df3afd7f7b1c9eb7)
Copyright (C) 2014 the Emscripten authors (see AUTHORS.txt)
This is free and open source software under the MIT license.
There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
$ emcc -flto -O3 test.cc && node ./a.out.js
destructor called
$

DirtyHairy commented 3 years ago

Yeah, sorry, my bad, I just noticed that the sample above runs fine with -O2 and -O3 (and crashes only with -O1). It seems that the allocation and deallocation is optimized out. The following crashes at all levels > 0:

#include <iostream>

using namespace std;

struct Foo {
    virtual ~Foo() { cout << "destructor called" << endl << flush; }
};

int main() {
    auto foo = new Foo();

    cout << &foo << endl << flush;

    delete foo;
}

kripken commented 3 years ago

I think the simplest testcase here that I can reproduce locally is this one:

// emcc -flto -O1
#include <map>
int main() {
  std::map<int, int> foo;
  foo[0] = 2;
}

It does seem like the cause is LTO with libc++, as that is shared in all the testcases.

I'd suggest making a native example of this for @preames by making a git repo with the cpp file plus a copy of libc++ from the emscripten tree (it's under system/lib/libcxx). Then all the libc++ inputs will be "in userspace", like for emcc, and no system c++ libraries will be linked. I'd expect the bug to reproduce that way.

DirtyHairy commented 3 years ago

Sorry, now I don't fully understand: how would I do a native build from this? I suppose I have to build libc++ first?

kripken commented 3 years ago

Yes, but it's very simple: you will likely need very few of the libc++ .cpp files (since it's almost all in headers). You may even be able to get away with not running any configure/cmake step, and just adding them.

In more detail, what you need to do is build with -nostdinc++, so that the system C++ standard library is not used, and add an -I for the place with the copy of libc++ so that it is used. And find the right .cpp files from that copy of libc++ and just include them as more inputs, alongside your .cpp file. That way all the C++ standard library support arrives "from userspace", and that means the compiler can LTO it. (Also, doing it this way will ensure you have the same C++ standard library as emcc does.)

Another option could be to copy code from the libc++ sources into your testcase (edit: and keep copying more until the bug reproduces).

DirtyHairy commented 3 years ago

Thanks 😏 I have put up a version of the testcase with a Makefile that does a native build in my previous testcase repo https://github.com/DirtyHairy/emcc-lto-testcase (I've moved the old testcase to a branch). Unfortunately, I cannot reproduce the crash with my LLVM build (the ToT build from last week that triggers the crash when used with emscripten).

Can you spot any error in my build process? To my understanding it should not include the system libc++ headers or link agains the system libc++

preames commented 3 years ago

Given how much this has been reduced, I might be able to make progress with something as simple as a -print-after-all log. If you want to capture that from your reproducer - I can even work with one not from clean upstream - I can see if anything obvious falls out.

DirtyHairy commented 3 years ago

Hi @preames !

I have attached a -print-after-all log of the failing std::map testcase (ir.bad.log) produced with emscripten 2.0.16 and a reference log(ir.good.log) with emscripten 2.0.14.

I can do builds from specific LLVM revisions if that helps you.

Thanks again for your support.

ir.bad.log ir.good.log

preames commented 3 years ago

Digging through the log, I noticed one case where we'd incorrectly annotated a nobuiltin function. This isn't directly related to the triggering change, but it's easy to believe that change exposed the existing issue. There might also be another latent bug here.

Can I ask for a rerun with commit 11707435ccb44a9377bfed407453e0646a159636 applied? If it still reproduces, another -print-after-all with the additional flag -print-module-scope added (primarily for the bad run) would be helpful.

DirtyHairy commented 3 years ago

Partial success 😏 With LLVM built from 11707435ccb44a9377bfed407453e0646a159636 the testcase now builds and runs fine with -O1. However, it still crashes at -O2 and -O3. I have attached two logs at -O2, one with -print-module-scope, and one without it. Beware though, the -print-module-scope one decompresses to ~800MB.

Sorry for the double compression, but the log is too large with gzip, and github will not let me upload bzip2 directly.

ir.bad.log.bz2.gz ir.bad.print-module-scope.log.bz2.gz

preames commented 3 years ago

Haven't had a chance to dig through the second set of logs yet, but I think I have a good guess as to the problem. It turns out to be a bug in my original patch w.r.t. interaction between null and the nofree specification. I've got a candidate fix up for review at https://reviews.llvm.org/D100779, and would appreciate knowing if that resolved the remaining problems here. (It may very well be we have a third issue to track down.)

DirtyHairy commented 3 years ago

I have checked, but 2218f5998b5b with the above patch applied still crashes (on the std::map). Fwiw there is some variance in behaviour between emscripten (with LLVM built from source) 2.0.16 and 2.0.14 --- with 2.0.16, the std::map testcase dies for all optimization levels while it runs fine with -O1. With the virtual destructor testcase, the crash reproduces with -O1 on 2.0.16 (and LLVM from source) --- 2.0.14 works fine, as does 2.0.16 with -O2 and -O3 😛

@preames Ping me if you need a more recent log.

preames commented 3 years ago

I think I'm going to need a more current log. I've dug through this one, and don't see any direct evidence of the transform triggering or any further obvious miscompiles.

Can I ask you to take two additional steps before sending a log? First, can you comment out the problematic transform entirely and just make sure everything passes? I'm getting suspicious that there might be multiple interacting issues here. Second, could you add code to print the function, callsite, and argument when the transform triggers? If needed, I can post a patch that does that. I think I need to see what this is actually triggering on to make progress here.

DirtyHairy commented 3 years ago

Hi @preames !

Sure; however, I think I need a few additional pointers to produce what you want. Do you have a specific transform in mind for me to disable, or do you want to go through them until we have identified the problematic one(s)? Where in LLVM should I look for the place to disable them?

preames commented 3 years ago

I have gone ahead and reverted the triggering change in commit 15e19a25 since we don't appear to be making rapid progress any more. At the current moment, I don't have any evidence the (fixed) change was actually buggy any more, but I also am out of time to chase down the remaining miscompile. I may return to this in the future, but wanted to not leave things broken in the meantime.

DirtyHairy commented 3 years ago

Thanks alot @preames . To be 100% sure I just did a test build; the testcases run fine again. Sorry we couldn't drill down into the underlying issue. If want to revisit this at some later point feel free to ping me.

paulocoutinhox commented 3 years ago

Hi,

I have the same problem here: https://github.com/paulo-coutinho/pdfium-lib/issues/54

There is any solution?

emscripten-core / emscripten

[LLVM backend regression] RuntimeError: unreachable #9443