HaxeFoundation / neko

The Neko Virtual Machine
https://nekovm.org

Neko thread usage causes seg faults during global free #281

Open tobil4sk opened 1 year ago

tobil4sk commented 1 year ago

Ever since haxelib was updated to use threads on Neko, it has been segfaulting randomly in GitHub Actions. For example:

Command: haxelib [git,utest,https://github.com/haxe-utest/utest,master,--always]
Installing utest from https://github.com/haxe-utest/utest branch: master
Library utest current version is now git
Command exited with 139 in 1s: haxelib [git,utest,https://github.com/haxe-utest/utest,master,--always]
Segmentation fault (core dumped)

I haven't been able to reproduce this on any local system, but some troubleshooting showed that the segfault occurs after the main function has completed, at some point after this call but before the program exits: https://github.com/HaxeFoundation/neko/blob/master/vm/main.c#L342.

I managed to download the core dump and load it, and it says that the seg fault comes from line 46 here: https://github.com/HaxeFoundation/neko/blob/9076cfa9dfd517da128a54fcabee5abe4129790b/vm/callback.c#L44-L48

I later added a printf here and confirmed that during the segfault, vm is a null pointer. Perhaps there is a finaliser that is getting called after the main function has already finished or something?

Full backtrace

```
Core was generated by `haxelib git utest https://github.com/haxe-utest/utest master --always'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007f30d8f14ef6 in neko_val_callEx (vthis=0x7f30d782a000, f=0x7f30d8b4d8a0, args=0x7f30d5e3d7f8, nargs=1, exc=0x0) at /src/vm/callback.c:46
46      /src/vm/callback.c: Bad file descriptor.
[Current thread is 1 (LWP 2473)]
(gdb) bt
#0  0x00007f30d8f14ef6 in neko_val_callEx (vthis=0x7f30d782a000, f=0x7f30d8b4d8a0, args=0x7f30d5e3d7f8, nargs=1, exc=0x0) at /src/vm/callback.c:46
#1  0x00007f30d8f17818 in neko_interp_loop (vm=0x7f30d77e61c0, m=0x7f30d8b4cea0, _acc=139847740806880, _pc=0x7f30d77109b8) at /src/vm/interp.c:708
#2  0x00007f30d8f20e24 in neko_interp (vm=0x7f30d77e61c0, _m=0x7f30d8b4cea0, acc=139847740806880, pc=0x7f30d77109b8) at /src/vm/interp.c:1214
#3  0x00007f30d8f15511 in neko_val_callEx (vthis=0x7f30d914f870 , f=0x7f30d6e9b360, args=0x7f30d8b490f8, nargs=1, exc=0x7f30d5e3dd20) at /src/vm/callback.c:117
#4  0x00007f30d7909af1 in thread_loop (_p=0x7f30d8b490f0) at /src/libs/std/thread.c:237
#5  0x00007f30d8f26456 in ThreadMain (_p=0x7ffd92492990) at /src/vm/threads.c:122
#6  0x00007f30d8f41678 in GC_inner_start_routine () from fs/usr/local/lib/libneko.so.2
#7  0x00007f30d8f3558a in GC_call_with_stack_base () from fs/usr/local/lib/libneko.so.2
#8  0x00007f30d8f3b144 in GC_start_routine () from fs/usr/local/lib/libneko.so.2
#9  0x00007f30d8ed2609 in pwd_traced_file () from fs/lib/x86_64-linux-gnu/libc.so.6
#10 0x0000000000000000 in ?? ()
(gdb) bt full
#0  0x00007f30d8f14ef6 in neko_val_callEx (vthis=0x7f30d782a000, f=0x7f30d8b4d8a0, args=0x7f30d5e3d7f8, nargs=1, exc=0x0) at /src/vm/callback.c:46
        vm = 0x0
        old_this = 0x0
        old_env = 0x0
        ret = 0x0
        oldjmp = {{__jmpbuf = {0, 0, 0, 0, 139845314828357, 139847775009936, 7883446016, 16}, __mask_was_saved = -706488560, __saved_mask = {__val = {1, 139847723636592, 139847774906397, 1, 139847770864720, 17450007603122798595, 139847750572680, 139847723636592, 38654705672, 17450007603122798600, 139847750525952, 139847723636592, 139847774911661, 17450007606711277424, 139847750819840, 139847750572672}}}}
#1  0x00007f30d8f17818 in neko_interp_loop (vm=0x7f30d77e61c0, m=0x7f30d8b4cea0, _acc=139847740806880, _pc=0x7f30d77109b8) at /src/vm/interp.c:708
        _o = 0x7f30d782a000
        _arg = 0x1
        _f = 0x7f30d8b4d8a0
        acc = 1
        pc = 0x7f30d76efe28
        instructions = {0x7f30d8f170c2 , 0x7f30d8f170dc , 0x7f30d8f170f5 , 0x7f30d8f1710e ,
          0x7f30d8f17128 , 0x7f30d8f17188 , 0x7f30d8f171ab , 0x7f30d8f171c7 ,
          0x7f30d8f172d0 , 0x7f30d8f175b4 , 0x7f30d8f18081 , 0x7f30d8f18417 ,
          0x7f30d8f18430 , 0x7f30d8f18453 , 0x7f30d8f1846f , 0x7f30d8f18578 ,
          0x7f30d8f18791 , 0x7f30d8f18b88 , 0x7f30d8f18f21 , 0x7f30d8f18f3e ,
          0x7f30d8f18f9e , 0x7f30d8f19dc2 , 0x7f30d8f1a804 , 0x7f30d8f1b24f ,
          0x7f30d8f1b264 , 0x7f30d8f1b28e , 0x7f30d8f1b2b8 , 0x7f30d8f1b3c7 ,
          0x7f30d8f1b4f6 , 0x7f30d8f1b5a2 , 0x7f30d8f1b716 , 0x7f30d8f1b847 ,
          0x7f30d8f1b8df , 0x7f30d8f1b916 , 0x7f30d8f1b94d , 0x7f30d8f1c72d ,
          0x7f30d8f1d4d2 , 0x7f30d8f1e269 , 0x7f30d8f1e822 , 0x7f30d8f1f6d2 ,
          0x7f30d8f1f910 , 0x7f30d8f1fb4e , 0x7f30d8f1fd92 , 0x7f30d8f1ffb8 ,
          0x7f30d8f201de , 0x7f30d8f20404 , 0x7f30d8f20487 , 0x7f30d8f20603 ,
          0x7f30d8f20686 , 0x7f30d8f204fd , 0x7f30d8f20580 , 0x7f30d8f1b893 ,
          0x7f30d8f20709 ,
--Type for more, q to quit, c to continue without paging--c
          0x7f30d8f20743 , 0x7f30d8f20808 , 0x7f30d8f20911 , 0x7f30d8f20943 ,
          0x7f30d8f18fe0 , 0x7f30d8f17161 , 0x7f30d8f17174 , 0x7f30d8f179a7 ,
          0x7f30d8f17d10 , 0x7f30d8f207c1 , 0x7f30d8f1929a , 0x7f30d8f20980 ,
          0x7f30d8f1b7a2 , 0x7f30d8f1713e , 0x7f30d8f2098f }
        sp = 0x7f30d6eab7a8
        csp = 0x7f30d6eab058
#2  0x00007f30d8f20e24 in neko_interp (vm=0x7f30d77e61c0, _m=0x7f30d8b4cea0, acc=139847740806880, pc=0x7f30d77109b8) at /src/vm/interp.c:1214
        sp = 0x7f30d6eab768
        csp = 0x7f30d6eab078
        trap = 0x7f30d6eab738
        init_sp = 7
        m = 0x7f30d8b4cea0
        old = {{__jmpbuf = {0, 4064061087093578727, 140727057721422, 140727057721423, 140727057721680, 139847723638720, 4064061087267642343, 4064050217118686183}, __mask_was_saved = 0, __saved_mask = {__val = {0 }}}}
#3  0x00007f30d8f15511 in neko_val_callEx (vthis=0x7f30d914f870 , f=0x7f30d6e9b360, args=0x7f30d8b490f8, nargs=1, exc=0x7f30d5e3dd20) at /src/vm/callback.c:117
        n = 1
        vm = 0x7f30d77e61c0
        old_this = 0x7f30d914f870
        old_env = 0x7f30d914eee0
        ret = 0x7f30d914f870
        oldjmp = {{__jmpbuf = {0, 0, 0, 0, 0, 0, 0, 0}, __mask_was_saved = 0, __saved_mask = {__val = {0 }}}}
#4  0x00007f30d7909af1 in thread_loop (_p=0x7f30d8b490f0) at /src/libs/std/thread.c:237
        p = 0x7f30d8b490f0
        exc = 0x0
#5  0x00007f30d8f26456 in ThreadMain (_p=0x7ffd92492990) at /src/vm/threads.c:122
        lp = 0x7ffd92492990
        p = {init = 0x7f30d7909a1b , main = 0x7f30d7909a99 , param = 0x7f30d8b490f0, lock = {__data = {__lock = 2, __count = 0, __owner = 2429, __nusers = 1, __kind = 0, __spins = 0, __elision = 0, __list = {__prev = 0x0, __next = 0x0}}, __size = "\002\000\000\000\000\000\000\000}\t\000\000\001", '\000' , __align = 2}}
#6  0x00007f30d8f41678 in GC_inner_start_routine () from fs/usr/local/lib/libneko.so.2
No symbol table info available.
#7  0x00007f30d8f3558a in GC_call_with_stack_base () from fs/usr/local/lib/libneko.so.2
No symbol table info available.
#8  0x00007f30d8f3b144 in GC_start_routine () from fs/usr/local/lib/libneko.so.2
No symbol table info available.
#9  0x00007f30d8ed2609 in pwd_traced_file () from fs/lib/x86_64-linux-gnu/libc.so.6
No symbol table info available.
#10 0x0000000000000000 in ?? ()
No symbol table info available.
```

Here is the code in haxelib that uses threads: https://github.com/HaxeFoundation/haxelib/blob/4.1.x/src/haxelib/client/Vcs.hx#L162-L177

tobil4sk commented 1 year ago

We just had a similar crash on Windows, so it looks like this is not specific to Linux:

Command: haxelib [git,utest,https://github.com/haxe-utest/utest,master,--always]
Installing utest from https://github.com/haxe-utest/utest branch: master
Library utest current version is now git
Command exited with -1073741819 in 3s: haxelib [git,utest,https://github.com/haxe-utest/utest,master,--always]

-1073741819 is equivalent to 0xC0000005, which is STATUS_ACCESS_VIOLATION: https://learn.microsoft.com/en-us/openspecs/windows_protocols/ms-erref/596a1078-e883-4972-9bbc-49e60bebca55

tobil4sk commented 2 weeks ago

This sample seems to reproduce the segfault some of the time, at least on my Windows machine:

function main() {
    final streamsLock = new sys.thread.Lock();

    sys.thread.Thread.create(function() {
        Sys.sleep(0.2);
        streamsLock.release();
    });

    sys.thread.Thread.create(function() {
        Sys.sleep(0.2);
        streamsLock.release();
    });

    streamsLock.wait();
    streamsLock.wait();
}