floooh / sokol-odin

Odin bindings for the sokol headers (https://github.com/floooh/sokol)
103 stars 17 forks source link

Examples segfault on Linux (most likely related to #by_ptr annotations) #4

Closed EriKWDev closed 1 year ago

EriKWDev commented 1 year ago

Seems like examples are no longer compiling due to missing trailing commas all over the place.

For example:

sokol-odin/build > odin version
odin version dev-2023-01:c949e404

sokol-odin/build > odin run ../examples/clear -debug
/home/erik/Documents/GitHub/sokol-odin/examples/clear/main.odin(53:40) Syntax Error: Missing ',' before newline in compound literal

once I fix the trailing commas I get a segfault:

sokol-odin/build > odin build ../examples/clear -debug
sokol-odin/build > ./clear.bin
zsh: segmentation fault (core dumped)  ./clear.bin

Running on

OS:   Fedora Linux 37 (Workstation Edition), Linux 6.1.5-200.fc37.x86_64
CPU:  11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz
floooh commented 1 year ago

Ok, PR for the commas is merged, the segfaults are concerning though. Only reason I can think of is that it's an ABI issue where function parameters aren't correctly passed from Odin to the C side (e.g. by value vs by pointer).

I'll need to look into this after I finished up my current cleanup work on sokol_audio.

gingerBill commented 1 year ago

Do you happen to know where it crashes? e.g. which call caused it?

thePHTest commented 1 year ago

I can't repro on Windows with latest sokol & Odin. Odin: dev-2023-01:c949e404 OS: Windows 11 Home Basic (version: 21H2), build 22000.1455

@EriKWDev Did you remember to re-run build_clibs_linux.sh after updating sokol-odin?

gingerBill commented 1 year ago

Did you also check to see if it worked without the -debug flag? If it's -debug related, that's nice to know.

floooh commented 1 year ago

On M1 Mac I also cannot reproduce the segfault (both with and without -debug). I tested with brew-installed Odin, both --HEAD odin version dev-2023-01:c949e404 and regular odin version dev-2023-01:86511d44.

I'll check on my Ubuntu laptop next...

floooh commented 1 year ago

@EriKWDev Did you remember to re-run build_clibs_linux.sh after updating sokol-odin?

Right, that could actually be the reason which is easy to overlook. I wonder if I can somehow make this more obvious or even automate. One solution would be commit the prebuilt libs into the git repo, but I don't want to go there yet :)

Ok that's not it.

floooh commented 1 year ago

@gingerBill I can reproduce the segfault in the produced executble on Linux with odin version dev-2023-01-nightly:86511d44 (haven't tested other versions yet), happens both with and without --debug (so it seems to be a Linux specific issue).

I'll see if I can find out more in a debugger...

floooh commented 1 year ago

Oh wow, something unrelated: I just randomly ran on my M1 Mac:

odin run ../examples/offscreen -debug

...and this crashed the compiler with:

 odin(77871,0x1f03b7a80) malloc: Heap corruption detected, free list is damaged at 0x600001793540
*** Incorrect guard value: 93550727410944
odin(77871,0x1f03b7a80) malloc: *** set a breakpoint in malloc_error_break to debug

...now that I try it repeatedly it happens fairly often (this is for version odin version dev-2023-01:86511d44 which I think is the 'stable' version installed via homebrew.

odin(78430,0x1f03b7a80) malloc: Heap corruption detected, free list is damaged at 0x6000000743f0
*** Incorrect guard value: 37697511310080
odin(78430,0x1f03b7a80) malloc: *** set a breakpoint in malloc_error_break to debug
odin(78359,0x1f03b7a80) malloc: Heap corruption detected, free list is damaged at 0x6000004b6550
*** Incorrect guard value: 125046091900160
odin(78359,0x1f03b7a80) malloc: *** set a breakpoint in malloc_error_break to debug

...ok, looks like I cannot reproduce this specific issue any longer in the --HEAD version (odin version dev-2023-01:c949e404)... back to that Linux topic :)

floooh commented 1 year ago

...deleted my last comment because it was entirely misleading :D

Ok, about that Linux crash...

The crash happens on the C side in a string copying function because it gets an invalid src pointer (its not a null pointer but some random value):

https://github.com/floooh/sokol/blob/b3aecb21825f3191a9a150bc0918fc8c7df7855c/sokol_app.h#L2870-L2889

...which is called from here, the corrupt pointer is _sapp.desc.window_title:

https://github.com/floooh/sokol/blob/b3aecb21825f3191a9a150bc0918fc8c7df7855c/sokol_app.h#L2947

...this window title string pointer is ultimately passed in here on the Odin side via a struct:

https://github.com/floooh/sokol-odin/blob/5fddee836fe3cb42afa2933a46310a71fa340420/examples/cube/main.odin#L135-L146

...hm ok... I think this is related to the struct being created inside the function call. If I change the main function like this:

main :: proc () {
    desc := sapp.Desc{
        ...
    };
    sapp.run(desc);
}

...then the sapp.run() call seems to work, but it's then crashing in another function where I'm using the same construct to build the struct inside a function call.

Happens only on Linux though!

floooh commented 1 year ago

The problem goes all the way back to when #by_ptr was introduced in dev-2022-07 (that's how far I went back checking older Odin versions for now... to go further back I would need to roll back the bindings to the commit before #by_ptr was introduced.

I guess I didn't check carefully enough on all platform when that change happened in the bindings.

sapp.run() is one of those function which now use the #by_ptr annotation and it looks like all other functions which use #by_ptr have a similar problem on Linux (at least when the struct value is initialized right in the function call).

Previously there was a generated Odin wrapper function which took a struct value as input, took the address of the value and passed that on to the C function which expects a pointer.

floooh commented 1 year ago

...hmm no, it's not quite as simple as "only happens when the struct is initialized inside the function call".

Even when changing one of the samples to move all the struct initializations out of the function call, there's still data corruption, it's just different places where this data corruption is caught (e.g. not resulting in a pointer access segfault, but an assert on the C side might trigger).

floooh commented 1 year ago

Going back to the generated Odin bindings from 15-Oct-2022 (this was one day before the #by_ptr annotation was introduced in the sokol-odin bindings) makes it work with the latest Odin compiler.

Instead of the #by_ptr annotations, the functions which expect a pointer on the C side but a value on the Odin side, the external C versions were described like this:

sapp_run :: proc(desc: ^Desc)  ---

...and then there was an Odin wrapper function which took care of the value-to-pointer conversion which looked like this:

run :: proc(desc: Desc)  {
    _desc := desc
    sapp_run(&_desc)
}
gingerBill commented 1 year ago

Well this a bizarre bug indeed. I'm very surprised that it is failing this way but should be easy enough to fix now that I know what the problem is.

floooh commented 1 year ago

Oki doki, I think that's also all I can provide from my side for now. I'll go back to my regular sokol things then :)

Thanks for looking into this!

EriKWDev commented 1 year ago

@thePHTest Yes, I rebuilt the c-libs. I also tried deleting the sokol-odin folder, re-cloned it and still got a segfault.

I also compiled the odin compiler with debug enabled (./build_odin.sh debug) and then ran the 'clear' example with gdb. After the segfault, I type bt full and got this:

Reading symbols from clear.bin...
(gdb) run

Starting program: /home/erik/Documents/GitHub/sokol-odin/build/clear.bin
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".

Program received signal SIGSEGV, Segmentation fault.
0x000000000044b336 in _sapp_strcpy (src=0x12c00000190 <error: Cannot access memory at address 0x12c00000190>, dst=0x471ac8 <_sapp+3784> "", max_len=128) at c/sokol_app.h:2875
2875            c = *src;

(gdb) bt full

#0  0x000000000044b336 in _sapp_strcpy (src=0x12c00000190 <error: Cannot access memory at address 0x12c00000190>, dst=0x471ac8 <_sapp+3784> "", max_len=128) at c/sokol_app.h:2875
        i = 0
        __PRETTY_FUNCTION__ = "_sapp_strcpy"
        end = 0x471b47 <_sapp+3911> ""
        c = 0 '\000'
#1  0x000000000044b7cb in _sapp_init_state (desc=0x7fffffffd810) at c/sokol_app.h:2947
        __PRETTY_FUNCTION__ = "_sapp_init_state"
#2  0x000000000044fe20 in _sapp_linux_run (desc=0x7fffffffd810) at c/sokol_app.h:11760
        pthread_attr = {__size = '\000' <repeats 17 times>, "\020", '\000' <repeats 37 times>, __align = 0}
        visual = 0x0
        depth = 0
#3  0x0000000000450090 in sapp_run (desc=0x7fffffffd810) at c/sokol_app.h:11856
        __PRETTY_FUNCTION__ = "sapp_run"
#4  0x000000000040d646 in main.main () at main.odin:46
        context = <optimized out>

@gingerBill I tried compiling the application without -debug and it still segfaults. I have also tried compiling the Odin compiler itself both in debug and release mode, and I get a segault when running the sokol example either way.


@floooh I tried your "fix" with creating a separate desc variable. It still crashes like you said later due to some assertion:

from examples/clear/main.odin

main :: proc() {
    desc := sapp.Desc {
        init_cb = init,
        frame_cb = frame,
        cleanup_cb = cleanup,
        width = 400,
        height = 300,
        window_title = "clear",
        icon = { sokol_default = true },
    } 

    sapp.run(desc);
}
/clear.bin
clear.bin: c/sokol_app.h:3016: _sapp_image_validate: Assertion `desc->width > 0' failed.
zsh: IOT instruction (core dumped)  ./clear.bin

If there is anything further I can do to help debug this please let me know. I am quite new to Odin, but I will try to checkout and older version of the Odin compiler and see if I can get the examples to run

floooh commented 1 year ago

Thanks @EriKWDev These are the same problem I'm seeing. And it seems related to the new #by_ptr bindings feature. I think there's not much more we can do at the moment (except going back to the older bindings style, but let's see if @gingerBill finds a fix first :)

EriKWDev commented 1 year ago

Just wanted to let you know that this is still a problem with latest odin version and sokol-odin, but I am now getting a different error:

> odin version
odin version dev-2023-02:16262800
> odin run ../examples/clear
*** stack smashing detected ***: terminated

Probably just a different symptom of the same issue though.

with odin run ../examples/clear -debug and gdb:

clear.bin: c/sokol_gfx.h:15687: sg_setup: Assertion `(desc->_start_canary == 0) && (desc->_end_canary == 0)' failed.
floooh commented 1 year ago

Hmm ok, thanks for the update. I was just tinkering with the bindings yesterday again to integrate the new logging (but this was on macOS, where the problem doesn't exist).

If it doesn't look like the problem can be solved on the Odin side, it probably makes sense to go back to the old method which didn't use #by_ptr... I'll try to make some time experimenting with this over the next days.

thePHTest commented 1 year ago

I can take a look over the next few days to fix the ABI with #by_ptr on Linux.

thePHTest commented 1 year ago

Made some progress today and am discussing it in the #bugs_discussion channel on the Odin discord. I should have an update in another day or so.

thePHTest commented 1 year ago

@EriKWDev This should be fixed now if you use the latest Odin commit.

EriKWDev commented 1 year ago

Can confirm that it now works for me with the latest Odin compiler commit:

> odin version
odin version dev-2023-02:233f47cc

Great work!!

floooh commented 1 year ago

Agreed, great work @thePHTest!