Theldus / wsServer

wsServer - a tiny WebSocket server library written in C
https://theldus.github.io/wsServer
GNU General Public License v3.0
422 stars 80 forks source link

Ping - pthread_mutex_lock called on a destroyed mutex #65

Closed mikegapinski closed 1 year ago

mikegapinski commented 1 year ago

I am not a C/C++ maestro in any way and I've been fighting with adding ping support to my app.

My program has a main loop that broadcasts a message to all connected clients 30 times per second. This works fine.

Now I need to remove inactive clients as soon as possible(pings)

What I tried:

With both approaches I'm getting this error right after launching the program: FORTIFY: pthread_mutex_lock called on a destroyed mutex (0x7817969390)

I'm running this on Android (https://github.com/tesla-android/android-external-libws), but this shouldn't change anything -> I just provided a custom makefile for AOSP.

The program I want to run with wsServer is opensource https://github.com/tesla-android/android-external-tesla-android-virtual-display/blob/main/tesla-android-virtual-display.cpp

I just replaced the mjpeg streamer with wsServer: https://pastebin.com/cKZnXu1n

I would very much appreciate any suggestions on how to get rid of this problem!

Thank you for creating this awesome library @Theldus, it's very easy to use and well documented!

mikegapinski commented 1 year ago

I've pushed the wsServer version to a separate branch for convenience: https://github.com/tesla-android/android-external-tesla-android-virtual-display/commit/e3b9dd37bd50b5f87e4e6a5987db40dc7c6d7da4

Theldus commented 1 year ago

Hi @mikegapinski, thanks for giving wsServer a try, Sorry for taking too long to answer you,

Looks like there is a race-condition in the close_client() function, at least, there is the only place that this might occur... However, I haven't tracked down yet the origin of this.

Since race-conditions are hard to reproduce, it might take a while to discover whats causing it.

Your repo: https://github.com/tesla-android/android-external-tesla-android-virtual-display/tree/next, occurs this? if so, how can I test it?

mikegapinski commented 1 year ago

My repo builds only under AOSP so it won't be that easy for you to build.

I can remove the android specific bits, verify that it is still reproducible under Linux(Android has it's own libc) and share the code with you here, would that be sufficient?

There is also a possibility that you weren't able to reproduce this before since the standard prhread implementation doesn't throw this race condition, no clue honestly. (https://android.googlesource.com/platform/bionic/+/9e989f12d1186231d97dac6d038db7955acebdf3%5E%21) I get this error every time I try to close the client, doesn't matter where I execute it(I tried placing this in the library itself in the loop that accepts new clients by iterating over the existing ones and closing them)

mikegapinski commented 1 year ago

I pulled the trigger on reverting the libc patch and... it works now. There is an issue in the close function in the library somewhere but I can't find it sadly :/

image
Theldus commented 1 year ago

Hi @mikegapinski,

I can remove the android specific bits, verify that it is still reproducible under Linux(Android has it's own libc) and share the code with you here, would that be sufficient?

Yes! building on NDK or Termux would be more than enough to me.

There is also a possibility that you weren't able to reproduce this before since the standard prhread implementation doesn't throw this race condition, no clue honestly.

Makes sense, looks like glibc doesn't have this type of check, even with FORTIFY enabled.

I get this error every time I try to close the client, doesn't matter where I execute it(I tried placing this in the library itself in the loop that accepts new clients by iterating over the existing ones and closing them)

This is pretty weird, so it doesn't seem to have anything to do with the ping specifically, just the fact that it's trying to close multiple clients at once, right?

Do you have any minimal examples of where this occurs? it would help a lot.

I pulled the trigger on reverting the libc patch and... it works now. There is an issue in the close function in the library somewhere but I can't find it sadly :/

That's good, but sad at the same time... I'll keep trying to reproduce/understand what's happening here =).

mikegapinski commented 1 year ago

I forgot about the NDK, I have another bit of Tesla Android that I built with the NDK before migrating it to AOSP directly. I should have it somewhere still, I'll find it and get back to you

mikegapinski commented 1 year ago

@Theldus this should build both with NDK and Termux(but I am not 100% sure if termux uses libc from AOSP internally...)

https://github.com/tesla-android/android-external-tesla-android-virtual-touchscreen/blob/main/tesla-android-virtual-touchscreen.c

There are no external dependencies, it's a simple relay that writes the messages from WS to the virtual device exposed from my kernel

Theldus commented 1 year ago

Hi @mikegapinski,

Thanks, with your last link I was able to reproduce the issue on my phone, via Termux.

Adding a pair of macros like this:

#define pthread_mutex_lock(m) \
    do { \
        fprintf(stderr, "Locking mutex: %p (" #m "), line: %d\n", m, __LINE__); \
        pthread_mutex_lock((m)); \
    } while (0) 

#define pthread_mutex_unlock(m) \
    do { \
        fprintf(stderr, "Unlocking mutex: %p (" #m "), line: %d\n", m, __LINE__); \
        pthread_mutex_unlock((m)); \
    } while (0) 

#define pthread_mutex_destroy(m) \
    do { \
        fprintf(stderr, "Destroying mutex: %p (" #m "), line: %d\n", m, __LINE__); \
        pthread_mutex_destroy((m)); \
    } while (0)

I was able to track down where the issue occurs:

$ ./examples/ping/ping 
Waiting for incoming connections...
Locking mutex: 0x5dca463518 (&mutex), line: 1587
Unlocking mutex: 0x5dca463518 (&mutex), line: 1610
Locking mutex: 0x5dca462dc4 (&client->mtx_snd), line: 283
Unlocking mutex: 0x5dca462dc4 (&client->mtx_snd), line: 295
Locking mutex: 0x5dca462d60 (&client->mtx_state), line: 249
Unlocking mutex: 0x5dca462d60 (&client->mtx_state), line: 251
Connection opened, addr: 192.168.100.4
Locking mutex: 0x5dca462d60 (&client->mtx_state), line: 223
Unlocking mutex: 0x5dca462d60 (&client->mtx_state), line: 225
Locking mutex: 0x5dca463518 (&mutex), line: 688
Locking mutex: 0x5dca462d60 (&client->mtx_state), line: 223
Unlocking mutex: 0x5dca462d60 (&client->mtx_state), line: 225
Locking mutex: 0x5dca462e24 (&cli->mtx_ping), line: 630
Locking mutex: 0x5dca462dc4 (&client->mtx_snd), line: 283
Unlocking mutex: 0x5dca462dc4 (&client->mtx_snd), line: 295
Unlocking mutex: 0x5dca462e24 (&cli->mtx_ping), line: 642
Locking mutex: 0x5dca462e58 (&client->mtx_state), line: 223   <<<-- here
FORTIFY: pthread_mutex_lock called on a destroyed mutex (0x5dca462e58)
Aborted

Which points to the line 205 of the original code (branch master):

static int get_client_state(ws_cli_conn_t *client)
{
    int state;

    if (!CLIENT_VALID(client))
      return (-1);

    pthread_mutex_lock(&client->mtx_state); <- here
    state = client->state;
    pthread_mutex_unlock(&client->mtx_state);
    return (state);
}

This specific mutex (mtx_state) does not seem to be destroyed, so I still need to figure out why I'm getting this error, but yes, I've made a little progress.

EDIT1: A closer look at the mtx_state address one can note that the address change in the latest call, from 0x5dca462d60 to 0x5dca462e58; so maybe I'm getting a buffer overflow somewhere.

Any news I'll let you know =).

mikegapinski commented 1 year ago

I am pleasantly surprised that you even managed to pinpoint the problematic location. Kudos!

I guess it would make sense to add my Android.bp makefile to your repository after this gets resolved. I have not seen a lot of easy to use web socket libs for the NDK, AOSP comes with libwebsockets but it's overkill for most people

Theldus commented 1 year ago

@mikegapinski Fixed in d0961b3.

The problem was my macro CLIENT_VALID() not properly checking if a client was valid or not. Due to that, ws_ping()/send_ping_close() was attempting to send a ping frame to an invalid client, which clearly have invalid mutexes as well =).

Please let me know if this commit fixes your issue =).

I guess it would make sense to add my Android.bp makefile to your repository after this gets resolved. I have not seen a lot of easy to use web socket libs for the NDK, AOSP comes with libwebsockets but it's overkill for most people

Thanks. This has always been my intention: to have an easy-to-use WebSockets library for C, I'm glad this still holds true over the years =).

I'm not familiar with AOSP, but from what I've read here, it appears that AOSP projects have their own version of the "Makefile." I'm not sure if anyone else would use wsServer for this purpose besides you, but feel free to submit a PR =).

mikegapinski commented 1 year ago

@Theldus Rebuilding with your patch + reversed libc change. About the Makefile for AOSP, it's nothing fancy: https://github.com/tesla-android/android-external-libws/blob/main/Android.bp

But one thing for sure, not a lot of people cook their own versions of Android on a daily basis, it would make more sense to prepare the library for the NDK. I'll take a look into it at some point when I have free time. Now we know that it'll work well on Android so it makes sense to include a ready to use version for others to enjoy

mikegapinski commented 1 year ago

@Theldus I verified your patch, works 100%. Closing this issue. Thanks for your help!