Fix rare segfault on wait_for_update

LinusCDE commented 2 years ago

Found this, when using this implementation for plato:

grafik

The cause was that sem_timedwait was called on sem_open's result SEM_FAILED. That call failed with errno == 22 ("Invalid value"). Seems that the rust code called the code too fast and and probably made more likely by more frequent wait()'s in plato due to the ui system. Plato uses a bus system for almost everything, so afaik, calls should never be happen in parallel.

The implemenation was 1:1 like the reference implementation. While this didn't ever happen on the official rm2fb client, the cause might still be interesting to investigate and could possibly hint to a problem in the original impl. (cc @raisjn)

Added a way to do several attempts with delays which should provide a good enough workaround. Also turned later occuring segfault into a proper panic, should 3 attempts/10ms delay be not enough for some app.

raisjn commented 2 years ago

interesting find! it's a bit surprising that it happens here. rm2fb should have NULL ptr handling added to make sure it doesn't also segfault, i wonder what's happening. EINVAL for sem_open() indicates something weird is going on.

fenollp commented 2 years ago

LGTM! Just maybe use a u8 instead of usize for the counter...

raisjn commented 2 years ago

i'm still curious about the cause here - can you add print statements to help diagnose / see what the requested shm path is? if it isn't happening in rm2fb (we would see segfaults in plato if it was, right?), it is likely better to figure out and fix the underlying cause.

EINVAL is given back when the name is invalid in shm_open()

LinusCDE commented 2 years ago

Reverted the attempt stuff and added a lot of print debugging (commit https://github.com/LinusCDE/libremarkable/commit/ae72c35c7df43f7c0fc7192e4faa841629318533).

Built current plato (master branch) against this lib: dist.tar.gz. You can just extract this on device and run the .plato.sh.

If you wish to see the exact sysv messages sent, you can add this spy lib between: libswtfb_sysv_spy.so.xz (Source)
Is also a libchook, you can add with LD_PRELOAD=/home/root/libsysv_spy.so. You can also add both this and the rm2fb client like this: LD_PRELOAD="/home/root/libsysv_spy.so /opt/lib/librm2fb_client.so.1" (used this originally for fixing bugs when implementing the client).

This really seems to be something that is really timing sensitive. Because running the binary with print statements works perfectly fine. You can see for yourself by replacing the "plato" binary with this one. The only change is that I just commented out every new println! statement.

It is also probably noteworthy to mention, that println! on rust by default locks a mutex. So it will probably cause a slightly longer delay than std::cout or printf because of the extra synchronization.

LinusCDE commented 2 years ago

The new commit should fix the issue for good. Seems that the O_CREAT needs the mode for creating a shared memory file if it doesn't exist. If the code waited a bit longer after the sysv message, the update would be complete and the server side would have created the semaphore correctly. The client didn't specify these arguments and if it actually needed to create it (because executed first), the code would error, possibly on missing/wrong arguments for creating the file.

Not sure why the c++ client/wrapper didn't cause this. My only guess would be that the wrapper uses Entware's libc (currently 2.27) which defaults to some correct values if they are not specified. Since we usually use a generic compiler for rust on arm, the libc crate is most likely the system one (currently 2.31) which might just behave differently when the arguments.

Specifying 0o644 as on the server side fixes the issue. @raisjn this might also be a nice upstream change. Why not mandatory this is probably better to make it explicit to prevent problems when compiling with/for different glibc/pthread versions. Also, do you know the use of the 4th argument (0). I doesn't seem to affect anything so I left it out for now.

screenshot4413

Edit: Mixed up which libc is which version. This seems to be more like a, we removed this default behaviour for clarification. So updating it for the c++ client is definitely recommended as it might be a case of future proofing, should entware's libc update.

LinusCDE commented 2 years ago

Had some weird behaviour. Not sure if really needed, added the 0 argument just to be sure.

canselcik / libremarkable

Fix rare segfault on wait_for_update #84