bitwiseworks / libcx

kLIBC Extension Library
GNU Lesser General Public License v2.1

Make shared mutex / memory name depend on DLL module handle #99

Closed dmik closed 3 years ago

dmik commented 3 years ago

LIBCx uses named shared mutex and memory objects so that all LIBCx processes on the system can find these singleton resources by name.

However, when developing, debugging, testing, building RPMs, etc., I often run two versions of LIBCx located in different directories (thanks to LIBPATHSTRICT=T). One version is usually the normal, system-installed version from RPM and the other one is the development build. If they happen to use the same mutex and memory names, they will open the same mutex and memory objects. Most of the time this works, but there are many cases in which it leads to very strange failures and crashes. Usually this happens because the development version introduces incompatible changes to the shared memory layout, so it may corrupt the data used by the other version (and vice versa).

But this is not the only case. Even if the data layout fully matches, things may still go wrong under other circumstances. One of them is when the shared data refers to static objects located in the DLL's own shared data (or even code) segment. Simple examples of such objects are static const strings (data) and functions (code).

If one DLL leaves references to its data or code objects in the shared memory block and these are later accessed by the other DLL, weird things may happen: the other DLL may not even have access to these objects (because it was loaded by a process that never loaded the first DLL, or because the first DLL has since been unloaded). In the case of LIBCx, this may trigger an assertion in the system version of the DLL, and all system processes using it will immediately die. A truly terrible situation.
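To make the data half of this hazard concrete, here is a minimal, hypothetical C sketch (the struct and function names are made up and do not reflect LIBCx's actual layout). A record in a shared block that stores a `const char *` to a DLL-private static string is only meaningful to processes that can see that DLL's segment; storing the characters by value keeps the record readable by any process that maps the block. This is a general technique, not the fix chosen in this issue, and it does not help with code pointers such as callbacks.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical record layout kept in a memory block shared by all processes. */

/* Fragile: the pointer refers to a static string in the data segment of
 * whichever DLL copy wrote it; another DLL copy (or a process that never
 * loaded the writer) may not be able to dereference it. */
struct shared_record_fragile {
    const char *owner_name;
};

/* Safer: the characters themselves live inside the shared block. */
struct shared_record_safe {
    char owner_name[32];
};

/* Copy the name by value, truncating if needed; always NUL-terminated. */
static void set_owner_safe(struct shared_record_safe *rec, const char *name)
{
    assert(rec != NULL && name != NULL);
    snprintf(rec->owner_name, sizeof(rec->owner_name), "%s", name);
}
```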

I already tried to solve this back in the day by appending the LIBCx version as well as the build type (release/debug and dev/production) to the mutex and shared memory names; see commits b1964198841c6d634c82b9ea2b0ae5c32bc88471 and a4f961821305b279f9cb3374f110658829718eed. It helps most of the time, but not always. For example, it fails when I install a release dev build to the system DLL directory and then run some tests with LIBPATHSTRICT=T and BEGINLIBPATH pointing to the development tree without first bumping the LIBCx version and rebuilding it.

This leads to a situation in which two DLLs with identical version signatures, but living in different directories, are loaded at the same time. Of course, they use the same mutex and shared memory, and weird things start happening.

dmik commented 3 years ago

Getting tired of that, I came up with a very simple idea: append the DLL module handle value to the mutex and shared memory names! This ensures that the names differ for DLLs loaded into memory from different directories, because each of them gets a distinct module handle. Different names lead to different mutex and shared memory objects being created, so the two copies will not even notice that they coexist in memory and therefore will not conflict in any way.
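The naming scheme described above could look roughly like the following portable C sketch. It assumes hypothetical version and build tags and a made-up name layout (`make_shared_name`, `LIBCX_VERSION_TAG` and `LIBCX_BUILD_TAG` are illustrative, not LIBCx identifiers). On OS/2 the module handle would typically be the `hmod` argument of the DLL's `_DLL_InitTerm` entry point, and names of named semaphores and named shared memory must start with `\SEM32\` and `\SHAREMEM\` respectively. Only the string building is shown; the actual `DosCreateMutexSem` / `DosAllocSharedMem` calls are left out.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical tags standing in for the LIBCx version and build type
 * (release/debug, dev/production) that were already part of the names. */
#define LIBCX_VERSION_TAG "1.2.3"
#define LIBCX_BUILD_TAG   "rel-dev"

/*
 * Build "<prefix><base>_<version>_<build>_<hmod in hex>" into buf.
 * prefix must be "\\SEM32\\" (named semaphores) or "\\SHAREMEM\\" (named
 * shared memory). hmod is the module handle of this particular DLL copy, so
 * two copies loaded from different directories end up with different names.
 * Returns 0 on success, -1 if the result does not fit into buf.
 */
static int make_shared_name(char *buf, size_t size, const char *prefix,
                            const char *base, unsigned long hmod)
{
    int n;

    assert(buf != NULL && base != NULL && prefix != NULL);
    assert(strcmp(prefix, "\\SEM32\\") == 0 ||
           strcmp(prefix, "\\SHAREMEM\\") == 0);

    n = snprintf(buf, size, "%s%s_%s_%s_%lX", prefix, base,
                 LIBCX_VERSION_TAG, LIBCX_BUILD_TAG, hmod);
    return (n < 0 || (size_t)n >= size) ? -1 : 0;
}
```

With identical version and build tags, only the trailing handle differs: for made-up handles 0x1A2B and 0x3C4D the sketch produces `\SEM32\LIBCX_1.2.3_rel-dev_1A2B` and `\SEM32\LIBCX_1.2.3_rel-dev_3C4D`, so each DLL copy opens its own mutex.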

At least, with this improvement applied, my problems with running tests of the very same version are completely gone. I will give it some more testing.

I also think we need the same thing for LIBC, which has exactly the same problem, so I will create a separate ticket there.

dmik commented 3 years ago

Note that this issue does not affect only developers: it also affects users trying to run different copies of **exactly the same** LIBCx DLL located in different directories via LIBPATHSTRICT=T. We definitely need to fix it.

dmik commented 3 years ago

JFTR, here is the crash I get when I run a test that loads a non-system copy of the DLL while the system copy is already in memory:

```
LIBC PANIC!!
_um_free_maybe_lock: Tried to free block twice - block=befebe80 lock=0x1
pid=0x0088 ppid=0x0085 tid=0x0001 slot=0x0097 pri=0x0200 mc=0x0000 ps=0x0017
D:\CODING\LIBCX\MASTER-BUILD\STAGE\BIN\TST-FLOCK2.EXE
Creating 0088_01.TRP
Moved 0088_01.TRP to 6120259d-0088_01-TST-FLOCK2-exceptq.txt
```

Among other things, LIBCx places a custom C heap in the shared memory block using the _ucreate EMX API. This API takes callbacks that allocate and release memory, and LIBCx supplies it with an allocation callback that commits more shared memory as needed. But the callback stored in the heap belongs to the other DLL (the one that created the heap), and something may go wrong because of that, though it is not yet clear what exactly.

Another thing is that process 0088 is a forked child of another TST-FLOCK2 instance (0085), which suggests the problem might be related to how LIBC implements forking. Forking involves copying the private data of each DLL into the child's address space, but since there are two copies of the same DLL in memory (system and non-system), something may go wrong along these lines, and the new process may end up with wrong heap data, so it tries to free a block twice. This leads to a crash while holding the LIBCx global lock. And since this lock is shared between the system and non-system LIBCx DLLs, it kills all processes using either of them.

More debugging is needed to find the exact cause. I will probably postpone it until later, given that using completely separate mutex and memory objects solves the issue.