bitwiseworks / qtwebengine-chromium-os2

Port of Chromium and related tools to OS/2

Enormous amounts of pthread_mutex_t mutexes #31

Closed dmik closed 3 years ago

dmik commented 3 years ago

From https://github.com/bitwiseworks/qtwebengine-chromium-os2/issues/12#issuecomment-774335592:

Also sometimes I get ERROR_OUT_OF_MEMORY from DosCreateMutexSem in pthread_mutex_init for a reason not yet known.

Adding some simple code to pthread to count the pthread_mutex_init/pthread_mutex_destroy calls revealed really strange results. If I run Chromium in multi-process mode with three tabs open (HTML audio test, YouTube and CNN), I get these results by the time I close the browser app:

*** PTHR MUTEX CNT 576 MAX 12582

I.e., Chromium has 576 Mutex objects not destroyed, with a peak of 12582 simultaneously existing mutexes! This is simply enormous. I don't know a single reason to have that many mutexes in an application. Every pthread_mutex_init ends up in a DosCreateMutexSem call, so it's no surprise I get ERROR_OUT_OF_MEMORY from it from time to time.
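
For the record, the counting amounts to something like this (a minimal sketch with made-up names; the actual instrumentation added to pthread may differ):

```c
#include <stdio.h>

/* Hypothetical counters bumped from pthread_mutex_init/pthread_mutex_destroy.
 * GCC __atomic builtins keep them consistent across threads. */
static int g_mutex_cnt = 0;   /* mutexes currently alive */
static int g_mutex_max = 0;   /* peak number of simultaneously alive mutexes */

static void count_mutex_created(void)
{
    int cnt = __atomic_add_fetch(&g_mutex_cnt, 1, __ATOMIC_RELAXED);
    int max = __atomic_load_n(&g_mutex_max, __ATOMIC_RELAXED);
    /* Raise the recorded peak unless another thread already stored a bigger one. */
    while (cnt > max &&
           !__atomic_compare_exchange_n(&g_mutex_max, &max, cnt, 0,
                                        __ATOMIC_RELAXED, __ATOMIC_RELAXED))
        ;
}

static void count_mutex_destroyed(void)
{
    __atomic_sub_fetch(&g_mutex_cnt, 1, __ATOMIC_RELAXED);
}

/* Printed at process termination, e.g. from an atexit() handler. */
static void report_mutex_stats(void)
{
    printf("*** PTHR MUTEX CNT %d MAX %d\n", g_mutex_cnt, g_mutex_max);
}
```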

If I start the browser in single-process mode, the numbers are even crazier:

*** PTHR MUTEX CNT 3441 MAX 18353

It looks like a severe bug somewhere in Chromium (I suppose). I definitely need to investigate that.

dmik commented 3 years ago

Further testing shows that it's Chromium's base::Lock that creates 99% of all these mutexes (this class is Chromium's mutex abstraction). Which proves that it's Chromium itself that's guilty here. Now I need to see why it needs that many Lock instances (it won't be easy).

dmik commented 3 years ago

Actually, there isn't any practical way to analyze who's creating that many locks — they are used all over. What could help here is call stack traces at the places where locks are instantiated. But there isn't any ready-to-use API to do so. We could generate .TRP reports with EXCEPTQ, but I guess it will not survive creating a thousand .TRP reports per second on multiple threads... A profiling tool could also help, but we don't have a good working one (and whether it would cope with such a big Chromium DLL is a big question per se).

What I can try to do is optimize things so that the OS/2 mutex is only created when the lock is first acquired. That way, if a lock is merely constructed and never actually locked, it will not waste an OS/2 mutex.
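
Something along these lines (a rough sketch on top of the OS/2 semaphore API; the structure and names are made up, and the race between two threads creating the handle is glossed over):

```c
#define INCL_DOSSEMAPHORES
#define INCL_DOSERRORS
#include <os2.h>

/* Illustrative lock structure; the real pthread_mutex_t layout differs. */
typedef struct lazy_lock {
    HMTX hmtx;   /* NULLHANDLE until the lock is acquired for the first time */
} lazy_lock;

/* Construction is free: no OS/2 object is allocated yet. */
static void lazy_lock_init(lazy_lock *l)
{
    l->hmtx = NULLHANDLE;
}

/* The OS/2 mutex is created on first use only. A real implementation would
 * have to guard against two threads racing to create the handle (e.g. with
 * an atomic compare-and-swap on hmtx); that part is omitted here. */
static APIRET lazy_lock_acquire(lazy_lock *l)
{
    if (l->hmtx == NULLHANDLE) {
        APIRET rc = DosCreateMutexSem(NULL, &l->hmtx, 0, FALSE);
        if (rc != NO_ERROR)
            return rc;
    }
    return DosRequestMutexSem(l->hmtx, SEM_INDEFINITE_WAIT);
}
```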

dmik commented 3 years ago

Just for the record, here I can create exactly 65512 private OS/2 mutexes and then DosCreateMutexSem returns ERROR_TOO_MANY_HANDLES. This is not what I get in Chromium (ERROR_NOT_ENOUGH_MEMORY) when only about 20k mutexes have been created. So I wonder what it could be then...

My current guess is that mutexes use some memory area shared with other system resources and in Chromium it gets exhausted before a "technical" limit of 65512 mutexes is reached.
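
The limit test above boils down to a trivial loop like this (a sketch, not the exact program used):

```c
#define INCL_DOSSEMAPHORES
#define INCL_DOSERRORS
#include <os2.h>
#include <stdio.h>

int main(void)
{
    ULONG count = 0;
    APIRET rc;

    /* Create unnamed (private) mutex semaphores until the API refuses. */
    for (;;) {
        HMTX hmtx;
        rc = DosCreateMutexSem(NULL, &hmtx, 0, FALSE);
        if (rc != NO_ERROR)
            break;
        ++count;
    }

    /* In the test described above this reached exactly 65512 mutexes with
     * rc = ERROR_TOO_MANY_HANDLES, not ERROR_NOT_ENOUGH_MEMORY. */
    printf("created %lu mutexes, last rc = %lu\n", count, rc);
    return 0;
}
```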

dmik commented 3 years ago

Using PTHREAD_MUTEX_INITIALIZER in the base::Lock constructor instead of a real pthread_mutex_init call (which results in a DosCreateMutexSem call) reduced the peak number of simultaneously open mutexes by a factor of two. Not bad. But still, I really wonder why Chromium needs so many of them (and why many hundreds remain not destroyed by the time the app ends).
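
At the pthread level the difference is simply this (a sketch assuming that the OS/2 pthread library defers DosCreateMutexSem for statically initialized mutexes until the first lock):

```c
#include <pthread.h>

/* Eager: pthread_mutex_init creates the underlying OS/2 mutex right away
 * (one DosCreateMutexSem call per lock, even if it is never locked). */
static pthread_mutex_t g_lock_eager;

static void init_eager(void)
{
    pthread_mutex_init(&g_lock_eager, NULL);
}

/* Lazy: the static initializer only fills in the structure; the OS/2 object
 * is created later, on the first pthread_mutex_lock, so never-locked locks
 * cost no kernel handle at all. */
static pthread_mutex_t g_lock_lazy = PTHREAD_MUTEX_INITIALIZER;
```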

dmik commented 3 years ago

Yes, Chromium just creates mutexes like hell. Especially in debug builds. There they have these DCHECK_CALLED_ON_VALID_THREAD and similar macros used virtually everywhere to check for thread safety (thread affinity). And each such check requires a class member created with the THREAD_CHECKER macro. This instantiates a ThreadChecker class which creates a single mutex. So each instance of a class using the THREAD_CHECKER macro will create a mutex. And there may be quite a few instances of some classes (thousands). Also, there is a more sophisticated checker, SEQUENCE_CHECKER, used even more often, it seems. And this one uses TWO mutexes per instance.
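
In other words, the cost scales with the object count, roughly like this simplified C analogue (the real ThreadChecker is a C++ class and more involved; the names here are purely illustrative):

```c
#include <pthread.h>
#include <stdlib.h>

/* Rough analogue of a class that uses a per-instance thread checker: every
 * instance carries its own checker, and every checker carries its own mutex. */
typedef struct thread_checker {
    pthread_mutex_t lock;     /* one OS/2 mutex per checker, i.e. per object */
    pthread_t       owner;
} thread_checker;

typedef struct some_object {
    /* ... real data members ... */
    thread_checker checker;   /* what a THREAD_CHECKER-style macro adds */
} some_object;

/* Creating N objects therefore creates N mutexes, even if no two threads
 * ever contend on any of them. */
static some_object *make_objects(size_t n)
{
    some_object *objs = calloc(n, sizeof(*objs));
    if (!objs)
        return NULL;
    for (size_t i = 0; i < n; ++i)
        pthread_mutex_init(&objs[i].checker.lock, NULL);
    return objs;
}
```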

Of course, it's no surprise that in debug builds the mutex count quickly grows to the limit at some point. Here it starts failing with ERROR_NOT_ENOUGH_MEMORY after about 32k simultaneous mutexes. It really looks like some OS/2 system buffer runs out of space.

In release builds the situation is much better — far fewer mutexes. But still, a dozen thousand is quite a lot. It's a potential point of failure on OS/2. And a sign of bad design in Chromium. Interestingly enough, there is an article https://www.chromium.org/developers/lock-and-condition-variable advising against using the base::Lock and base::ConditionVariable primitives unless absolutely necessary. But there are already so many uses of them in existing classes... They protect data members there from unsafe concurrent access, but as I said above, many instances of such classes may be created.

StevenLevine commented 3 years ago

A couple of comments. 64K is a magic number in a lot of OS/2 APIs because much of the kernel is still 16-bit. We are limited to 65K memory objects. We are limited to 65K mutexes. The out-of-memory error may be a case of running out of memory objects while attempting to create the mutexes.

You have a couple of tracing options. The OS/2 trace facility will allow you to monitor mutex activity. You might want to try my TrcTools which lives at www.warpcave.com/os2diags. It's intended to be easier to configure than what's described in \os2\system\ras\trace.doc. If you want to try it, let me know which APIs you need to trace and I will build a .cfg file to match and save you some learning curve.

The other tracing option is Dave Blaschke's os2trace utility. It's a bit more complicated to configure, since it modifies the executables to be traced. However, once you understand what's needed it's easy enough. Since the APIs will be in some combination of pthr01, libcx0 and libcn0, the testing will need to be done under LIBPATHSTRICT. The benefit of os2trace is that the output is much easier to analyze. Take a look at os2traceenv*.cmd in \www.warpcave.com\betas. It's intended to make os2trace a bit easier to use. I also have some scripts I've used to set up os2trace for tracing kLIBC and friends.

In terms of why there are so many mutexes, often there's no good reason. Linux programmers seem to think more mutexes allow for more parallelism, which is often not true. Gregg and I ran into problems with Lucide, which created a mutex for every document object. Large PDFs often have more than 65K objects. Making the mutex usage sane resolved the problem.

The authors of the FAT32.IFS decided that every cache object needed a mutex. While the cache was not big enough to run the system out of mutexes, those mutexes did not protect what actually needed to be protected. Gregg and I made them go away and implemented the 3 or 4 mutexes that were actually needed.

dmik commented 3 years ago

@StevenLevine thanks for the tracing hints! I will really look into it and let you know if I need more help.

Re sane mutex usage, yes, I will have to do the same. I have an idea already — I will use LIBC's fmutex in the pthread_mutex implementation. An fmutex is cheaper than an OS/2 mutex because it doesn't use one at all: it uses an OS/2 semaphore (which is cheaper), and also, in lazy-init mode, the semaphore is created only when really needed, i.e. only when two threads actually attempt to acquire the mutex at the same time. I expect that in Chromium there is in many cases only one thread ever acquiring a given lock, so no OS/2 object will be created at all (fmutex uses atomics for that simple case). And in Chromium I will just use PTHREAD_MUTEX_INITIALIZER to cause lazy creation of the pthread_mutex itself. I wonder why it's not already done this way there... I will try to monitor mutex activity with the tools you point to, as otherwise I won't be able to easily see if the new solution helps here.
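
Very roughly, the idea is this fast-path/slow-path split (a sketch with made-up names; the real _fmutex lives in kLIBC, where the creation race and the full wait/wake protocol are handled properly):

```c
#define INCL_DOSSEMAPHORES
#include <os2.h>

/* Simplified model of a lazily backed fast mutex: an atomic flag serves the
 * uncontended case, and an OS/2 event semaphore is only created once two
 * threads actually collide on the lock. */
typedef struct fast_mutex {
    unsigned locked;   /* 0 = free, 1 = taken */
    HEV      hev;      /* NULLHANDLE until first contention */
} fast_mutex;

static void fast_mutex_lock(fast_mutex *fm)
{
    /* Fast path: a single atomic exchange, no OS/2 object involved at all. */
    if (__atomic_exchange_n(&fm->locked, 1, __ATOMIC_ACQUIRE) == 0)
        return;

    /* Slow path: another thread holds the lock, so the event semaphore is
     * created (lazily) and waited on. The creation race and semaphore reset
     * handling are omitted in this sketch. */
    if (fm->hev == NULLHANDLE)
        DosCreateEventSem(NULL, &fm->hev, 0, FALSE);
    while (__atomic_exchange_n(&fm->locked, 1, __ATOMIC_ACQUIRE) != 0)
        DosWaitEventSem(fm->hev, SEM_INDEFINITE_WAIT);
}

static void fast_mutex_unlock(fast_mutex *fm)
{
    __atomic_store_n(&fm->locked, 0, __ATOMIC_RELEASE);
    if (fm->hev != NULLHANDLE)
        DosPostEventSem(fm->hev);   /* wake a possible waiter */
}
```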

dmik commented 3 years ago

Using _fmutex in pthread (see bitwiseworks/pthread-os2#9) makes things work much better; even debug builds with several huge tabs survive now.

I also changed my mind and will not use PTHREAD_MUTEX_INITIALIZER in Chromium — further testing shows it doesn't save much, because most mutexes created in Chromium get locked right away anyway. And in debug builds there is some additional setup (PTHREAD_MUTEX_ERRORCHECK, which our pthread doesn't support now but might support later).
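
For reference, that additional setup is the standard POSIX attribute dance, which a static initializer cannot express (shown here with plain pthread calls):

```c
#include <pthread.h>

/* Debug builds want an error-checking mutex, which reports recursive locking
 * and unlocks by a non-owner thread instead of silently misbehaving. */
static int init_errorcheck_mutex(pthread_mutex_t *m)
{
    pthread_mutexattr_t attr;
    int rc = pthread_mutexattr_init(&attr);
    if (rc != 0)
        return rc;
    pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
    rc = pthread_mutex_init(m, &attr);
    pthread_mutexattr_destroy(&attr);
    return rc;
}
```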

So, closing this.