Further testing shows that it's Chromium's base::Lock which creates 99% of those mutexes (this class is the mutex abstraction in Chromium). Which proves that it's Chromium that is guilty here. Now I need to see why it needs that many Lock instances (it won't be easy).
Actually, there isn't any practical way to analyze who is creating that many locks; they are used all over. What could help here is call stack traces at the places where locks are instantiated, but there isn't any ready-to-use API to do so. We could generate .TRP reports with EXCEPTQ, but I guess it would not survive creating a thousand .TRP reports per second on multiple threads... A profiling tool could also help, but we don't have a good working one (and making one work with such a big Chromium DLL is a big question per se).
What I can try to do is optimize things so that OS/2 mutexes are only created when first locked: if a lock is just created and never actually locked, it will not waste an OS/2 mutex.
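Roughly, what I have in mind looks like this (just a sketch with made-up names, not the actual pthread code):

```cpp
// Illustrative sketch of lazy HMTX creation: the OS/2 mutex is created on
// the first lock attempt, so a lock that is constructed but never locked
// costs no OS/2 handle at all.
#define INCL_DOSSEMAPHORES
#define INCL_DOSERRORS
#include <os2.h>
#include <atomic>

struct LazyMutex {
  std::atomic<HMTX> hmtx{NULLHANDLE};  // NULLHANDLE = not created yet
};

inline APIRET LazyMutexLock(LazyMutex* m) {
  HMTX h = m->hmtx.load();
  if (h == NULLHANDLE) {
    APIRET rc = DosCreateMutexSem(NULL, &h, 0, FALSE);
    if (rc != NO_ERROR)
      return rc;
    HMTX expected = NULLHANDLE;
    // If another thread created the handle first, keep theirs and drop ours.
    if (!m->hmtx.compare_exchange_strong(expected, h)) {
      DosCloseMutexSem(h);
      h = expected;
    }
  }
  return DosRequestMutexSem(h, SEM_INDEFINITE_WAIT);
}
```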
Just for the record: here I can create exactly 65512 private OS/2 mutexes, and then DosCreateMutexSem returns ERROR_TOO_MANY_HANDLES. This is not what I get in Chromium, where it fails with ERROR_NOT_ENOUGH_MEMORY after having created only about 20k mutexes. So I wonder what it could be then... My current guess is that mutexes use some memory area shared with other system resources, and in Chromium it gets exhausted before the "technical" limit of 65512 mutexes is reached.
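For reference, the test was nothing more than a trivial loop along these lines (a sketch, not the exact program I used):

```cpp
// Create private (unnamed) OS/2 mutexes in a loop until DosCreateMutexSem
// fails, then report how many were created and with which error code.
// Here this prints 65512 and the ERROR_TOO_MANY_HANDLES code.
#define INCL_DOSSEMAPHORES
#define INCL_DOSERRORS
#include <os2.h>
#include <stdio.h>

int main() {
  unsigned count = 0;
  APIRET rc = NO_ERROR;
  for (;;) {
    HMTX hmtx;
    rc = DosCreateMutexSem(NULL, &hmtx, 0, FALSE);
    if (rc != NO_ERROR)
      break;
    ++count;  // handles are deliberately leaked; process exit cleans them up
  }
  printf("created %u mutexes, then rc = %lu\n", count, rc);
  return 0;
}
```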
Using PTHREAD_MUTEX_INITIALIZER in the base::Lock constructor instead of a real pthread_mutex_init call (which results in a DosCreateMutexSem call) reduced the peak number of simultaneously open mutexes by a factor of two. Not bad. But still, I really wonder why Chromium needs so many of them (and why many hundreds remain not destroyed by the time the app ends).
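For illustration, the change boils down to something like this (a sketch with illustrative names, not the exact Chromium source):

```cpp
#include <pthread.h>

// Sketch of a lock whose pthread mutex is statically initialized instead of
// being set up with pthread_mutex_init() in the constructor. With the static
// initializer, the underlying OS/2 mutex (DosCreateMutexSem) is only created
// lazily by the pthread library, not at construction time.
class LockImpl {
 public:
  LockImpl() = default;  // before: pthread_mutex_init(&native_handle_, nullptr);
  ~LockImpl() { pthread_mutex_destroy(&native_handle_); }

  void Lock() { pthread_mutex_lock(&native_handle_); }
  void Unlock() { pthread_mutex_unlock(&native_handle_); }

 private:
  pthread_mutex_t native_handle_ = PTHREAD_MUTEX_INITIALIZER;
};
```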
Yes, Chromium just creates mutexes like hell. Especially in debug builds. There they have these DCHECK_CALLED_ON_VALID_THREAD and similar macros used virtually everywhere to check for thread safety (thread affinity). Each such check requires a class member created with the THREAD_CHECKER macro. This instantiates a ThreadChecker class, which creates a single mutex. So each instance of a class using the THREAD_CHECKER macro will create a mutex, and there may be quite a few instances of some classes (thousands). Also, there is a more sophisticated checker, SEQUENCE_CHECKER, which seems to be used even more often, and this one uses TWO mutexes per instance.
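To make it concrete, the typical usage pattern is something like this (a representative sketch, not any particular Chromium class):

```cpp
#include "base/threading/thread_checker.h"

// Every instance of this class owns a ThreadChecker, and in debug builds
// each ThreadChecker holds its own lock, i.e. one more mutex per object.
// Create a few thousand such objects and you get a few thousand mutexes.
class SomeCache {
 public:
  void Put(int key, int value) {
    // Debug-only check that Put() is called on the thread owning the object;
    // internally this takes the ThreadChecker's lock.
    DCHECK_CALLED_ON_VALID_THREAD(thread_checker_);
    // ... actual work ...
  }

 private:
  THREAD_CHECKER(thread_checker_);  // one lock per SomeCache instance
};
```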
Of course, it's not a surprise that in debug builds the mutex count quickly grows to the limit at some point. Here it starts failing with ERROR_NOT_ENOUGH_MEMORY after reaching ~32k simultaneous mutexes. It really looks like some OS/2 system buffer runs out of space.
In release builds the situation is much better: far fewer mutexes. But still, a dozen thousand is quite a lot. It's a potential point of failure on OS/2, and a sign of bad design in Chromium. Interestingly enough, there is an article at https://www.chromium.org/developers/lock-and-condition-variable advising against using the base::Lock and base::ConditionVariable primitives unless absolutely necessary. But there are already so many uses of them in existing classes... They protect data members from unsafe concurrent access, but as I said above, many instances of such classes may be created.
A couple of comments. 64K is a magic number in a lot of OS/2 APIs because much of the kernel is still 16-bit. We are limited to 65K memory objects, and we are limited to 65K mutexes. The out-of-memory error may be a case of running out of memory objects while attempting to create the mutexes.
You have a couple of tracing options. The OS/2 trace facility will allow you to monitor mutex activity. You might want to try my TrcTools which lives at www.warpcave.com/os2diags. It's intended to be easier to configure than what's described in \os2\system\ras\trace.doc. If you want to try it, let me know which APIs you need to trace and I will build a .cfg file to match and save you some learning curve.
The other tracing option is Dave Blaschke's os2trace utility. It's a bit more complicated to configure, since it modifies the executables to be traced. However, once you understand what's needed, it's easy enough. Since the APIs will be in some combination of pthr01, libcx0 and libcn0, the testing will need to be done under LIBPATHSTRICT. The benefit of os2trace is that the output is much easier to analyze. Take a look at os2traceenv*.cmd in \www.warpcave.com\betas. It's intended to make os2trace a bit easier to use. I also have some scripts I've used to set up os2trace for tracing kLIBC and friends.
In terms of why there are so many mutexes, often there's no good reason. Linux programmers seem to think that more mutexes allow for more parallelism, which is often not true. Gregg and I ran into problems with Lucide, which created a mutex for every document object. Large PDFs often have more than 65K objects. Making the mutex usage sane resolved the problem.
The authors of the FAT32.IFS decided that every cache object needed a mutex. While the cache was not big enough to run the system out of mutexes, the mutexes did not protect what needed to be protected. Gregg and I made them go away and implemented the 3 or 4 mutexes that were actually needed.
@StevenLevine thanks for the tracing hints! I will really look into it and let you know if I need more help.
Re sane mutex usage: yes, I will have to do the same. I have an idea already: I will use LIBC's _fmutex in the pthread_mutex implementation. _fmutex is cheaper than an OS/2 mutex because it doesn't use one at all; it uses an OS/2 event semaphore (which is cheaper), and in lazy init mode the semaphore is created only when really needed, i.e. only when two threads actually attempt to acquire the mutex at the same time. I expect that in Chromium there is often only one thread actually requesting a given lock, so in that case no OS/2 object will be created at all (_fmutex uses atomics for this simple mode). And in Chromium I will just use PTHREAD_MUTEX_INITIALIZER to cause lazy creation of the pthread_mutex itself. I wonder why it's not already done this way there... I will try to monitor mutex activity with the tools you pointed to, as otherwise I won't be able to easily see whether the new solution helps here.
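Conceptually, the _fmutex approach looks roughly like this (my own sketch of the principle, not the actual kLIBC source):

```cpp
// The uncontended path is a single atomic compare-exchange, and the OS/2
// event semaphore is created only when a second thread really has to block.
// A lock only ever taken by one thread never allocates any OS/2 handle.
#define INCL_DOSSEMAPHORES
#define INCL_DOSERRORS
#include <os2.h>
#include <atomic>

struct FastMutex {
  std::atomic<int> state{0};         // 0 = free, 1 = locked, 2 = locked with waiters
  std::atomic<HEV> hev{NULLHANDLE};  // event semaphore, created lazily on contention
};

inline void FastMutexLock(FastMutex* m) {
  int expected = 0;
  if (m->state.compare_exchange_strong(expected, 1))
    return;  // fast path: no contention, no OS/2 object involved

  // Slow path: create the event semaphore on demand; if we lose the creation
  // race, close our handle and use the winner's.
  if (m->hev.load() == NULLHANDLE) {
    HEV h = NULLHANDLE;
    DosCreateEventSem(NULL, &h, 0, FALSE);
    HEV none = NULLHANDLE;
    if (!m->hev.compare_exchange_strong(none, h))
      DosCloseEventSem(h);
  }
  // Mark the lock as contended and block until the owner posts the event.
  while (m->state.exchange(2) != 0) {
    ULONG post_count;
    DosWaitEventSem(m->hev.load(), SEM_INDEFINITE_WAIT);
    DosResetEventSem(m->hev.load(), &post_count);
  }
}

inline void FastMutexUnlock(FastMutex* m) {
  if (m->state.exchange(0) == 2)  // somebody was (or may still be) waiting
    DosPostEventSem(m->hev.load());
}
```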
Using _fmutex in pthread (see bitwiseworks/pthread-os2#9) makes it work much better; even debug builds with several huge tabs survive now.
I also changed my mind and will not use PTHREAD_MUTEX_INITIALIZER in Chromium; further testing shows it doesn't save a lot, because it seems that most mutexes created in Chromium get locked right away anyway. And in debug builds there is some additional setup (PTHREAD_MUTEX_ERRORCHECK, which our pthread doesn't support now but might support later).
So, closing this.
From https://github.com/bitwiseworks/qtwebengine-chromium-os2/issues/12#issuecomment-774335592:
Adding some simple code to pthread to count pthread_mutex_init/pthread_mutex_destroy calls revealed really strange results. If I run Chromium in multi-process mode with three tabs open (HTML audio test, YouTube and CNN), by the time I close the browser app Chromium has 576 Mutex objects not destroyed, with a peak of 12582 simultaneously existing mutexes! This is simply enormous. I don't know a single reason to have that many mutexes in an application. Every pthread_mutex_init ends up in a DosCreateMutexSem call, so it's no surprise that I get ERROR_OUT_OF_MEMORY from it from time to time. If I start the browser in single-process mode, the numbers are even crazier.
It looks like some severe bug somewhere in Chromium (I suppose). I definitely need to investigate that.
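For completeness, the counting code was nothing fancy; roughly something along these lines (names and placement are illustrative, the real counters sat inside the pthread library itself):

```cpp
// Count pthread_mutex_init/destroy calls, track the peak number of live
// mutexes, and dump the totals at process exit.
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <atomic>

static std::atomic<long> g_init_calls{0};
static std::atomic<long> g_destroy_calls{0};
static std::atomic<long> g_live{0};
static std::atomic<long> g_peak{0};

static void dump_mutex_stats() {
  printf("pthread_mutex_init: %ld, destroy: %ld, leaked: %ld, peak: %ld\n",
         g_init_calls.load(), g_destroy_calls.load(),
         g_init_calls.load() - g_destroy_calls.load(), g_peak.load());
}

int counted_mutex_init(pthread_mutex_t* m, const pthread_mutexattr_t* attr) {
  static int registered = (atexit(dump_mutex_stats), 0);  // dump stats at exit
  (void)registered;
  ++g_init_calls;
  long live = ++g_live;
  long peak = g_peak.load();
  while (live > peak && !g_peak.compare_exchange_weak(peak, live)) {
    // retry until the recorded peak is at least `live`
  }
  return pthread_mutex_init(m, attr);
}

int counted_mutex_destroy(pthread_mutex_t* m) {
  ++g_destroy_calls;
  --g_live;
  return pthread_mutex_destroy(m);
}
```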