While testing ATS 9.2 against ATS 9.1 on our edge caches (linear only, 2x100GB, 64/128 core, RAM disk only), we noticed the number of established connections to the mid-tier caches in our network ramping up.
The issue seems to be that the session pool mutexes are heavily contended. Enough transactions fail to get the mutex lock that they create their own new sessions, while too many of the sessions put into the pool hang around until keep_alive_timeout_out expires. Changing from a global pool to a hybrid pool mitigates this somewhat, but only buys a little more headroom (until the global pool itself starts being contended). Using hybrid together with a lower keep-alive helps more, but increases session churn.
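For reference, a minimal sketch of the two settings discussed above. It assumes the `records.config` format used by ATS 9.x, and it assumes that "keep_alive_timeout_out" refers to `proxy.config.http.keep_alive_no_activity_timeout_out`. The timeout value is a placeholder for illustration, not a tested recommendation.

```
# Assumption: hybrid tries the global pool first and falls back to the per-thread pool on contention
CONFIG proxy.config.http.server_session_sharing.pool STRING hybrid
# Assumption: placeholder value in seconds; lower values trade idle sessions for more churn
CONFIG proxy.config.http.keep_alive_no_activity_timeout_out INT 30
```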
For fun, I changed the MUTEX_TRY_LOCK in HttpSessionManager.cc to SCOPED_MUTEX_LOCK, which completely eliminated the excess connections and had no statistically significant effect on TTFB (it actually ended up slightly lower).
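To illustrate why the two locking styles behave so differently under contention, here is a small standalone sketch. It is not the ATS code: `ToyPool`, `acquire_try`, `acquire_blocking` and `contended_try_demo` are hypothetical names, and `std::mutex` stands in for the ATS mutex macros. The try-lock path opens a new connection whenever the lock is busy, even though an idle session is sitting in the pool; the blocking path waits briefly and reuses it.

```cpp
#include <atomic>
#include <future>
#include <mutex>
#include <thread>
#include <vector>

// Toy stand-in for a server session pool (hypothetical, not the ATS structure).
struct ToyPool {
  std::mutex lock;
  std::vector<int> idle;           // ids of idle pooled sessions, guarded by lock
  std::atomic<int> next_id{1000};  // ids for freshly opened sessions
  std::atomic<int> opened{0};      // count of new upstream connections
};

struct Acquired {
  int id;
  bool reused;
};

// Open a brand new upstream connection (simulated by handing out an id).
Acquired open_new(ToyPool &pool) {
  pool.opened.fetch_add(1);
  return {pool.next_id.fetch_add(1), false};
}

// Try-lock style: if the pool lock is busy, skip the pool entirely.
Acquired acquire_try(ToyPool &pool) {
  std::unique_lock<std::mutex> guard(pool.lock, std::try_to_lock);
  if (guard.owns_lock() && !pool.idle.empty()) {
    int id = pool.idle.back();
    pool.idle.pop_back();
    return {id, true};
  }
  return open_new(pool);  // lock contended or pool empty
}

// Blocking style: wait for the lock, then reuse an idle session if one exists.
Acquired acquire_blocking(ToyPool &pool) {
  std::lock_guard<std::mutex> guard(pool.lock);
  if (!pool.idle.empty()) {
    int id = pool.idle.back();
    pool.idle.pop_back();
    return {id, true};
  }
  return open_new(pool);
}

// Hold the pool lock on a helper thread, then run one try-lock acquire.
// Returns the number of new connections opened while an idle session existed.
int contended_try_demo() {
  ToyPool pool;
  pool.idle.push_back(1);
  std::promise<void> held, release;
  std::future<void> held_f = held.get_future();
  std::future<void> release_f = release.get_future();
  std::thread holder([&] {
    std::lock_guard<std::mutex> guard(pool.lock);
    held.set_value();   // signal that the lock is now held
    release_f.wait();   // keep holding until the main thread is done
  });
  held_f.wait();
  acquire_try(pool);    // cannot get the lock, so it opens a new connection
  release.set_value();
  holder.join();
  return pool.opened.load();
}
```

In this toy model every failed try-lock costs one extra upstream connection, and the idle session it could have reused stays in the pool until it times out, which is consistent with the ramp described above. Whether blocking is acceptable in the real event threads is a separate question that this sketch does not answer.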
With ATS 9.1, the session keep_alive_timeout_out wasn't being properly honored, so sessions in the pool were being cleaned up early; this seems to be fixed in ATS 9.2. (edit: this part probably isn't true)