DOCGroup / ACE_TAO

ACE and TAO
https://www.dre.vanderbilt.edu/~schmidt/TAO.html
701 stars 380 forks source link

SSLIOP Transports are used after being cached with entry state IDLE #2184

Open eommc opened 9 months ago

eommc commented 9 months ago

Version

We are using the ACE ORB supported by OCI, https://theaceorb.com/, wich is TAO 2.2. But looking at the code here on github, it seems that the code is the same as the one on master branch here, with TAO 3.1.

Host machine and operating system

Built on an almalinux 9.2 docker image, run on Red Hat Enterprise Linux release 9.3

Compiler name and version (including patch level)

g++ (GCC) 11.4.1 20230605 (Red Hat 11.4.1-2)

The $ACE_ROOT/ace/config.h file

#if defined FD_SETSIZE
#       undef FD_SETSIZE
#endif
#define FD_SETSIZE 8192
#define ACE_HAS_IPV6 1
#define TAO_HAS_ZIOP 1
#include "config-linux.h"

The $ACE_ROOT/include/makeinclude/platform_macros.GNU file

exceptions=1
optimize=0
threads=1
ssl=1
zlib=1
include $(ACE_ROOT)/include/makeinclude/platform_linux.GNU
buildbits = 64

Contents of $ACE_ROOT/bin/MakeProjectCreator/config/default.features

ssl=1
zlib = 1

Our application acts as a server for various other programs, but also delegates a few functions to another service. In this interaction, the application is acting as a client towards the other service, and multiple threads can perform requests concurrently. The ORBs use an svc config file with the following content:

    static Client_Strategy_Factory "-ORBClientConnectionHandler rw -ORBTransportMuxStrategy exclusive -ORBConnectStrategy blocked -ORBConnectionHandlerCleanup 1"
    static Resource_Factory "-ORBFlushingStrategy blocking -ORBProtocolFactory SSLIOP_Factory"
    dynamic SSLIOP_Factory Service_Object*
        TAO_SSLIOP:_make_TAO_SSLIOP_Protocol_Factory() "-v -SSLAuthenticate SERVER_AND_CLIENT -SSLCertificate ... -SSLPrivateKey ... -SSLPassword ... -SSLCAfile ... -SSLVersionList TLSv1.2"

We've seing a few threads remaining stuck waiting for a replies, and these errors in the logs:

    Exclusive_TMS::dispatch_reply - <2 != 1>

I've raised the log levels and investigated the issue. Whent the SSLIOP connector creates a new Transport, it is added to the Transport Cache Manager with state ENTRY_IDLE_AND_PURGABLE. Under heavy load, a different thread can fetch the same Transport from tha cache while the first thread is using it to send a request.

There are a few consequences:

Analyzing the code I found a big difference between IIOP and SSLIOP connections.

Both TAO_IIOP_Connector and TAO::SSLIOP::Connector inherit from TAO_Connector. In the former case, the transport creation passes through the base class, TAO_Connector::connect. This method has an infinte while loop that wraps all code. If an idle transport is found in the cache, it is returned. When a new Transport is created, it is not returned to the caller, instead it is cached and then the loop runs once again, and hopefully fetches the transport from the cache before a different thread snatches it. The Transport is created by TAO_IIOP_Connector::make_connection, which adds the transport in the cache manager using the default sate, ENTRY_IDLE_AND_PURGABLE (a default paramter). The state of the transport is changed to BUSY later, by the method Transport_Cache_Manager_T::find_transport, which ends up calling find_i.

    // Transport_Cache_Manager_T.cpp : 301
    entry->item ().recycle_state (ENTRY_BUSY);

On the other hand, TAO::SSLIOP::Connector overwrites the connect method (SSLIOP_Connector.cpp line 83), which internally calls ssliop_connect (line 190). Neither the method connect, nor ssliop_connect, have the same loop as the base class. Instead, the transport is cached with the default state ENTRY_IDLE_AND_PURGABLE, and then returned and used as is

    int retval =
      this->orb_core ()->
        lane_resources ().transport_cache ().cache_transport (desc,
                                                              transport);

With this pattern it is possible that a different thread finds the same transport in the cache, in idle state, and starts using it concurrently.

I tried a quick patch, adding the state ENTRY_BUSY as third parameter to the cache_transport invocation, in SSLIOP_Connector.cpp line 713, and our problem seems to be solved. All I saw seems to hint to a bug, but the code is 13 years old, so maybe there's something wrong in our configuration.