aerospike / aerospike-client-c

Aerospike C Client

crash on exit in as_cluster_tender (after libuv event loop is stopped) #149

Open bsergean opened 8 months ago

bsergean commented 8 months ago
0   server                    0x131064b           gsignal (raise.c:51)
1   server                    0xba03c2            abort 
2   server                    0x5f3fb8            [inlined] uv__async_send (async.c:198)
3   server                    0x5f3fb8            uv_async_send.cold (async.c:73)
4   server                    0x213c818           as_event_execute (as_event_uv.c:246)
5   server                    0x213b26b           as_event_balance_connections (as_event.c:1846)
6   server                    0x212f7f5           [inlined] as_cluster_balance_connections (as_cluster.c:632)
7   server                    0x212f7f5           as_cluster_manage (as_cluster.c:653)
8   server                    0x212fe8d           as_cluster_tend (as_cluster.c:885)
9   server                    0x21304a0           as_cluster_tender (as_cluster.c:935)
10  server                    0x1304c48           start_thread (pthread_create.c:477)
11  server                    0x83aa4c2           __clone
  1. We use libuv (recent version)
  2. On shutdown we sequentially call:
        aerospike_destroy( cluster );
        LOG_INFO( "Closing aero event loops" );
        as_event_close_loops();

We do this before our event loop gets stopped. Any idea what's going on? It feels like after calling as_event_close_loops the cluster tend mechanism should stop (and that thread should exit).

BrianNichols commented 8 months ago

Do you call aerospike_close() before calling aerospike_destroy()?

aerospike_close() should perform a graceful shutdown of the cluster while aerospike_destroy() just frees cluster memory. aerospike_destroy() alone does not attempt to stop the cluster tend thread.
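
The difference between the two calls can be illustrated with a small library-free model. This is only a sketch, not the client's implementation: `ModelCluster`, `close()`, `tender_stopped()` and `tend_count()` are hypothetical names, and a `std::thread` stands in for the cluster tend thread. It shows why stopping and joining the background thread has to be a separate step that happens before the object is freed.

```cpp
#include <atomic>
#include <chrono>
#include <thread>

// Hypothetical stand-in for a cluster object that owns a background tend thread.
class ModelCluster {
public:
    ModelCluster() : tender_([this] { tend_loop(); }) {}

    // Plays the role the thread assigns to aerospike_close():
    // ask the tend thread to stop and wait until it has exited.
    void close() {
        running_.store(false);
        if (tender_.joinable()) {
            tender_.join();
        }
    }

    bool tender_stopped() const { return !tender_.joinable(); }
    int tend_count() const { return tends_.load(); }

    // No custom destructor: like the "destroy only frees memory" behaviour
    // described above, destruction does not stop the thread. In this model,
    // destroying the object while the thread is still joinable makes
    // std::thread's destructor call std::terminate, so close() must come first.

private:
    void tend_loop() {
        while (running_.load()) {
            tends_.fetch_add(1);  // one simulated tend pass
            std::this_thread::sleep_for(std::chrono::milliseconds(1));
        }
    }

    std::atomic<bool> running_{true};
    std::atomic<int> tends_{0};
    std::thread tender_;  // declared last so the flags exist before it starts
};
```

After `close()` returns, `tend_count()` no longer changes, so anything the tend thread used to touch can be torn down safely.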

bsergean commented 8 months ago

Yes we do, this is what our cleanup looks like.

          as_error err{};
          as_error_reset( &err );
          auto * cluster = static_cast<aerospike *>( mInternalObject );

          if ( aerospike_close( cluster, &err ) != AEROSPIKE_OK )
          {
              LOG_ERROR( "Could not close connection to aerospike: error({}) {} at [{}:{}]", static_cast<int>( err.code ), err.message, err.file, err.line );
          }

          aerospike_destroy( cluster );

          LOG_INFO( "Closing aero event loops" );
          as_event_close_loops();

We changed the way we are closing our uv_loop, maybe this is our problem.

bsergean commented 8 months ago

Could it help to call

as_event_set_external_loop( ... );

with a null pointer to signal that the loop is gone? Or maybe we could call uv_run a few times to advance the event loop, which could help the aerospike client terminate gracefully.

BrianNichols commented 8 months ago

When are you closing the shared uv_loop?

If you are sharing libuv event loops with the C client via as_event_set_external_loop() or as_set_external_event_loop(), then closing those event loops must come after as_event_close_loops().
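
The ordering described here can be written down as a fixed sequence. The sketch below is deliberately library-free so that it runs anywhere: `TeardownStep`, `shared_loop_teardown` and `run_teardown` are hypothetical helpers, not part of the C client or libuv. In a real program the four callbacks would presumably wrap `aerospike_close()`, `aerospike_destroy()`, `as_event_close_loops()` and, last, the application's own `uv_stop()` / `uv_loop_close()` on the shared loop.

```cpp
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper type: a named teardown action.
struct TeardownStep {
    std::string name;
    std::function<void()> action;
};

// Builds the sequence suggested in this thread: the client is shut down
// first, and the shared event loop is closed only after the client has
// released its event loops.
std::vector<TeardownStep> shared_loop_teardown(
    std::function<void()> close_cluster,       // e.g. wraps aerospike_close()
    std::function<void()> destroy_cluster,     // e.g. wraps aerospike_destroy()
    std::function<void()> close_client_loops,  // e.g. wraps as_event_close_loops()
    std::function<void()> close_shared_loop)   // e.g. wraps uv_stop()/uv_loop_close()
{
    return {
        {"aerospike_close", std::move(close_cluster)},
        {"aerospike_destroy", std::move(destroy_cluster)},
        {"as_event_close_loops", std::move(close_client_loops)},
        {"close_shared_uv_loop", std::move(close_shared_loop)},
    };
}

// Runs the steps strictly in order and returns the names that were executed.
std::vector<std::string> run_teardown(const std::vector<TeardownStep>& steps) {
    std::vector<std::string> done;
    for (const auto& step : steps) {
        step.action();
        done.push_back(step.name);
    }
    return done;
}
```

Keeping the order in one place makes it harder for an unrelated refactor of the application's loop handling to move the shared-loop shutdown ahead of the client shutdown.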

bsergean commented 8 months ago

Good point, we are calling uv_stop (loop) before our aero shutdown sequence. I think we need to reshuffle things.