libevent timer crash - Githubissues

maxim-ky commented 4 years ago

Hello guys.

After updating the aerospike-client-c from 4.1.2 to 4.6.13, we got a new crash (SIGSEGV). It happens very rarely (about 1 time a week) and it doesn't reproduce in a test environment.

The stacktraces vary but they always indicate a corrupted timer event struct in the event_loop. It seems like an event_add was called, but not event_del and since then the memory was rewritten so event struct contains random garbage.

I tried to update the client to 4.6.16, but no luck.

Adding these checks for an abandoned timer also gave nothing,

--- src/include/aerospike/as_event_internal.h   2020-05-26 14:56:29.956612519 +0300
+++ src/include/aerospike/as_event_internal.h   2020-05-26 14:48:06.475454901 +0300
@@ -519,6 +519,9 @@
 static inline void
 as_event_timer_once(as_event_command* cmd, uint64_t timeout)
 {
+        if((cmd->flags & AS_ASYNC_FLAGS_HAS_TIMER) && evtimer_pending(&cmd->timer, NULL))
+                abort(); 
+
    evtimer_assign(&cmd->timer, cmd->event_loop->loop, as_libevent_timer_cb, cmd);
    struct timeval tv;
    tv.tv_sec = (uint32_t)timeout / 1000;
@@ -530,6 +533,9 @@
 static inline void
 as_event_timer_repeat(as_event_command* cmd, uint64_t repeat)
 {
+        if((cmd->flags & AS_ASYNC_FLAGS_HAS_TIMER) && evtimer_pending(&cmd->timer, NULL))
+                abort();
+
    event_assign(&cmd->timer, cmd->event_loop->loop, -1, EV_PERSIST, as_libevent_repeat_cb, cmd);
    struct timeval tv;
    tv.tv_sec = (uint32_t)repeat / 1000;
@@ -786,6 +792,9 @@
 as_event_command_destroy(as_event_command* cmd)
 {
    // Use this function to free batch/scan/query commands that were never started.
+        if((cmd->flags & AS_ASYNC_FLAGS_HAS_TIMER) && evtimer_pending(&cmd->timer, NULL))
+                abort();
+
    as_node_release(cmd->node);
    cf_free(cmd);
 }
--- src/main/aerospike/as_event.c   2020-05-26 14:56:29.960612517 +0300
+++ src/main/aerospike/as_event.c   2020-05-26 14:50:31.048085326 +0300
@@ -1375,6 +1375,10 @@
        cf_free(cmd->buf);
    }

+
+        if((cmd->flags & AS_ASYNC_FLAGS_HAS_TIMER) && evtimer_pending(&cmd->timer, NULL))
+                abort();
+
    cf_free(cmd);

    if (event_loop->max_commands_in_process > 0 && ! event_loop->using_delay_queue) {

As the next round of compulsive trying to figure out the problem, I added some debug check to the libevent source code to bookkeep all the timers whatsoever. And since this bookkeeping wasn't thread-safe, the as_event_set_single_thread(true); was also added.

And the crash was fixed. So as_event_set_single_thread(true); was the solution. And since it works just fine in our single-threaded app, I stop the investigation at this point. The final guess is a race condition in the as_event_balance_connections thread (it's a pure guess, no clues opting for this one).

If you'd like to proceed with this one, I'm willing to help with checks, but a few limitations arise since it's a production environment and the event is rare:

We can't add much debug output since the high load.
We can't slow down the app significantly (otherwise Valgrind or ASan would be helpful).
We can't deploy it too often so a few debug builds are OK, but a bunch of them is rather no.

BrianNichols commented 4 years ago

There is a known race condition in 4.6.13 regarding async retries. The scenario was:

1) Socket error occurs on cmd. 2) Retry is initiated by adding wakeup/retry event to libevent queue. This is done to allow other commands to run before executing retry on next event loop iteration. 3) Timeout event is triggered which frees the cmd. 4) The wakeup/retry event is then triggered which references the freed cmd.

This bug was fixed in 4.6.14 by disabling timeout timer until retry event is triggered. When triggered, the timeout is checked, timeout timer is restored and cmd is retried. I'm surprised that versions >= 4.6.14 did not fix your problem. In any case, do not use 4.6.13.

as_event_set_single_thread(true) has the opposite effect of what you stated. This setting eliminates the evthread_use_pthreads() call and eliminates the mutex locks applied to every libevent command. This will cause your application to run faster, but you must ensure that there are no cross-thread commands directed towards your event loop.

The first step is to understand why as_event_set_single_thread(true) would solve anything other than improving performance.

maxim-ky commented 4 years ago

We tried 4.6.16, the same crash.

The guess was about as_cluster_tender thread. It calls as_event_balance_connections in a separate thread only if as_event_single_thread's not set.

https://github.com/aerospike/aerospike-client-c/blob/cfdfdd1d9048b4d85dbdff4f943e827f7feaf127/src/main/aerospike/as_cluster.c#L519

BrianNichols commented 4 years ago

as_event_balance_connections() is not called from the cluster tend thread in as_event_single_thread mode because that would be a cross-thread call. If as_event_single_thread is true, the user is now expected to call the following functions in the event loop thread:

1) as_event_loop_register_aerospike(as_event_loop event_loop, aerospike as)

Register aerospike instance with event loop. A 30 second repeating timer is added that performs connection balancing for each registered aerospike instance on that event loop.

2) as_event_loop_close_aerospike(as_event_loop event_loop, aerospike as, as_event_close_listener listener, void* udata)

Unregister and free aerospike instance resources associated with event loop. Listener is called when all aerospike instance async commands have completed on this event loop. Do not call aerospike_close() until listeners return on all event loops.

3) as_event_close_loop(as_event_loop* event_loop)

Close internal event loop and release internal/external event loop watchers. This optional method can be used instead of as_event_close_loops(). If used, must be called from event loop's thread.

A single thread event loop example is provided in the Async C Tutorial.

If these functions are not called when as_event_single_thread is true, min connections (as_config.async_min_conns_per_node) and trimming idle connections (as_config.max_socket_idle) will be disabled.

The next step is to call these new functions using client version 4.6.16 and check if the crash is reintroduced. Let us know the results.

maxim-ky commented 4 years ago

A week ago I added as_event_loop_register_aerospike call according to the example. Doesn't seem to crash. Any suggestions for further research?

Also, I'm wondering if not calling as_event_loop_register_aerospike can potentially cause some application-level problems? Like poor performance or using an excess amount of socket descriptors.

BrianNichols commented 4 years ago

If as_event_set_single_thread(true) is set and as_event_loop_register_aerospike() is not called, idle connections created at peak activity will not be trimmed down in periods of lower activity, thus causing an excess amount of open socket descriptors. Always call as_event_loop_register_aerospike() when as_event_set_single_thread(true) is set .

That said, it would be nice to know what caused the original crash. Stack traces can be helpful and a targeted debug build is usually the most helpful.

maxim-ky commented 4 years ago

OK, thanks for the clarification.

Stacktraces are useless because they all show random event_[add|del] calls (not related to aerospike-client). An only clue is all of them point to timer events. So it must be a corrupted timer event (dangling pointer) added to libevent.

It could be either a forgotten event_del before cf_free() or event_assign / evtimer_assign on an already added event. Is it possible to prepare a debug build where these scenarios are checked explicitly?

BrianNichols commented 4 years ago

I need to finish other tasks before preparing a debug build. It might be a while.

aerospike / aerospike-client-c

libevent timer crash #97