apache / couchdb-erlfdb

Erlang API for FoundationDB
https://www.foundationdb.org
Apache License 2.0
28 stars 20 forks source link

Possible issue with mutex not being destroyed #49

Closed leonardb closed 2 years ago

leonardb commented 2 years ago

Firstly, thanks for the bindings. I've been using them in production for quite a while without any serious issues, until now.

As our systems have scaled we noticed a marked increase in 'system' memory growth, until some servers were running out of memory. https://erlangforums.com/t/possible-memory-leak-with-a-nif/978/3

Digging a bit deeper and it looks like it may be an issue in the erlfdb_nif with a mutex not begin destroyed.

Using instrument.allocations().

       erlfdb_nif =>
           #{binary => {72,85,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0},
             driver_mutex =>
                 {2240960,206493,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0},
             driver_tid => {1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0},
             drv_binary => {107,1717,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0},
             nif_internal => {0,0,0,21,0,0,0,0,0,0,0,0,0,0,0,0,0,0}},

and 20 minutes later

           #{binary => {162,215,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0},
             driver_mutex =>
                 {4634393,423889,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0},
             driver_tid => {1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0},
             drv_binary => {257,4976,11,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0},
             nif_internal => {0,0,0,89,0,0,0,0,0,0,0,0,0,0,0,0,0,0}},

I may be approaching this incorrectly, but I can replicate the increase in mutex counts just by performing transaction set operations and the mutex count does not ever decrease.

1> I = fun() -> {ok, {_,_,#{erlfdb_nif := N}}} = instrument:allocations(), N end.
#Fun<erl_eval.45.79398840>
2> Gen = fun(Size) when is_integer(Size), Size > 1 -> RandBin = crypto:strong_rand_bytes(Size - 1), <<"asdf", RandBin/binary>> end.
#Fun<erl_eval.44.79398840>
3> Db = erlfdb:open(<<"fdb.local.cluster">>).
{erlfdb_database,#Ref<0.698868889.2384330753.87336>}
4> I().
#{binary => {1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0},
  driver_mutex => {1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0},
  driver_tid => {1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0}}
5> [begin V = Gen(X), K = integer_to_binary(X,32), ok = erlfdb:transactional(Db, fun(Tx) -> erlfdb:set(Tx, K, V) end) end || X <- lists:seq(10,10000)].
[ok,ok,ok,ok,ok,ok,ok,ok,ok,ok,ok,ok,ok,ok,ok,ok,ok,ok,ok,
 ok,ok,ok,ok,ok,ok,ok,ok,ok,ok|...]
6> I().
#{binary => {1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0},
  driver_mutex => {9989,3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0},
  driver_tid => {1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0}}
7> [begin V = Gen(X), K = integer_to_binary(X,32), ok = erlfdb:transactional(Db, fun(Tx) -> erlfdb:set(Tx, K, V) end) end || X <- lists:seq(10,10000)].
[ok,ok,ok,ok,ok,ok,ok,ok,ok,ok,ok,ok,ok,ok,ok,ok,ok,ok,ok,
 ok,ok,ok,ok,ok,ok,ok,ok,ok,ok|...]
8> I().                                                                                                                                                
#{binary => {1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0},
  driver_mutex => {19980,3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0},
  driver_tid => {1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0}}
kocolosk commented 2 years ago

Hi @leonardb, glad to hear you're finding the project useful! This looks like an interesting one, I'll try to take a look sometime this week.

nickva commented 2 years ago

I looks like had forgotten deallocate the future in the resource destructor

nickva commented 2 years ago

Wonder if this would fix it https://github.com/apache/couchdb-erlfdb/pull/50 ?

leonardb commented 2 years ago

@nickva I'll patch my local branch and test now

leonardb commented 2 years ago

It seems to resolve the mutex issue.

Just ran into a bit of oddness when I tried to load it up a bit though. (using same funs from earlier)

11> [[begin K = integer_to_binary(X,32), erlfdb:transactional(Db, fun(Tx) -> erlfdb:wait(erlfdb:get(Tx, K)) end) end || X <- lists:seq(10,1000)] || _Y <- lists:seq(1,500)].
../include/internal/ethr_mutex.h:656: Fatal error in ethr_mutex_lock(): Invalid argument (22)
                                                                                             Aborted (core dumped)
leonardb commented 2 years ago

@nickva Thank you for the quick patch. I've been running this on one of my nodes in production for 30 minutes to test and it looks good so far.

       erlfdb_nif =>
           #{binary => {124,132,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0},
             driver_mutex => {117,8,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0},
             driver_tid => {0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0},
             drv_binary => {152,3547,11,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0},
             nif_internal => {0,0,0,124,0,0,0,0,0,0,0,0,0,0,0,0,0,0}},
nickva commented 2 years ago

11> [[begin K = integer_to_binary(X,32), erlfdb:transactional(Db, fun(Tx) -> erlfdb:wait(erlfdb:get(Tx, K)) end) end || X <- lists:seq(10,1000)] || _Y <- lists:seq(1,500)]. ../include/internal/ethr_mutex.h:656: Fatal error in ethr_mutex_lock(): Invalid argument (22) Aborted (core dumped)

This is an interesting one. Looks like a use after free perhaps?. Is it easy to reproduce? Wonder if it is a resource reference counting bug: a destructor is called, we are deallocating, then some thread tries to lock it. The error comes from:

static ETHR_INLINE void
ETHR_INLINE_MTX_FUNC_NAME_(ethr_mutex_lock)(ethr_mutex *mtx)
{
    int res = pthread_mutex_lock(&mtx->pt_mtx);
    if (res != 0)
    ETHR_FATAL_ERROR__(res);
}
leonardb commented 2 years ago

@nickva While I could easily reproduce earlier, I no longer can, so guess it may have just been an artifact of a bad build which somehow caused an issue.

If I run into it again and can reproduce I'll open a separate issue.

nickva commented 2 years ago

@leonardb thanks for double-checking!

nickva commented 2 years ago

Merged the fix. If let us know @leonardb if you find any more issues!