basho / leveldb

Clone of http://code.google.com/p/leveldb/
BSD 3-Clause "New" or "Revised" License

Riak sometimes fails to start due to LOCK files being held due to lock "already held by process" [JIRA: RIAK-3118] #218

Open bashoshaun opened 7 years ago

bashoshaun commented 7 years ago

A customer occasionally runs into a problem where their node will not start back up after it has been shut down to perform a backup.

On startup, they see this kind of output in console.log:

2016-12-11 01:14:47.674 [error] <0.1611.0>@riak_kv_vnode:init:512 Failed to start riak_kv_eleveldb_backend backend for index 1381575766539369164864420818427111292018498732032 error: {db_open,"IO error: lock /data/riak/leveldb/1381575766539369164864420818427111292018498732032/LOCK: already held by process"}
2016-12-11 01:14:47.678 [notice] <0.1611.0>@riak:stop:43 "backend module failed to start."
2016-12-11 01:14:47.678 [error] <0.1611.0> gen_fsm <0.1611.0> in state started terminated with reason: no function clause matching riak_kv_vnode:terminate({bad_return_value,{stop,{db_open,"IO error: lock /data/riak/leveldb/1381575766539369164864420818..."}}}, undefined) line 1184
2016-12-11 01:14:47.678 [error] <0.1611.0> CRASH REPORT Process <0.1611.0> with 1 neighbours exited with reason: no function clause matching riak_kv_vnode:terminate({bad_return_value,{stop,{db_open,"IO error: lock /data/riak/leveldb/1381575766539369164864420818..."}}}, undefined) line 1184 in gen_fsm:terminate/7 line 600
2016-12-11 01:14:47.679 [error] <0.163.0> Supervisor riak_core_vnode_sup had child undefined started with {riak_core_vnode,start_link,undefined} at <0.1611.0> exit with reason no function clause matching riak_kv_vnode:terminate({bad_return_value,{stop,{db_open,"IO error: lock /data/riak/leveldb/1381575766539369164864420818..."}}}, undefined) line 1184 in context child_terminated
2016-12-11 01:14:47.679 [error] <0.160.0> Supervisor riak_core_sup had child riak_core_vnode_manager started with riak_core_vnode_manager:start_link() at <0.1585.0> exit with reason {{function_clause,[{riak_kv_vnode,terminate,[{bad_return_value,{stop,{db_open,"IO error: lock /data/riak/leveldb/1381575766539369164864420818427111292018498732032/LOCK: already held by process"}}},undefined],[{file,"src/riak_kv_vnode.erl"},{line,1184}]},{riak_core_vnode,terminate,3,[{file,"src/riak_core_vnode.erl"},{line,908}]},{gen_fsm,terminate,7,[{file,"gen_fsm.erl"},{line,597}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,239}]}]},{gen_fsm,sync_send_event,[<0.1611.0>,wait_for_init,...]}} in context child_terminated
2016-12-11 01:14:47.683 [info] <0.500.0>@riak_kv_app:prep_stop:232 Stopping application riak_kv - marked service down.
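
For background on where that "already held by process" wording comes from: in the upstream Google leveldb POSIX Env, the LOCK file is taken with a non-blocking advisory fcntl lock, and a table of paths already locked by the current process is checked first; it is that in-process check, not the fcntl call itself, that produces "already held by process". I believe basho's fork follows the same pattern, but the sketch below is a simplified illustration rather than the actual basho code, and the names in it are invented for the example.

```c++
// Simplified illustration of a leveldb-style LOCK acquisition on POSIX.
// Assumption: basho's fork follows the upstream env_posix pattern; the
// names here (LockLevelDBDir, g_locked_files) are made up for the sketch.
#include <fcntl.h>
#include <unistd.h>
#include <cstring>
#include <mutex>
#include <set>
#include <string>

static std::mutex g_mu;
static std::set<std::string> g_locked_files;  // paths locked by *this* process

// Returns 0 on success; -1 if the lock could not be acquired.
int LockLevelDBDir(const std::string& dbdir, int* fd_out) {
  const std::string lock_path = dbdir + "/LOCK";

  {
    std::lock_guard<std::mutex> guard(g_mu);
    // If this process already has the path locked, fail immediately --
    // this is the branch that reports "already held by process".
    if (!g_locked_files.insert(lock_path).second) return -1;
  }

  int fd = open(lock_path.c_str(), O_RDWR | O_CREAT, 0644);
  if (fd < 0) {
    std::lock_guard<std::mutex> guard(g_mu);
    g_locked_files.erase(lock_path);
    return -1;
  }

  struct flock f;
  std::memset(&f, 0, sizeof(f));
  f.l_type = F_WRLCK;    // exclusive lock
  f.l_whence = SEEK_SET;
  f.l_start = 0;
  f.l_len = 0;           // length 0 => lock the whole file

  if (fcntl(fd, F_SETLK, &f) == -1) {  // non-blocking attempt
    close(fd);
    std::lock_guard<std::mutex> guard(g_mu);
    g_locked_files.erase(lock_path);
    return -1;                         // another process holds the lock
  }

  *fd_out = fd;  // keep the descriptor open for the lifetime of the DB
  return 0;
}
```

The relevant property is that the fcntl lock is advisory and tied to the owning process: it is released when the descriptor is closed or the process exits, independently of whether the LOCK file itself is ever removed.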

I've attached the full console.log file (console.log.txt) for context.

The customer has also implemented some basic debugging in their startup script, as it was initially thought that a process might have been holding a lock on these files. This is an example of what the output looked like when the node was up and running normally:

lsof:

...
beam.smp  177432           riak  mem       REG              253,8      32526336 1744834583 /data/riak/leveldb/1267395951122892374379757940871151681107879002112/000369.log
beam.smp  177432           riak  mem       REG              253,8      27365376 1879049262 /data/riak/leveldb/68507889249886074290797726533575766546371837952/000366.log
beam.smp  177432           riak  mem       REG              253,8      27631616  805307404 /data/riak/leveldb/125597796958124469533129165311555572001681702912/000367.log
beam.smp  177432           riak  mem       REG              253,8      32116736 1476398089 /data/riak/leveldb/1438665674247607560106752257205091097473808596992/000382.log
beam.smp  177432           riak  mem       REG              253,8      32694272 1476398103 /data/riak/leveldb/1039036320289938793410432185759232459286639542272/000382.log
...

After the stop, there are no leveldb entries in the output of lsof. Similarly, there are no beam processes running.

So when Riak restarts, there is no obvious reason why it should fail. When the customer then manually attempts a restart after realising the node is down, it comes up fine.

The customer is running RHEL 7.

bashoshaun commented 7 years ago

The customer would like to add the following:

While no process has the LOCK files open after riak stop, the files do persist in the file system. Also, the file system is XFS on a thinly provisioned LVM logical volume, which is used to create a snapshot while the node is stopped.
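
On the LOCK files persisting after riak stop: in stock leveldb the LOCK file is deliberately left on disk when the database is closed; only the advisory fcntl lock on it is dropped, so a leftover LOCK file is expected and is not by itself evidence of a stale lock. If it helps with the startup-script debugging, something along these lines (a hypothetical standalone probe, not part of riak or leveldb) can confirm from outside the BEAM whether any process still holds the fcntl lock on a given partition's LOCK file:

```c++
// Hypothetical probe (not part of riak or leveldb): report whether any
// process currently holds an fcntl write lock on a leveldb LOCK file.
// A LOCK file that merely exists on disk after shutdown is expected;
// only a live lock holder would block a clean restart.
#include <fcntl.h>
#include <unistd.h>
#include <cstdio>
#include <cstring>

int main(int argc, char** argv) {
  if (argc != 2) {
    std::fprintf(stderr, "usage: %s /path/to/leveldb/<partition>/LOCK\n", argv[0]);
    return 2;
  }

  int fd = open(argv[1], O_RDWR);
  if (fd < 0) {
    std::perror("open");
    return 2;
  }

  struct flock f;
  std::memset(&f, 0, sizeof(f));
  f.l_type = F_WRLCK;     // ask: would an exclusive lock conflict?
  f.l_whence = SEEK_SET;  // l_start = l_len = 0 => whole file

  if (fcntl(fd, F_GETLK, &f) == -1) {
    std::perror("fcntl(F_GETLK)");
    close(fd);
    return 2;
  }

  if (f.l_type == F_UNLCK) {
    std::printf("no process holds a lock on %s\n", argv[1]);
  } else {
    std::printf("lock on %s is held by pid %ld\n", argv[1], (long)f.l_pid);
  }

  close(fd);
  return 0;
}
```

Running it against a partition's LOCK file immediately after riak stop (and again just before the automated restart) would show whether the kernel-level lock has actually been released, which would help separate a genuinely stale lock from the in-process "already held by process" case described above.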