There are a number of things going on, but the high-level summary is that opening a database handle can return {error, all_dbs_active}, and unfortunately a number of internal subsystems ignore this error and do not account for it. We sometimes get away with it because the caller crashes within a properly supervised hierarchy and gets restarted without major incident. However, we're not always that lucky: there are scenarios where spawned processes that are not properly supervised crash and only partially propagate the failure, leaving the system in an unknown impaired state with undefined behavior. Even worse, we don't have a quick and easy way of determining all the places where this might happen.
I'll give an example of a confirmed scenario where this bug induces a wedged system state that requires a full reboot of CouchDB to fix. Then I'll point out a handful of other areas where we're not handling all_dbs_active properly. First, though, I want to talk about one of the primary sources of this issue: the discrepancy within couch_util:with_db [1], where errors in opening the db handle are propagated with throw rather than returned as an {error, Reason} tuple [2], yet nearly all callers expect the error tuple to be returned. There are many callers of couch_util:with_db in the codebase, and very few expect or handle a thrown error. There are also places where the caller checks the return value for potential errors, but that only covers errors from the callback function passed to couch_util:with_db, not errors from loading the database handle itself.
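To make the mismatch concrete, here is a minimal sketch of the pattern (illustrative only, not the actual couch_util or caller code; do_work/1 and handle_error/1 are hypothetical helpers): the open error leaves via throw, while the caller only pattern matches on the return value.

%% Sketch of the discrepancy: open errors bypass the stack with throw,
%% but the caller only inspects the returned value.
with_db(DbName, Fun) ->
    case couch_db:open_int(DbName, []) of
        {ok, Db} ->
            try
                Fun(Db)
            after
                couch_db:close(Db)
            end;
        Error ->
            %% e.g. {error, all_dbs_active} is thrown, not returned
            throw(Error)
    end.

caller(DbName) ->
    case with_db(DbName, fun(Db) -> do_work(Db) end) of
        {error, Reason} ->
            %% Never reached for open errors; the throw above crashes the
            %% calling process before this clause can match.
            handle_error(Reason);
        Else ->
            Else
    end.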
So let's look at one of these scenarios: when a smoosh_channel.erl gen_server process starts to compact a view, one of the first things it does is fetch a handle to the index state process [3] by way of couch_index_server:get_index/3. If there is not already a spawned index process for that database view, the smoosh_channel process makes a blocking gen_server:call into couch_index_server to get the index [4], with an infinity timeout. In general this is not good; we should avoid making a gen_server:call from within a gen_server to prevent the calling process from getting wedged, which is what happened in this scenario, although it is still not the main bug. Moving on, the couch_index_server gen_server handles the get_index call [5] and does a few things: it spawn_links a new process to invoke couch_index_server:new_index/2, and it returns a noreply in the context of the gen_server:call from the smoosh_channel process, in theory to reply once the index has been opened. The problem is that the process spawned on new_index calls couch_index:start_link/2 [6], which uses couch_util:with_db in an unsafe manner [7] by expecting errors to propagate by return value rather than by bypassing the stack with throw.
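A simplified sketch of the shape described above (assumed, not the literal couch_index_server code; add_waiter/2 is a hypothetical helper): the worker is linked but its pid is never recorded, and the caller is parked waiting for a reply that is supposed to arrive later.

handle_call({get_index, Mod, IdxState}, From, State) ->
    %% Linked worker, but its pid is not stored anywhere (e.g. not in by_pid).
    spawn_link(fun() -> new_index(Mod, IdxState) end),
    %% Defer the reply; the smoosh_channel caller stays parked in
    %% gen_server:call(..., infinity) until someone replies to From.
    {noreply, add_waiter(From, State)}.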
This means that in [7] the expression Resp = couch_util:with_db(DbName, fun(Db) -> ... never completes, because the throw({error, all_dbs_active}) is not caught. Therefore Resp is never checked and proc_lib:init_ack({error, all_dbs_active}) is never triggered, which in turn means the error never propagates into the couch_index_server:new_index logic that handles unexpected errors and returns them to the parent couch_index_server gen_server process, since the error is thrown rather than returned in [8]. Now, the proc_lib:start_link documentation [9] states the following:
If the process crashes before it has called init_ack/1,2, Ret = {error, Reason} will be returned if the calling process traps exits.
So perhaps the intent here was that the {error, all_dbs_active} thrown by way of couch_util:with_db would be propagated up in the return value so it could be caught by the worker process for new_index in [8]. However, that process is spawned by couch_index_server in [5], and while couch_index_server does indeed set trap_exit, the spawned process does not, and neither does the couch_index process it spawns. Therefore, this appears to crash the process spawned by way of new_index. Now, in [5] the couch_index_server gen_server used a spawn_link to instantiate this worker, so it is linked, but the pid itself is never saved, nor added to the by_pid table for later lookup. The exit is propagated to the handle_info callback in [10], but because the pid is not in the by_pid table and the exit Reason is not normal, we hit the fallback that silently ignores the error in [11].
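A hedged sketch of that handle_info flow (names such as handle_index_exit/3 are illustrative, not the literal couch_index_server code): because the dead worker was never registered in by_pid, its non-normal exit falls through to a clause that drops it on the floor.

handle_info({'EXIT', Pid, Reason}, Server) ->
    case ets:lookup(by_pid, Pid) of
        [{Pid, DbNameSig}] ->
            %% A known index process died: clean up and notify its waiters.
            handle_index_exit(DbNameSig, Reason, Server);
        [] ->
            %% The new_index worker was never added to by_pid, so even a
            %% crash (Reason =/= normal) lands here and is silently ignored;
            %% nobody ever replies to the blocked waiters.
            {noreply, Server}
    end.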
At this point we've attempted to open the index and failed, yet we've also failed to propagate the error properly and have ignored the process exit. Unfortunately, the smoosh_channel caller is still blocked by way of the noreply in [5], and because it made a gen_server:call with a timeout of infinity, it will be blocked permanently: the worker process that was expected to produce the reply is long since dead and will never recover. Therefore the smoosh_channel will also never recover, and will be completely unresponsive while it sits blocked awaiting the gen_server:call reply. Even worse, in [5] a waiters entry was established for opening this index, so future callers trying to open the index will see the waiters list and also get stuck there indefinitely.
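For contrast, here is a sketch of the happy path that never runs in this failure mode (entirely illustrative; the message shape and take_waiters/2 are assumptions on my part): only when the worker reports a successfully opened index does the server reply to every parked caller, so with the worker dead those replies never happen.

handle_call({async_open, DbNameSig, {ok, IndexPid}}, _From, Server) ->
    %% Release everyone parked on this index; in the wedged scenario this
    %% clause is never reached and the waiters stay blocked forever.
    {Waiters, Server1} = take_waiters(DbNameSig, Server),
    [gen_server:reply(From, {ok, IndexPid}) || From <- Waiters],
    {reply, ok, Server1}.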
This results in all view requests to that shard/view/node triple getting blocked on opening an index that will never open. It also manifests as the /_system endpoint throwing timeout errors, because the blocked, unresponsive smoosh_channel trips the response time thresholds used when generating the /_system response.
You could conceivably fix this live by way of remsh, but at this point the only real recourse is a full CouchDB reboot, especially since you don't know what other areas of the system have wound themselves into undefined behavior. We should definitely try to track down the places where we can encounter things like this. It's unfortunate that OTP doesn't have a gen_worker construct that would allow short-lived processes to be spawned and properly managed.
On the front of finding places where we fail to account for the {error, all_dbs_active} return value, I'm happy to say Dialyzer does a pretty solid job of tracking instances down! I know Dialyzer output has been challenging to parse in earlier versions of Erlang, but I'm very happy to report that we get some solid output from Dialyzer now.
As a quick proof of concept, I made the following change to couch_server:
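The actual diff isn't reproduced here; as a rough illustration of the kind of change involved (the exact function and spec below are assumptions on my part, not the change I made), declaring the error in the return type of the open path is what gives Dialyzer something to check callers against.

%% Hypothetical spec addition (illustrative only): advertising the
%% all_dbs_active error in the return type lets Dialyzer flag callers
%% that never match on it.
-spec open(DbName :: binary(), Options :: [any()]) ->
    {ok, Db :: term()} | {error, all_dbs_active}.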
With that update alone, Dialyzer found the following callers failing to handle {error, all_dbs_active} (note that there are other error return types here too; I'm just focused on all_dbs_active right now):
src/couch_db_split.erl:0: in couch_db_split:split/3[57-77]:The pattern
{'not_found', _} can never match the type
{'error', 'all_dbs_active'}
src/couch_db_split.erl:0: in couch_db_split:copy_local_docs/3[78-111]:The pattern
{'not_found', _} can never match the type
{'error', 'all_dbs_active'}
src/couch_db_split.erl:0: in couch_db_split:cleanup_target/2[112-126]:The pattern
{'not_found', _} can never match the type
{'error', 'all_dbs_active'}
src/cpse_test_open_close_delete.erl:0: in cpse_test_open_close_delete:cpse_open_non_existent/1[32-38]:Guard test
{'error', 'all_dbs_active'} | {'ok', _} =:=
__X :: {'not_found', 'no_db_file'} can never succeed
src/cpse_test_open_close_delete.erl:0: in cpse_test_open_close_delete:cpse_open_create/1[39-45]:Guard test
{'error', 'all_dbs_active'} | {'ok', _} =:=
__X :: {'not_found', 'no_db_file'} can never succeed
src/cpse_test_open_close_delete.erl:0: in cpse_test_open_close_delete:cpse_open_when_exists/1[46-52]:Guard test
{'error', 'all_dbs_active'} | {'ok', _} =:=
__X :: {'not_found', 'no_db_file'} can never succeed
src/cpse_test_open_close_delete.erl:0: in cpse_test_open_close_delete:cpse_terminate/1[53-59]:Guard test
{'error', 'all_dbs_active'} | {'ok', _} =:=
__X :: {'not_found', 'no_db_file'} can never succeed
src/dreyfus_rpc.erl:0: in dreyfus_rpc:get_or_create_db/2[104-113]:The pattern
{'not_found', 'no_db_file'} can never match the type
{'error', 'all_dbs_active'} | {'ok', _}
src/mem3_rep.erl:0: in mem3_rep:go/1[106-132]:The pattern
{'not_found', 'no_db_file'} can never match the type
{'error', 'all_dbs_active'}
src/mem3_rep.erl:0: in mem3_rep:with_src_db/2[711-723]:The pattern
{'not_found', _} can never match the type
{'error', 'all_dbs_active'}
src/mem3_reshard_job.erl:0: in mem3_reshard_job:wait_source_close_impl/1[499-516]:The pattern
{'not_found', _} can never match the type
{'error', 'all_dbs_active'}
src/mem3_util.erl:0: in mem3_util:ensure_exists/1[294-305]:The pattern
'file_exists' can never match the type
{'error', 'all_dbs_active'}
src/mem3_util.erl:0: in mem3_util:get_or_create_db/2[548-564]:The pattern
{'not_found', 'no_db_file'} can never match the type
{'error', 'all_dbs_active'}
src/mem3_util.erl:0: in mem3_util:get_or_create_db_int/2[565-582]:The pattern
{'not_found', 'no_db_file'} can never match the type
{'error', 'all_dbs_active'}
src/mem3_util.erl:0: in mem3_util:get_shard_props/1[589-610]:The pattern
{'not_found', _} can never match the type
{'error', 'all_dbs_active'}
src/hastings_rpc.erl:0: in hastings_rpc:get_or_create_db/2[111-120]:The pattern
{'not_found', 'no_db_file'} can never match the type
{'error', 'all_dbs_active'} | {'ok', _}
This is not an exhaustive list but it's an excellent start, and this makes me very optimistic for further Dialyzer enhancements.
I've also searched through logs and found that the all_dbs_active errors manifest in most of the different service endpoints and result in runtime errors and crashes. I need to sanitize the logs and I'm low on time today, so I'll circle back around on that later, but a quick summary is that I've found all_dbs_active crashes in the following subsystems: map views, reduce views, search queries, internal replication, security doc acquisition, doc updates, view compaction (and most likely database compaction), couch_index supervisor hierarchy crashes, all_docs queries, changes queries, and get revs.
[1] https://github.com/apache/couchdb/blob/main/src/couch/src/couch_util.erl#L558-L575
[2] https://github.com/apache/couchdb/blob/main/src/couch/src/couch_util.erl#L566-L567
[3] https://github.com/apache/couchdb/blob/main/src/smoosh/src/smoosh_channel.erl#L476-L478
[4] https://github.com/apache/couchdb/blob/main/src/couch_index/src/couch_index_server.erl#L123-L124
[5] https://github.com/apache/couchdb/blob/main/src/couch_index/src/couch_index_server.erl#L156-L159
[6] https://github.com/apache/couchdb/blob/main/src/couch_index/src/couch_index_server.erl#L243
[7] https://github.com/apache/couchdb/blob/main/src/couch_index/src/couch_index.erl#L76-L105
[8] https://github.com/apache/couchdb/blob/main/src/couch_index/src/couch_index_server.erl#L249-L252
[9] https://www.erlang.org/doc/man/proc_lib.html#start_link-5
[10] https://github.com/apache/couchdb/blob/main/src/couch_index/src/couch_index_server.erl#L197-L211
[11] https://github.com/apache/couchdb/blob/main/src/couch_index/src/couch_index_server.erl#L208-L209
[12] https://github.com/apache/couchdb/blob/main/src/couch_index/src/couch_index_server.erl#L160-L162