chewbranca opened this issue 2 months ago
Looks like `dreyfus_rpc` does the right thing and cleans up the `Workers` in the outer `after` clause: https://github.com/apache/couchdb/blob/main/src/dreyfus/src/dreyfus_fabric_search.erl#L147 and it looks like that's the full list of workers too, not just the winning shard range workers. I suggest, at a minimum, we follow the same pattern from `dreyfus_rpc` and do cleanup on the full set of workers in the `after` clause.
I say "at a minimum" because I think we should consider moving the cleanup to the dedicated rexi_mon
process such that if the coordinator process dies it'll still have the workers cleaned up. This is definitely a secondary concern compared to the main source of stranded workers in this ticket, but still worth considering.
Good finds @chewbranca! Clearly there is something broken here and we should fix it. Thanks for the detailed analysis!
> we should consider moving the cleanup to the dedicated `rexi_mon` process
For streams we already have a cleanup process spawned for every streaming request: https://github.com/apache/couchdb/blob/main/src/fabric/src/fabric_streams.erl#L47. We should see why that doesn't clean up the workers and instead lets them time out.
Perhaps it's too cautious in order to avoid sending unnecessary kill messages? It tries to use `rexi_STREAM_CANCEL`, which makes the worker exit `normal`, instead of killing it, so as to avoid generating SASL logs. But perhaps that won't happen anyway, as those workers are not gen_servers?
Recently we also added a `kill_all` command to aggregate kill commands per node: instead of sending one message per shard, it's one per node with a list of refs. Maybe that's enough to keep the overhead of the extra kills fairly low.
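For illustration, a rough sketch of what that aggregated cleanup looks like from the caller's side, assuming `rexi:kill_all/1` takes a list of `{Node, Ref}` tuples and batches them per node (the call site here is hypothetical):

```erlang
%% Collect one {Node, Ref} pair per worker, then let rexi group the
%% refs by node and send a single kill_all message to each node.
NodeRefs = [{Node, Ref} || #shard{node = Node, ref = Ref} <- Workers0],
rexi:kill_all(NodeRefs).
```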
Another thing to keep in mind is that we don't always want to kill the workers; at least in the update docs path we specifically allow them to finish updating, to reduce the pressure on the internal replicator.
> Looks like `dreyfus_rpc` does the right thing and cleans up the `Workers` in the outer `after` clause
Dreyfus doesn't use the streams facility, so it likely has a slightly different way of doing cleanup. There is also the complication of replacements: if they are spawned, those have to be cleaned up as well. However, if we do a blanket `kill_all` for all the workers then it should take care of that, too. But it would be nice to see which corner cases we're missing currently: which errors are generated, and whether it's triggered by some error or just a race condition...

Do you have an easily reproducible scenario to test it out? Start a 3-node cluster and issue a bunch of `_all_docs` calls?
Having failed to reproduce this locally, I moved on to investigating on a cluster where this error happens regularly.
Found a cluster where `exit:timeout` stream init timeout errors happen up to 4000 times per minute. Noticed most of them are not generated by an error in the coordinator or the workers. The processes that generate those are calls to `fabric:design_docs/1` from the ddoc cache recover logic. The calls seem to not generate any failures, except for the left-over workers stuck in the stream init state, waiting for stream start/cancel messages, which was rather baffling at first.

However, after a more thorough investigation, the reason turns out to be that design docs are updated often enough that the ddoc cache is quickly firing up and immediately killing the `fabric:design_docs/1` process. There is nothing to log an error, and since these are not gen_servers registered with SASL, they don't emit any error logs, as expected.
In general, we already have a fabric_streams mechanism to handle the coordinator being killed unexpectedly. However, tracing the lifetime of the `fabric:design_docs/1` processes shows the coordinator is often killed before it gets a chance to even start the auxiliary cleanup process. The current pattern is something like this:
We submit the jobs:
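Roughly, as in `fabric_view_all_docs.erl` (a simplified sketch, not the verbatim code):

```erlang
%% Cast one fabric_rpc job per shard; each worker's Ref is stored in
%% its #shard{} record, yielding the full worker set Workers0.
Workers0 = fabric_util:submit_jobs(
    Shards, fabric_rpc, all_docs, [Options, WorkerArgs]
),
```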
Then we spawn the cleanup process:
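Inside `fabric_streams:start` there is something along these lines (a simplified sketch of the cleaner; the real implementation keeps more state):

```erlang
%% The cleaner monitors the coordinator and, if it dies, kills all the
%% workers the coordinator had submitted.
spawn_worker_cleaner(Coordinator, Workers) ->
    spawn(fun() ->
        erlang:monitor(process, Coordinator),
        NodeRefs = [{N, R} || #shard{node = N, ref = R} <- Workers],
        receive
            {'DOWN', _, _, Coordinator, _} ->
                rexi:kill_all(NodeRefs)
        end
    end).
```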
Those two steps may seem to happen almost instantly, one after the other. However, tracing the `init_p` call on the worker side, and logging the process info of the caller (the coordinator), shows that by the time the `init_p` function is called, the coordinator is often already dead. Since we never got to spawn the cleaner process, there is nothing to clean up these workers.
On the positive side, these workers don't actually do any work; they just wait in a receive clause, albeit with an open Db handle, which is not too great.
To fix this particular case we have to ensure the cleaner process starts even earlier: by the time the coordinator submits the jobs, the cleanup process should already be up and waiting, with the node-ref tuples at hand, ready to clean them up.
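A hypothetical shape of such a fix (`spawn_worker_cleaner/1` and `add_worker_to_cleaner/2` are assumed names, not the actual patch): spawn the cleaner before casting any job, then register each `{Node, Ref}` with it as jobs are submitted:

```erlang
%% Hypothetical sketch: the cleaner exists before the first worker is
%% cast, so a coordinator death at any point still triggers cleanup.
Cleaner = spawn_worker_cleaner(self()),
Workers0 = lists:map(
    fun(#shard{node = Node, name = Name} = Shard) ->
        Ref = rexi:cast(Node, {fabric_rpc, all_docs, [Name | Args]}),
        ok = add_worker_to_cleaner(Cleaner, {Node, Ref}),
        Shard#shard{ref = Ref}
    end,
    Shards
),
```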
So far in production we noticed most of the cases of `exit:timeout` errors generated by `rexi:init_stream` came from the quick killing of design doc fetches by the ddoc cache. That should be fixed by https://github.com/apache/couchdb/pull/5152. However, the analysis above is also correct that we do not clean up workers on errors or timeouts, except for a few expected error types only:
https://github.com/apache/couchdb/blob/a2241d36621e6bee101aad0d1bf19e52de1be3aa/src/fabric/src/fabric_streams.erl#L168-L171
In this PR we improve that and perform cleanup for all stream start errors, including timeouts.
While trying to understand why we'd encounter `rexi:init_stream` errors in https://github.com/apache/couchdb/issues/5122, I believe I've identified a pattern present in at least four of the fabric RPC related modules. I think `fabric_view_all_docs.erl` is a relatively straightforward representation of the issue, so I'm going to dissect the flow from there.

Step 1) Instantiate RPC workers
We first create a set of RPC workers on the remote nodes as specified in `Shards`. This creates the handle `Workers0` with a set of references to all instantiated RPC workers.
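A simplified sketch of this step (compare the `go/5` flow in `fabric_view_all_docs.erl`):

```erlang
%% Submit one RPC job per shard; the returned Refs identify the workers.
Workers0 = fabric_util:submit_jobs(
    Shards, fabric_rpc, all_docs, [Options, WorkerArgs]
),
```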
Step 2) Create a set of monitors for all remote nodes

This creates a set of monitors on the relevant remote rexi processes for each of the nodes in question, not the workers themselves:
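Roughly (simplified; internally `fabric_util:create_monitors/1` monitors the `rexi_server` process on each distinct node):

```erlang
%% Monitors are per node (the remote rexi servers), not per worker.
RexiMon = fabric_util:create_monitors(Workers0),
```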
Step 3) Handle `fabric_streams:start` in a `try ... after ... end` block

This invokes `fabric_streams:start` in a `try` block, so that in the `after` clause we invoke `rexi_monitor:stop(RexiMon)` to clear out the monitors.
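The skeleton of that control flow, with the clause bodies elided (the `handle_*` names here are placeholders for the clauses dissected in Step 4 below):

```erlang
try
    case fabric_streams:start(Workers0, #shard.ref) of
        {ok, Workers} ->
            handle_success(Workers);
        {timeout, NewState} ->
            handle_timeout(NewState);
        {error, Error} ->
            handle_error(Error)
    end
after
    %% Stops the monitoring process only; no worker cleanup happens here.
    rexi_monitor:stop(RexiMon)
end.
```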
Step 4) Handle the inner case clauses of Step 3)

First off, we have the successful case when the stream has been initialized:
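Roughly (simplified from `fabric_view_all_docs.erl`):

```erlang
{ok, Workers} ->
    try
        go(DbName, Options, Workers, CoordArgs, Callback, Acc)
    after
        %% Cleans up only the winning subset Workers, not Workers0.
        fabric_streams:cleanup(Workers)
    end;
```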
The key thing of note here is that this clause performs a `fabric_streams:cleanup(Workers)` in the `after` clause of a `try` block to ensure the remote workers are cleaned up after the job is done. However, the cleanup is performed against the subset of workers selected to perform the job in `Workers`, not the original full set of RPC workers instantiated and stored in `Workers0`.

Next we have the two failure cases for this fabric operation. I'll lump them together as their behavior is identical:
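Roughly:

```erlang
{timeout, NewState} ->
    DefunctWorkers = fabric_util:remove_done_workers(
        NewState#stream_acc.workers, waiting
    ),
    fabric_util:log_timeout(DefunctWorkers, "all_docs"),
    %% The error is reported to the callback, but no cleanup happens.
    Callback({error, timeout}, Acc);
{error, Error} ->
    %% Likewise: the error bubbles up and the workers are left running.
    Callback({error, Error}, Acc)
```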
Both of these failure clauses bubble up the error through the caller-provided `Callback`; however, neither performs any cleanup of the workers. In the outer `after` clause we do a `rexi_monitor:stop(RexiMon)`, but that only stops the dedicated monitoring process and does nothing for the workers themselves.

Core Issue
I think there are two things going on here we need to address:
1) RPC workers are not cleaned up at all upon `fabric_streams:start` error modes

I think this is fairly straightforward: we should always ensure workers are cleaned up, especially when failures happen. Basically, I think we should do a `fabric_streams:cleanup` on the workers in the outer `after` clause so they're always cleaned up, as sketched below.
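A minimal sketch of that restructuring (illustrative only, not an actual patch):

```erlang
try
    case fabric_streams:start(Workers0, #shard.ref) of
        %% ... success, timeout and error clauses as before ...
        {error, Error} ->
            Callback({error, Error}, Acc)
    end
after
    %% Always reap every instantiated worker, whether we succeeded,
    %% errored, or timed out; then stop the node monitors.
    fabric_streams:cleanup(Workers0),
    rexi_monitor:stop(RexiMon)
end.
```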
2) When we do call `fabric_streams:cleanup(Workers)`, it's on `Workers` instead of `Workers0`
This might be a bit more controversial, but I suspect one of the ways in which https://github.com/apache/couchdb/issues/5122 manifests is because we're not diligent about canceling RPC workers. It's possible that `fabric_streams:cleanup(Workers)` is sufficient, but I think `fabric_streams:cleanup(Workers0)` against the full original set of workers is appropriate.

3) Bonus item: we should consider moving the cleanup logic to the `rexi_mon` monitor
The core rationale here is that `after` clauses do not trigger when a process is killed, leaving the possibility of remote zombied RPC workers. In theory the remote nodes' `rexi_server` processes should get a process-down notification? Again, perhaps that's sufficient, but I'm personally inclined to do double bookkeeping in these types of scenarios, where we monitor from the RPC side and also send out a kill signal from the coordinator side. What do folks think?

Presence in the codebase
Right now I think I've identified this pattern in the following four fabric modules, although I've not done a full audit of the other modules, so there may be more instances of this: