digital-asset / daml

The Daml smart contract language
https://www.digitalasset.com/developers

Proposal to remove RecoveringIndexer in favor of a supervisor #4067

Open SamirTalwar opened 4 years ago

SamirTalwar commented 4 years ago

I believe the RecoveringIndexer class, used by the StandaloneIndexerServer, is doing too much.

Currently, RecoveringIndexer catches exceptions thrown by the indexer, at which point it waits 10 seconds (by default) and restarts it, ad infinitum. I don't think this behavior belongs here.
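
For illustration, the shape of that behavior is roughly the following (a sketch only, not the actual RecoveringIndexer code; the 10-second delay is just the default mentioned above):

```scala
import scala.concurrent.duration._

// Sketch only: subscribe, and whenever the subscription fails (or ends),
// wait a fixed delay and subscribe again, forever.
object RestartForeverSketch {
  def runForever(subscribe: () => Unit, restartDelay: FiniteDuration = 10.seconds): Unit =
    while (true) {
      try subscribe() // blocks until the indexer fails or completes
      catch {
        case e: Exception =>
          System.err.println(s"Indexer failed, restarting in $restartDelay: ${e.getMessage}")
      }
      Thread.sleep(restartDelay.toMillis)
    }
}
```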

As far as I can see, there are 5 different situations in which the StandaloneIndexerServer is used:

The Sandbox doesn't use StandaloneIndexerServer right now, but we expect it to do so in the near future.

In the production case, we expect the operator to be running the participant in a fault-tolerant environment; i.e. if the process crashes, it is logged and restarted. This includes environments such as Kubernetes or other schedulers, but also Docker, supervisord, supervise, etc. Restarting the indexer in-process will often work, but might also have negative effects that are outside the control of the operator, for example:

In development, we can probably assume that the developer will want to know as soon as possible when something's gone wrong. The most effective way to do this is to crash the process (or, in automated tests, cause the tests to fail with an exception). In this case, the RecoveringIndexer just gets in the way, causing timeouts in tests and making it look like things are working when in fact there are spurious failures.

When running in development against a real ledger (even PostgreSQL), ledger connectivity failures could be considered beyond the scope of developing the participant; if my ledger goes down, I want to know about it.

When developing logic against an in-memory ledger, if there's a failure in the indexer, I have bigger problems, and I should probably address those ASAP.

For all these reasons, I believe that maintainers and operators of a ledger participant would be better served by an indexer that throws an exception on failure, eventually leading to a crash in a production environment.
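
To make that concrete, here is a rough sketch of the alternative I have in mind (names here are illustrative, not the actual StandaloneIndexerServer API): run the indexer once and let any failure propagate, so the process exits and whatever supervises it decides whether to restart.

```scala
import scala.concurrent.{ExecutionContext, Future}

// Sketch only: no in-process retry. The caller (and ultimately the JVM)
// sees the failure; restarting is left to whatever supervises the process.
object FailFastIndexerSketch {
  def run(startIndexer: () => Future[Unit])(implicit ec: ExecutionContext): Future[Unit] =
    startIndexer().recoverWith { case failure =>
      System.err.println(s"Indexer failed, shutting down: ${failure.getMessage}")
      Future.failed(failure) // rethrow: no automatic restart
    }
}
```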

I'd love to hear others' thoughts on this and reasons why this is a bad idea. :-)

oliverse-da commented 4 years ago

At some point, Canton was planning to rely on the RecoveringIndexer to restart a "completed Akka source", but it currently has a workaround in place and a plan to move to a solution that would no longer need RecoveringIndexer either.

However, during that time I tried to think of production operability reasons for RecoveringIndexer "resubscribes" (not due to an error, but as part of normal operation) and came up with this: what if the underlying X/ledger is taken down for offline maintenance and, on shutting down, completes the Akka source? Wouldn't it be nice if the Indexer automatically picked up again once the X/ledger is brought back up and it is able to resubscribe? I'm not sure how compelling this is in practice, or whether it is consistent with the best practice of building systems that are resilient to fat-fingering (human operator mistakes).
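
For what it's worth, if automatic resubscription after maintenance turned out to be genuinely wanted, Akka Streams can express it at the subscription site without RecoveringIndexer. A rough sketch, assuming Akka 2.6.10+ and a made-up ledgerUpdates() stand-in for the real subscription:

```scala
import akka.actor.ActorSystem
import akka.stream.RestartSettings
import akka.stream.scaladsl.{RestartSource, Sink, Source}
import scala.concurrent.duration._

object ResubscribeSketch {
  implicit val system: ActorSystem = ActorSystem("resubscribe-sketch")

  // Hypothetical stand-in for the real ledger subscription: a source that
  // may complete when the underlying ledger goes down for maintenance.
  def ledgerUpdates(): Source[String, _] = Source(List("update-1", "update-2"))

  // RestartSource.withBackoff resubscribes on completion as well as on
  // failure, with exponential backoff between attempts.
  val settings = RestartSettings(minBackoff = 1.second, maxBackoff = 30.seconds, randomFactor = 0.2)

  def main(args: Array[String]): Unit =
    // The toy source completes immediately, so this keeps resubscribing.
    RestartSource.withBackoff(settings)(() => ledgerUpdates()).runWith(Sink.foreach(println))
}
```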

stefanobaghino-da commented 4 years ago

I'm very much in favor of this proposal. I believe that delegating this operational concern to the operator's choice makes way more sense than trying to address it in-process. In a production environment, a savvy operator would have to deal with this anyway in order to recover from the possibility of a JVM fatal error, so framing it as an operational issue that must be dealt with by the operator would also give them better control over how different failures are handled.

mziolekda commented 4 years ago

I agree with the proposal to remove the recovering indexer and to delegate the problems associated with a closed Akka source to operational concerns. What we also need, though, is a very clear indication of which kinds of errors warrant closing the Akka source and which should be recovered within the source implementation. At the moment the documentation does not mention this: https://github.com/digital-asset/daml/blob/e13f9a7edd784050b5e3c08b630dba70eef0a0fd/ledger/participant-state/src/main/scala/com/daml/ledger/participant/state/v1/ReadService.scala#L168
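
To illustrate the split I mean, a rough sketch (TransientLedgerError is a made-up type, not anything in the codebase): errors the implementation knows to be transient are retried inside the source, and everything else fails the source and becomes an operational concern.

```scala
import akka.NotUsed
import akka.stream.scaladsl.Source

object ErrorSplitSketch {
  // Hypothetical marker for errors the implementation knows to be transient.
  final case class TransientLedgerError(msg: String) extends RuntimeException(msg)

  def stateUpdates(subscribe: () => Source[String, NotUsed]): Source[String, NotUsed] =
    subscribe().recoverWithRetries(3, {
      // Retry (up to 3 times) only errors known to be transient;
      // anything else fails the source with no in-process recovery.
      case _: TransientLedgerError => subscribe()
    })
}
```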

ghost commented 4 years ago

I agree. I'm not sure there's any kind of error that can be handled well at this point though.

Let's split the types of errors into, let's say, software bugs, data errors, system failures, connectivity failures, and deployment/maintenance mix-ups.

Software bugs should be flagged as soon as possible. Recovering will just make it harder to spot them. If they're persistent, recovery will just trigger the same bug over and over; if they're flaky, then real bugs might be masked.

Data errors can't be fixed by recovery. If there's a problem, we need to fix it upstream. The only reasonable thing to do here is to stop operating until the issue is fixed.

System failures, such as running out of memory or disk space, will just be made worse by a refusal to crash.

Connectivity failures might be remedied by recovery; it depends on whether the issue is transient or persistent. If the failure is transient, recovery makes sense, but if it's persistent, it will just be annoying and will disguise the issue (as described above). Transient failures will be handled just as well by restarting the process, and since we can't tell the difference from within the process, I think it makes sense to let the scheduler handle this.

Deployment/maintenance mix-ups, such as deploying the wrong version, upgrading in the wrong order, or pointing to the wrong database, cannot be solved with a restart. It's an error caused by the operator (not going into the specifics of what the root cause might be) and needs to be fixed by the operator.


I think the only one of these where a restart makes sense is a transient connectivity failure. However, from inside the process, we cannot tell what's going on, and so I would consider it fairly arrogant to assume we can fix it from within.
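
For completeness, "let the scheduler handle this" can be as little as a restart policy in the deployment configuration. A hypothetical Kubernetes pod spec fragment (resource and image names are made up):

```yaml
# Hypothetical example: the kubelet, not the participant process,
# is responsible for restarting the indexer after a crash.
apiVersion: v1
kind: Pod
metadata:
  name: ledger-participant
spec:
  restartPolicy: Always          # restart the container whenever it exits
  containers:
    - name: participant
      image: example/ledger-participant:latest   # made-up image
```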