changefeedccl: clarify changefeed retry behavior

Changefeeds send data to external systems (sinks) over the network. Over the life of a changefeed, attempts to send data to a sink are likely to fail because of network errors, cloud permissions problems, server-level errors from the sink, and more. For most errors, we want to retry these external requests, as the issue is most likely transient. Currently, there are three places in which a changefeed may be retried:

Retries inside the implementation of the sink
Retries inside the "Resume" method of our changefeed Resumer: https://github.com/cockroachdb/cockroach/blob/master/pkg/ccl/changefeedccl/changefeed_stmt.go#L581-L590
Retries inside the job system itself (it is unclear to me which cases will hit this, given the retry loop inside the resumer).

We've hit a number of problems recently with our current retries:

Likely fatal errors are not raised to the user: https://github.com/cockroachdb/cockroach/issues/62556 https://github.com/cockroachdb/cockroach/issues/43216
Likely transient errors result in a fatal failure: https://github.com/cockroachdb/cockroach/issues/63317
We have very little insight into the retries happening inside the sinks: https://github.com/cockroachdb/cockroach/issues/49708
The cloud sinks have very different retry strategies. This makes it hard to describe and document the current retry behavior. Further, many of the default retry strategies have proven not to be particularly useful for the CDC use case.
The retry behaviour is not configurable.

Addressing all of these issues will likely require a number of changes. In recent conversations, we discussed some initial improvements we could make:

Support caller-provided retry policies in the cloud storage implementations: https://github.com/cockroachdb/cockroach/issues/64645
Improve the top-level changefeed retry behaviour to back off more aggressively, and perhaps to stop retrying if no forward progress has been made and some maximum number of retries has been attempted.
Arrange for all errors from changefeeds to be retriable by default.

From there, we can improve our ability to recognise fatal errors and explicitly mark them as fatal. Improving retries in the Kafka sink may further require that we rely less on the sarama library.

Jira issue: CRDB-7181

changefeedccl: clarify changefeed retry behavior #64646