cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com

cdc: Investigate semantics of PTS management during node drain. #114545

Open miretskiy opened 11 months ago

miretskiy commented 11 months ago

The changefeed runs with the on_error=pause and protect_gc_on_pause options.

It appears that the following is possible: a node is being drained (this is just an example event -- maybe other events could result in this).

The changefeed resumer receives an error such that resumeWithRetries returns to the caller. Normally this shouldn't happen, but some errors are propagated to the caller: for example, when resumeWithRetries fails to reload the job information, which can happen for any number of reasons.

The caller, Resumer, tries to handle the error based on the configured options:

err := b.resumeWithRetries(ctx, jobExec, jobID, details, progress, execCfg)
if err != nil {
    return b.handleChangefeedError(ctx, err, details, jobExec)
}

In this case, an attempt to pause should be made. However, handleChangefeedError itself returns an error when it tries to pause the changefeed:

return b.job.NoTxn().PauseRequestedWithFunc(ctx, func(ctx context.Context,
  ....

Thereafter, that error should be handled by the jobs system appropriately. In theory, if the error was any sort of context cancellation, the job system should recognize this and bail out. But maybe something strange could happen that would make the job system consider the changefeed failed. In that case, OnFailOrCancel is invoked, and that method will drop the PTS record, which is not good. The job itself may not be marked as failed in the job registry because (maybe) the job system may not be able to update the job record with the failed status... The changefeed will then resume somewhere else, but now there is a possibility that it will fail because of the missing PTS record.

All of this is just hypothetical. We don't know if this is what's happening, or how likely this is to happen. We need to investigate (i.e. test) this flow.

This might be related to a customer issue. Informs https://github.com/cockroachlabs/support/issues/2725

Jira issue: CRDB-33552

blathers-crl[bot] commented 11 months ago

cc @cockroachdb/cdc