The changefeed runs with the on_error=pause and protect_gc_on_pause options.

It appears that the following sequence is possible. A node is being drained (this is just an example event; other events might trigger the same sequence), and the changefeed resumer receives an error such that resumeWithRetries returns to the caller. Normally this shouldn't happen, but some errors are propagated to the caller, for example when resumeWithRetries fails to reload the job information, which can happen for any number of reasons.

The caller, the Resumer, tries to handle the error based on the configured options. In this case, an attempt to pause should be made. However, handleChangefeedError itself returns an error when it tries to pause the changefeed.

Thereafter, that error should be handled by the jobs system appropriately. In theory, if the error was any sort of context cancellation, the jobs system should recognize this and bail out. But perhaps something unexpected could make the jobs system consider the changefeed failed. In that case, OnFailOrCancel is invoked, and that method drops the PTS (protected timestamp) record, which is not good. The job itself may not be marked as failed in the job registry, because the jobs system may (possibly) be unable to update the job record with the failed status. The changefeed will then resume somewhere else, but now there is a possibility that it will fail because of the missing PTS record.

All of this is hypothetical. We don't know whether this is what's happening, or how likely it is to happen. We need to investigate (i.e. test) this flow.
This might be related to a customer issue. Informs https://github.com/cockroachlabs/support/issues/2725
Jira issue: CRDB-33552