confluentinc / kafka-connect-jdbc

Kafka Connect connector for JDBC-compatible databases

[CC-21208] Fixed connector pause sending connector into provisioning status on CC #1348

Closed Tanish0019 closed 11 months ago

Tanish0019 commented 1 year ago

Problem

Currently, the JDBC Source connector can get stuck in the provisioning state in Confluent Cloud if a pause and a restart are triggered at nearly the same time. Although rare, we have seen it happen a couple of times for some customers. This happens because of how the taskConfigs method is defined in the connector. When a connector is paused, the task target state is set to PAUSED and the connector's stop method is called. If this happens right around a restart, which can be triggered by a config change, then:

  1. Connector is stopped (due to restart)
  2. Connector is started again (due to restart)
  3. Connector is stopped again (due to pause)

When the connector is between steps 2 and 3, the taskConfigs method inside the connector is called, and it returns 0 tasks from here. No tasks means the connector is in provisioning. Even if you restart the connector now, its target state remains PAUSED and new tasks won't initialize, as they are blocked in this section.
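The zero-task window described above can be modeled with a small, self-contained sketch. This is not the connector's actual code: `ZeroTaskRaceSketch`, `TableMonitorModel`, and the `tables` key are hypothetical stand-ins, and the only behavior assumed from the description is that `taskConfigs` returns an empty list until the first table fetch completes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model for illustration; not the real JdbcSourceConnector.
public class ZeroTaskRaceSketch {

    /** Stand-in for the background component that lists tables from the database. */
    static class TableMonitorModel {
        // null until the first fetch from the database has completed
        private volatile List<String> tables = null;

        void completeFetch(List<String> fetched) {
            tables = new ArrayList<>(fetched);
        }

        List<String> tables() {
            return tables;
        }
    }

    /** Returns no task configs while the table list is still unknown or empty. */
    static List<Map<String, String>> taskConfigs(TableMonitorModel monitor) {
        List<String> current = monitor.tables();
        if (current == null || current.isEmpty()) {
            // Zero tasks: this is the window in which the connector appears as provisioning.
            return Collections.emptyList();
        }
        Map<String, String> props = new HashMap<>();
        props.put("tables", String.join(",", current)); // assumed key name
        List<Map<String, String>> configs = new ArrayList<>();
        configs.add(props);
        return configs;
    }

    public static void main(String[] args) {
        TableMonitorModel monitor = new TableMonitorModel();
        // The connector has just been started again, but the fetch has not finished.
        System.out.println("before fetch: " + taskConfigs(monitor).size() + " task(s)");
        monitor.completeFetch(Arrays.asList("orders", "customers"));
        System.out.println("after fetch: " + taskConfigs(monitor).size() + " task(s)");
    }
}
```

If the pause lands while the model is still in the "before fetch" state, the empty task list is what gets recorded, which matches the stuck state described above.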

The only way to get it working again is to resume the connector, but because the connector goes into provisioning, the resume button is not available to the customer.

Solution

A simple solution would be to return 1 task with an empty TABLE_CONFIG; however, this would lead to a task failure each time the connector starts, until all tables are fetched from the DB. This call is quite slow and can take up to a minute, so the connector would stay in a failed state, which is a pretty bad experience.
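To illustrate why this simple option is unattractive, here is a hypothetical sketch of a task that validates its table list on start. `NaiveSingleTaskSketch`, `NaiveTaskModel`, the `tables` key, and the exception type are assumptions made for illustration, not the connector's code.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model for illustration; not the real JdbcSourceTask.
public class NaiveSingleTaskSketch {

    static class NaiveTaskModel {
        /** Fails fast when no tables were assigned, as a task with nothing to read would. */
        void start(Map<String, String> props) {
            String tables = props.getOrDefault("tables", ""); // assumed key name
            if (tables.isEmpty()) {
                throw new IllegalStateException("no tables assigned to task");
            }
        }
    }

    /** Returns true if the task starts, false if it fails on start. */
    static boolean startsCleanly(Map<String, String> props) {
        try {
            new NaiveTaskModel().start(props);
            return true;
        } catch (IllegalStateException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        Map<String, String> beforeFetch = new HashMap<>();
        beforeFetch.put("tables", ""); // table list not yet fetched
        System.out.println("starts before fetch: " + startsCleanly(beforeFetch));

        Map<String, String> afterFetch = new HashMap<>();
        afterFetch.put("tables", "orders,customers");
        System.out.println("starts after fetch: " + startsCleanly(afterFetch));
    }
}
```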

SourceTaskContext cannot be used either, because the update needs to happen in the connector class inside the taskReconfig method, where we don't have access to it.

This PR addresses this by adding a new config for tasks, TABLES_FETCHED, which signifies whether the first call to fetch tables has finished. Even when the call has not finished, we create a single task, but that task does nothing in its start and poll methods. Once the tables are fetched from the database, the task is reconfigured and starts working as usual.
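A minimal sketch of this idea, under stated assumptions, might look like the following. The class names, the key strings `tables` and `tables.fetched`, and the fake records are hypothetical; the actual change lives in the connector and task classes of this PR and may differ in detail.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of the placeholder-task approach; not the PR's actual code.
public class TablesFetchedSketch {
    static final String TABLES = "tables";                 // assumed key name
    static final String TABLES_FETCHED = "tables.fetched"; // assumed key name

    /** Always returns exactly one task config; null means the first fetch is still pending. */
    static List<Map<String, String>> taskConfigs(List<String> fetchedTables) {
        Map<String, String> props = new HashMap<>();
        if (fetchedTables == null) {
            props.put(TABLES, "");
            props.put(TABLES_FETCHED, "false"); // placeholder task: do nothing yet
        } else {
            props.put(TABLES, String.join(",", fetchedTables));
            props.put(TABLES_FETCHED, "true");
        }
        return Collections.singletonList(props);
    }

    static class TaskModel {
        private boolean tablesFetched;
        private List<String> tables = Collections.emptyList();

        void start(Map<String, String> props) {
            tablesFetched = Boolean.parseBoolean(props.get(TABLES_FETCHED));
            String raw = props.getOrDefault(TABLES, "");
            if (!tablesFetched || raw.isEmpty()) {
                return; // no-op start: wait for the reconfiguration with real tables
            }
            tables = Arrays.asList(raw.split(","));
        }

        /** Returns fake records, or nothing while the task is only a placeholder. */
        List<String> poll() {
            if (!tablesFetched) {
                return Collections.emptyList();
            }
            List<String> records = new ArrayList<>();
            for (String table : tables) {
                records.add("record-from-" + table);
            }
            return records;
        }
    }

    public static void main(String[] args) {
        TaskModel placeholder = new TaskModel();
        placeholder.start(taskConfigs(null).get(0));
        System.out.println("placeholder poll: " + placeholder.poll());

        TaskModel working = new TaskModel();
        working.start(taskConfigs(Arrays.asList("orders", "customers")).get(0));
        System.out.println("working poll: " + working.poll());
    }
}
```

Because the connector in this model never reports zero tasks, it does not fall into the provisioning state while the table list is still loading, and the placeholder task neither fails nor produces records.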

Does this solution apply anywhere else?
If yes, where?

Test Strategy

Need to test on CC once.

Testing done:

Release Plan