Open jurgispods opened 12 months ago
Thanks @jurgispods for taking the time to write the detailed example and explain the issue. I will take some time to go over the changes and get back to you in two weeks. Meanwhile, would you be able to add an integration test for this change, please?
@b-goyal Sure, I can add an integration test.
@b-goyal I just added an integration test that reproduces the issue (and the fix).
I found out it only shows under certain circumstances, i.e. when a schema update happens after the intermediate table has been deleted. Otherwise, schemas of destination and intermediate tables are always in sync, as they are updated using the same logic.
As far as I can see, deletion of the intermediate table only happens when the connector is stopped. So in order to replicate the error, I had to write an integration test that is quite involved:
In order to show that the connector indeed fails, I added a config for toggling my fix on or off. That might not be necessary in the final PR, as in reality, it should always be on. We could instead test that with a unit test and remove the added config.
Thanks for adding the integration test @jurgispods. Had an initial look but did not follow the root cause and resolution. Will need some more time to review this.
Hi @b-goyal, is there an update on this?
When using the connector in upsert/delete mode, it can fail under certain circumstances when the schema is updated in such a way that the intermediate table and the destination table have differently ordered nested struct fields.
Example scenario
Schema version 1
Assume the Kafka source topic has the following Avro schema (version 1):
The corresponding BigQuery destination table schema:
Schema version 2
Now, the source table schema is updated to version 2:
The problem now is that the BigQuery schemas of the intermediate and destination tables will have different orders of nested fields.
BigQuery schema of the intermediate table after creation:
Updated BigQuery destination table schema (note that the new field `maxAmount` is appended at the end):
The connector will subsequently fail during the periodic merge flush:
This can be easily seen by looking at the executed MERGE queries.
Comparison of executed MERGE queries
This query will fail due to different orders of nested fields:
In contrast, this query succeeds:
We can see that for upserts, the order of struct fields matters.
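The mismatch described above can be made concrete with a small check. This is only a sketch: schemas are modelled as plain ordered maps rather than with the BigQuery client types, and every field name except `maxAmount` (`id`, `limits`, `minAmount`, `currency`) is hypothetical, as are the class and method names.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of the connector.
final class StructOrderCheck {

    // Builds an ordered schema from alternating (name, type) arguments.
    // A type is either a type name (String) or a nested struct (Map).
    static Map<String, Object> struct(Object... nameThenType) {
        Map<String, Object> fields = new LinkedHashMap<>();
        for (int i = 0; i < nameThenType.length; i += 2) {
            fields.put((String) nameThenType[i], nameThenType[i + 1]);
        }
        return fields;
    }

    // Returns true only if both schemas list the same field names in the
    // same order, recursively for nested structs.
    @SuppressWarnings("unchecked")
    static boolean sameFieldOrder(Map<String, Object> a, Map<String, Object> b) {
        List<String> namesA = new ArrayList<>(a.keySet());
        List<String> namesB = new ArrayList<>(b.keySet());
        if (!namesA.equals(namesB)) {
            return false;
        }
        for (String name : namesA) {
            Object fieldA = a.get(name);
            Object fieldB = b.get(name);
            if (fieldA instanceof Map && fieldB instanceof Map
                    && !sameFieldOrder((Map<String, Object>) fieldA, (Map<String, Object>) fieldB)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Destination table: the new field was appended at the end of the struct.
        Map<String, Object> destination = struct("id", "STRING", "limits",
                struct("minAmount", "FLOAT", "currency", "STRING", "maxAmount", "FLOAT"));
        // Intermediate table: created fresh, so it follows the source schema order.
        Map<String, Object> intermediate = struct("id", "STRING", "limits",
                struct("minAmount", "FLOAT", "maxAmount", "FLOAT", "currency", "STRING"));
        System.out.println(sameFieldOrder(destination, intermediate));
    }
}
```

Both tables contain the same fields, so a comparison that ignores order would not notice the difference; only the order-sensitive check reports it.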
Proposed changes
In this PR, I have added the destination table schema to the list returned by `SchemaManager.getSchemasList` when it is called for an intermediate table in upsert/merge mode. That way, the intermediate table schema is forced to respect the order of nested fields in the destination table schema: schema updates are simply applied on top of it, ensuring the same field order in both tables when new fields are added.

Please let me know what you think of this approach.
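To illustrate the ordering effect of putting the destination schema first in the list, here is a minimal sketch. It assumes that schemas are merged in list order, with fields already present keeping their position and new fields being appended. It does not use the connector's real classes: `SchemaUnionSketch`, `unionInOrder`, and all field names other than `maxAmount` are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of an order-preserving schema union, not connector code.
final class SchemaUnionSketch {

    // Builds an ordered schema from alternating (name, type) arguments.
    static Map<String, Object> struct(Object... nameThenType) {
        Map<String, Object> fields = new LinkedHashMap<>();
        for (int i = 0; i < nameThenType.length; i += 2) {
            fields.put((String) nameThenType[i], nameThenType[i + 1]);
        }
        return fields;
    }

    // Merges the schemas in list order: earlier schemas fix the position of
    // the fields they contain, later schemas can only append new fields.
    static Map<String, Object> unionInOrder(List<Map<String, Object>> schemas) {
        Map<String, Object> result = new LinkedHashMap<>();
        for (Map<String, Object> schema : schemas) {
            mergeInto(result, schema);
        }
        return result;
    }

    @SuppressWarnings("unchecked")
    private static void mergeInto(Map<String, Object> target, Map<String, Object> source) {
        for (Map.Entry<String, Object> field : source.entrySet()) {
            Object existing = target.get(field.getKey());
            if (existing instanceof Map && field.getValue() instanceof Map) {
                // Nested struct on both sides: merge recursively, keeping order.
                mergeInto((Map<String, Object>) existing, (Map<String, Object>) field.getValue());
            } else if (existing == null) {
                // New field: append a copy so the inputs are never mutated.
                target.put(field.getKey(), deepCopy(field.getValue()));
            }
            // Otherwise the field already exists and keeps its position and type.
        }
    }

    @SuppressWarnings("unchecked")
    private static Object deepCopy(Object type) {
        if (!(type instanceof Map)) {
            return type;
        }
        Map<String, Object> copy = new LinkedHashMap<>();
        ((Map<String, Object>) type).forEach((name, nested) -> copy.put(name, deepCopy(nested)));
        return copy;
    }

    // Field names of a nested struct, in schema order.
    @SuppressWarnings("unchecked")
    static List<String> nestedNames(Map<String, Object> schema, String structField) {
        return new ArrayList<>(((Map<String, Object>) schema.get(structField)).keySet());
    }

    public static void main(String[] args) {
        Map<String, Object> destination = struct("id", "STRING", "limits",
                struct("minAmount", "FLOAT", "currency", "STRING"));
        Map<String, Object> sourceV2 = struct("id", "STRING", "limits",
                struct("minAmount", "FLOAT", "maxAmount", "FLOAT", "currency", "STRING"));

        // Without the destination schema, the source order wins.
        System.out.println(nestedNames(unionInOrder(List.of(sourceV2)), "limits"));
        // With the destination schema first, maxAmount lands at the end.
        System.out.println(nestedNames(unionInOrder(List.of(destination, sourceV2)), "limits"));
    }
}
```

In this model, the second call yields the same nested field order that the destination table ends up with after its own schema update, which is the property the change relies on.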