confluentinc / kafka-connect-bigquery

A Kafka Connect BigQuery sink connector
Apache License 2.0

Support for Storage Write API #288

Open SamoylovMD opened 1 year ago

SamoylovMD commented 1 year ago

The official BigQuery documentation states:

For new projects, we recommend using the BigQuery Storage Write API instead of the tabledata.insertAll method. The Storage Write API has lower pricing and more robust features, including exactly-once delivery semantics. The tabledata.insertAll method is still fully supported.

It looks like the currently used write method may be gradually deprecated in the not-so-distant future. Also, the new write method costs roughly half as much as the old one.

Here is the documentation for Storage Write API.

Do you have any work on this in the roadmap, and if not, how should the community request it?

ekapratama93 commented 1 year ago

Supporting this API is a good idea, since BigQuery now supports auto-merging CDC data, so a temporary table is no longer required.

https://cloud.google.com/blog/products/data-analytics/bigquery-gains-change-data-capture-functionality

jrkinley commented 1 year ago

Thoughts on adding a new BigQueryWriter implementation for the new Storage Write API that can be enabled in configuration, as opposed to modifying the existing writers AdaptiveBigQueryWriter and SimpleBigQueryWriter that use the legacy tabledata.insertAll API?
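For illustration, here is a minimal, self-contained sketch of that idea: keep the legacy path as the default and switch writers only when an opt-in flag is set. The useStorageWriteApi property name and all class names below are hypothetical stand-ins that only show the shape of the change, not the connector's actual internals.

```java
import java.util.Map;

// Hypothetical sketch: pick a writer implementation based on an opt-in config flag,
// leaving the existing insertAll-based writers as the default behaviour.
public class WriterSelectionSketch {

    interface RowWriter {                       // stand-in for the connector's BigQueryWriter
        void write(Map<String, Object> row);
    }

    static class LegacyInsertAllWriter implements RowWriter {      // tabledata.insertAll path
        @Override public void write(Map<String, Object> row) { /* insertAll(...) */ }
    }

    static class StorageWriteApiWriter implements RowWriter {      // Storage Write API path
        @Override public void write(Map<String, Object> row) { /* JsonStreamWriter.append(...) */ }
    }

    // "useStorageWriteApi" is a hypothetical property name; defaults to the current behaviour.
    static RowWriter forConfig(Map<String, String> props) {
        boolean useStorageWriteApi =
                Boolean.parseBoolean(props.getOrDefault("useStorageWriteApi", "false"));
        return useStorageWriteApi ? new StorageWriteApiWriter() : new LegacyInsertAllWriter();
    }

    public static void main(String[] args) {
        RowWriter writer = forConfig(Map.of("useStorageWriteApi", "true"));
        System.out.println("Selected writer: " + writer.getClass().getSimpleName());
    }
}
```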

james-johnston-thumbtack commented 1 year ago

Supporting this API is a good idea, since BigQuery now supports auto-merging CDC data, so a temporary table is no longer required.

https://cloud.google.com/blog/products/data-analytics/bigquery-gains-change-data-capture-functionality

This seems really compelling because it would greatly simplify the operation of the connector when upsertEnabled or deleteEnabled is set. Reading the blog, it sounds like they have essentially abstracted the same MERGE operations behind the new API, so the connector doesn't have to do it any more: the max_staleness value sounds an awful lot like mergeIntervalMs!

sendbird-sehwankim commented 8 months ago

The Storage Write API feature used to be included in this release (https://github.com/confluentinc/kafka-connect-bigquery/tree/v2.6.0-rc-2faec09), but it was later removed. Now, on Confluent Cloud there's a new connector called BigQuery Sink Connector V2 which supports only the Storage Write API. That plugin is not open source, and it is only available on Confluent Cloud. I guess Confluent has no plan to add this feature to this open-source plugin.

C0urante commented 7 months ago

@yashmayya @ashwinpankaj any idea if this feature will be limited to Confluent Cloud only, or if it will be merged+released to this open source repo as well?

jvigneshr commented 7 months ago

We do not plan to open-source the new connector in the near future.

C0urante commented 7 months ago

@jvigneshr the connector is already open source (check the license file); I'm guessing you mean you no longer plan to maintain this project or at least release new features for it?

jvigneshr commented 7 months ago

Apologies. I meant the new BigQuery V2 connector. (Edited the previous comment too)

C0urante commented 7 months ago

Nobody's asking about V2. Are you or are you not still maintaining this project?

magnich commented 7 months ago

We are still supporting the project. The Storage Write API is a completely new API without a valid migration path from the existing connector, so support for it was built as a new connector in Confluent Cloud, along with other new features such as OAuth 2.0, schema contexts, and the reference subject naming strategy.

If we end up building a self-managed version of the connector it will be open source.

C0urante commented 7 months ago

Uh huh. So if someone else implemented support for the Storage Write API with this project it'd be reviewed and merged in a reasonable timeframe? (I personally doubt that the API is completely incompatible with this connector, and even if it is, a 3.0.0 release with some compatibility-breaking changes several years after 2.0.0 is completely reasonable.)

criccomini commented 7 months ago

@jvigneshr @b-goyal can y'all weigh in on the issues you had with your 2.6 code? Would love to know what needs to be done to make it work for this connector.

ragepati commented 7 months ago

Our initial approach was to replace the insertAll API with the Storage Write API in the existing BQ connector code base. During development, we observed certain incompatibilities between these two APIs. Some of these were not documented by Google.

An example is in the handling of data types: a timestamp represented as a String (e.g. 2023-12-15 13:14:15) can be ingested successfully into a BigQuery DATETIME column with the insertAll API, but not with the Storage Write API (it fails to parse the text). It was not feasible to identify all such differences, since the insertAll API does not document the set of literal values it can cast to a DATETIME.
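For context, a minimal sketch of the legacy path being described, streaming that string through tabledata.insertAll via the google-cloud-bigquery Java client; the dataset, table, and column names ("my_dataset", "my_table", "dt") are placeholders, and "dt" is assumed to be a DATETIME column:

```java
import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.InsertAllRequest;
import com.google.cloud.bigquery.InsertAllResponse;
import com.google.cloud.bigquery.TableId;
import java.util.Map;

public class InsertAllDatetimeStringSketch {
    public static void main(String[] args) {
        BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

        // Legacy streaming insert (tabledata.insertAll): the string value is coerced
        // into the destination DATETIME column by the backend.
        InsertAllResponse response = bigquery.insertAll(
                InsertAllRequest.newBuilder(TableId.of("my_dataset", "my_table"))
                        .addRow(Map.of("dt", "2023-12-15 13:14:15"))
                        .build());

        if (response.hasErrors()) {
            System.err.println("Insert errors: " + response.getInsertErrors());
        }
    }
}
```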

We learned from Google that the newer API is not 100% backward compatible and there is no insertAll to Storage Write API migration guide.

So we built the V2 connector in Confluent Cloud as a new plugin type, along with other new features, making it explicit to customers that there are incompatible changes between the two versions. We also documented the supported data types for the Storage Write API.

C0urante commented 7 months ago

I've just personally verified that, using a JsonStreamWriter without an explicitly-specified schema, the Java string 2023-12-15 13:14:15 can be successfully written to a DATETIME column.
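For reference, a minimal sketch of that kind of check, assuming a recent google-cloud-bigquerystorage client in which JsonStreamWriter can fetch the write schema from the table itself; the project, dataset, table, and column names are placeholders, and the "dt" column is assumed to be DATETIME:

```java
import com.google.api.core.ApiFuture;
import com.google.cloud.bigquery.storage.v1.AppendRowsResponse;
import com.google.cloud.bigquery.storage.v1.BigQueryWriteClient;
import com.google.cloud.bigquery.storage.v1.JsonStreamWriter;
import org.json.JSONArray;
import org.json.JSONObject;

public class StorageWriteDatetimeStringSketch {
    public static void main(String[] args) throws Exception {
        String table = "projects/my-project/datasets/my_dataset/tables/my_table";

        try (BigQueryWriteClient client = BigQueryWriteClient.create();
             // No TableSchema is passed here: the writer fetches the schema from the
             // table, matching the "without an explicitly-specified schema" setup above.
             JsonStreamWriter writer = JsonStreamWriter.newBuilder(table, client).build()) {

            JSONObject row = new JSONObject();
            row.put("dt", "2023-12-15 13:14:15"); // string destined for a DATETIME column

            ApiFuture<AppendRowsResponse> future = writer.append(new JSONArray().put(row));
            future.get(); // throws if the append (e.g. the DATETIME parse) is rejected
        }
    }
}
```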

That aside, if Storage Write API support were added as an opt-in feature (which, based on the reverted commits in #357, appears to have been the plan), surely these kinds of incompatibilities wouldn't be a problem?

At this point it seems like the motivation here is more to get people to pay for a proprietary fork of this project instead of continuing to maintain the open source variant. I don't think anyone who's using this project on their own would agree that they need to be "protected" from small incompatibilities in data type handling by being forced to switch over to a paid alternative instead of just tweaking a boolean property in their connector config.

james-johnston-thumbtack commented 7 months ago

Definitely agree, and can confirm that as end-users, we would be fine with adjusting a few config options to opt in to the new storage write API support, and use config options to tweak data type conversions as needed.

C0urante commented 5 months ago

Heads up to everyone involved here: Aiven has decided to fork the project, pull in the code that was removed in https://github.com/confluentinc/kafka-connect-bigquery/pull/357, and begin maintaining their own version of the connector. You can find it here. We've published a 2.6.0 release that contains support for the Storage Write API as a beta feature and would be happy to get feedback from anyone interested in trying it out.

cc @SamoylovMD @ekapratama93 @jrkinley @james-johnston-thumbtack @LarsKlingen @andrelu @Ironaki @agoloborodko @sendbird-sehwankim @quassy @aakarshg @whittid4 @corleyma @criccomini @bakuljajan

@magnich @ragepati @jvigneshr Feel free to close this issue if you have no plans on addressing it. It'd be nice to give people a clear signal about which fork they should contribute to/utilize if the Storage Write API is a priority to them.

jvigneshr commented 4 months ago

When we release a self-managed version of the BQ v2 connector, we will make it open-source. It is on the product roadmap, but we don't have a timeline to share yet.

C0urante commented 4 months ago

Then this issue should be closed, since you have no plans of addressing it on this project.