apache / hudi

Upserts, Deletes And Incremental Processing on Big Data.
https://hudi.apache.org/
Apache License 2.0

[SUPPORT] BQ synch tool not working with HUDI bundle jar #10629

Closed: masthanmca closed this issue 7 months ago

masthanmca commented 8 months ago


Describe the problem you faced

BQ sync is not working with the Hudi bundle jar. I wanted to enable BQ sync while ingesting data into a Hudi table using a manifest file.

To Reproduce

Steps to reproduce the behavior:

  1. create data frame with any schema
  2. use the below options for Bq sync along with the other default HUDI configurations
  3. hiveConfigs.put("org.apache.hudi.gcp.bigquery.BigQuerySyncTool", "true")
     hiveConfigs.put("hoodie.gcp.bigquery.sync.project_id", bqSyncProjectId)
     hiveConfigs.put("hoodie.gcp.bigquery.sync.dataset_name", bqSyncDatasetName)
     hiveConfigs.put("hoodie.gcp.bigquery.sync.table_name", hoodieHiveSyncTable)
     hiveConfigs.put("hoodie.gcp.bigquery.sync.dataset_location", "us")
     hiveConfigs.put("hoodie.gcp.bigquery.sync.source_uri", bqSyncSourceUri)
     hiveConfigs.put("hoodie.gcp.bigquery.sync.source_uri_prefix", bqSyncSourceUriPrefix)
     hiveConfigs.put("hoodie.gcp.bigquery.sync.base_path", bqSyncBasePath)
     hiveConfigs.put("hoodie.gcp.bigquery.sync.partition_fields", hoodieHiveSyncPartitionFields)
     hiveConfigs.put("hoodie.gcp.bigquery.sync.use_bq_manifest_file", "true")
  4. write the data frame to the Hudi table (a runnable sketch of this write follows these steps):
     ds.write.format(HudiFormat).options(hoodieConfigs).options(hiveConfigs).mode(writeMode).save(location)
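For reference, a minimal Scala/Spark sketch of the write described in the steps above. The option keys are the ones from step 3; the table name, record key, partition field, and GCS paths are hypothetical placeholders:

// Minimal sketch of the reported write path; values are placeholders.
import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}

val spark = SparkSession.builder().appName("hudi-bq-sync-repro").getOrCreate()
import spark.implicits._

val ds: DataFrame = Seq((1, "a", "2024-01-01")).toDF("id", "name", "dt")

val hoodieConfigs = Map(
  "hoodie.table.name"                           -> "my_hudi_table",
  "hoodie.datasource.write.recordkey.field"     -> "id",
  "hoodie.datasource.write.partitionpath.field" -> "dt"
)

// BQ sync options as reported in step 3 (the first key is the one the
// maintainers later flag as not being a valid config key).
val hiveConfigs = Map(
  "org.apache.hudi.gcp.bigquery.BigQuerySyncTool" -> "true",
  "hoodie.gcp.bigquery.sync.project_id"           -> "my-gcp-project",
  "hoodie.gcp.bigquery.sync.dataset_name"         -> "my_dataset",
  "hoodie.gcp.bigquery.sync.table_name"           -> "my_hudi_table",
  "hoodie.gcp.bigquery.sync.dataset_location"     -> "us",
  "hoodie.gcp.bigquery.sync.source_uri"           -> "gs://my-bucket/my_hudi_table/dt=*",
  "hoodie.gcp.bigquery.sync.source_uri_prefix"    -> "gs://my-bucket/my_hudi_table/",
  "hoodie.gcp.bigquery.sync.base_path"            -> "gs://my-bucket/my_hudi_table",
  "hoodie.gcp.bigquery.sync.partition_fields"     -> "dt",
  "hoodie.gcp.bigquery.sync.use_bq_manifest_file" -> "true"
)

ds.write.format("hudi")
  .options(hoodieConfigs)
  .options(hiveConfigs)
  .mode(SaveMode.Append)
  .save("gs://my-bucket/my_hudi_table")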

Expected behavior

The BigQuery external table should be created automatically as part of the Hudi write (meta sync).

Environment Description

Additional context


Stacktrace

No error, but the external table is not created in BigQuery.

ad1happy2go commented 8 months ago

@masthanmca Is this the first time you are facing this issue, or did it start after an upgrade?

Your configuration also looks wrong. Where did you get these settings, or which doc did you refer to? Can you refer to https://hudi.apache.org/docs/gcp_bigquery/?

abhishekshenoy commented 7 months ago

Facing the same issue; it does not work with org.apache.hudi:hudi-spark3.3-bundle_2.12:0.14.1.

Hudi write to the path works and Hive sync works, but BQ sync does not.

For now I have taken this route, using a flag to manually perform the BQ sync with BigQuerySyncTool after the dataframe write, as sketched after the link below:

https://github.com/apache/hudi/issues/9355#issuecomment-1696764242
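A hedged sketch of that workaround, following the pattern from the linked comment. The property keys mirror the hoodie.gcp.bigquery.sync.* options above; getBigQueryProps and enableBqSync are illustrative names, and ds/hoodieConfigs are the objects from the earlier sketch:

// Write with only the Hudi/Hive options, then run BQ sync explicitly,
// gated behind an application-level flag.
import java.util.Properties
import org.apache.hudi.gcp.bigquery.BigQuerySyncTool
import org.apache.spark.sql.SaveMode

def getBigQueryProps: Properties = {
  val props = new Properties()
  props.setProperty("hoodie.gcp.bigquery.sync.project_id", "my-gcp-project")
  props.setProperty("hoodie.gcp.bigquery.sync.dataset_name", "my_dataset")
  props.setProperty("hoodie.gcp.bigquery.sync.table_name", "my_hudi_table")
  props.setProperty("hoodie.gcp.bigquery.sync.dataset_location", "us")
  props.setProperty("hoodie.gcp.bigquery.sync.source_uri", "gs://my-bucket/my_hudi_table/dt=*")
  props.setProperty("hoodie.gcp.bigquery.sync.source_uri_prefix", "gs://my-bucket/my_hudi_table/")
  props.setProperty("hoodie.gcp.bigquery.sync.base_path", "gs://my-bucket/my_hudi_table")
  props.setProperty("hoodie.gcp.bigquery.sync.partition_fields", "dt")
  props.setProperty("hoodie.gcp.bigquery.sync.use_bq_manifest_file", "true")
  props
}

val enableBqSync = true  // illustrative flag, not a Hudi config

ds.write.format("hudi")
  .options(hoodieConfigs)   // Hudi/Hive options only
  .mode(SaveMode.Append)
  .save("gs://my-bucket/my_hudi_table")

if (enableBqSync) {
  new BigQuerySyncTool(getBigQueryProps).syncHoodieTable()
}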

ad1happy2go commented 7 months ago

@abhishekshenoy @masthanmca That (https://github.com/apache/hudi/issues/9355#issuecomment-1696764242), i.e. running BigQuerySyncTool, is the correct way of doing BQ sync with batch jobs.

Another way is to do this with Hudi Streamer.

abhishekshenoy commented 7 months ago

@ad1happy2go @the-other-tim-brown

But shouldn't that be called internally when we are providing the Hudi BQ configs and enabling META_SYNC_ENABLED?

In my case we use df.write.options(hudiAndHiveAndBQConfigs).save(), and hudiAndHiveAndBQConfigs has both the Hive and BQ related configs.

*But still only Hive sync happens implicitly.*

Is it by design that, as part of our write function, we need to perform both of the following?

df.write.options(hudiAndHiveAndBQConfigs).save()
new BigQuerySyncTool(getBigQueryProps).syncHoodieTable()

ad1happy2go commented 7 months ago

@masthanmca @abhishekshenoy I went through the code and identified that we need to set both class names to make both meta syncs run together. The default value for the property below is just the Hive sync tool. I tried with Hudi version 0.14.1, and after the write and Hive sync completed, it ran the BigQuery sync as well.

"hoodie.meta.sync.client.tool.class" : "org.apache.hudi.hive.HiveSyncTool,org.apache.hudi.gcp.bigquery.BigQuerySyncTool"
ad1happy2go commented 7 months ago

@masthanmca Closing out this issue as I confirmed it works. Please reopen in case you still see this issue.