apache / hudi

Upserts, Deletes And Incremental Processing on Big Data.
https://hudi.apache.org/
Apache License 2.0

The BigQuerySyncTool can't work well when the hudi table schema changed [SUPPORT] #10829

Closed steve-xi-awx closed 4 months ago

steve-xi-awx commented 4 months ago

To Reproduce

Steps to reproduce the behavior:

  1. Use BigQuerySyncTool to sync a Hudi table into BigQuery as an external table with a connection ID.
  2. Update the Hudi table schema. The next attempt to sync the Hudi table metadata to the BigQuery table fails with the error shown in the stacktrace.

Stacktrace

error when sync to BigQuery, will ignore this error and continue to process next batch, error is An error occurred while calling o2983.syncHoodieTable.
: com.google.cloud.bigquery.BigQueryException: Schema can be specified only on the Table.Schema field for external tables with an associated connection_id but schema was provided on Table.Externaldataconfig.Schema.
    at com.google.cloud.bigquery.spi.v2.HttpBigQueryRpc.translate(HttpBigQueryRpc.java:115)
    at com.google.cloud.bigquery.spi.v2.HttpBigQueryRpc.patch(HttpBigQueryRpc.java:271)
    at com.google.cloud.bigquery.BigQueryImpl$15.call(BigQueryImpl.java:673)
    at com.google.cloud.bigquery.BigQueryImpl$15.call(BigQueryImpl.java:670)
    at com.google.api.gax.retrying.DirectRetryingExecutor.submit(DirectRetryingExecutor.java:103)
    at com.google.cloud.RetryHelper.run(RetryHelper.java:76)
    at com.google.cloud.RetryHelper.runWithRetries(RetryHelper.java:50)
    at com.google.cloud.bigquery.BigQueryImpl.update(BigQueryImpl.java:669)
    at org.apache.hudi.gcp.bigquery.HoodieBigQuerySyncClient.updateTableSchema(HoodieBigQuerySyncClient.java:206)
    at org.apache.hudi.gcp.bigquery.BigQuerySyncTool.syncTable(BigQuerySyncTool.java:147)
    at org.apache.hudi.gcp.bigquery.BigQuerySyncTool.syncHoodieTable(BigQuerySyncTool.java:111)
    at sun.reflect.GeneratedMethodAccessor817.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:374)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.ClientServerConnection.sendCommand(ClientServerConnection.java:244)
    at py4j.CallbackClient.sendCommand(CallbackClient.java:384)
    at py4j.CallbackClient.sendCommand(CallbackClient.java:356)
    at py4j.reflection.PythonProxyHandler.invoke(PythonProxyHandler.java:106)

ad1happy2go commented 4 months ago

@steve-xi-awx Thanks a lot for raising this. What kind of schema change is happening? Can you post the writer configuration and the BigQuery sync configuration? I tried adding a new column, and it ran without any exception. Can you check the code below and let me know if I am missing anything? You can also take this code and try to reproduce the problem with a sample dataset.

https://gist.github.com/ad1happy2go/17b32db63f68b49813c8430967a99ec8

steve-xi-awx commented 4 months ago

I have raised a PR for this issue, and the change seems to fix it: https://github.com/apache/hudi/pull/10830. I think the problem is that, for an external table in BigQuery with a connection ID, the sync tool specifies the table schema in the wrong position. Your sample didn't specify the connection ID, so the table is still a simple external table, not a BigLake table. This problem occurs in release-0.14.1.
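To illustrate the schema placement that the error message describes, here is a minimal Python sketch that builds a JSON body for a BigQuery `tables.patch` request. It is not the change made in the PR. The helper name `build_table_patch`, the `PARQUET` source format, and the example values are assumptions for illustration only; the field names `schema`, `externalDataConfiguration`, `connectionId`, `sourceUris`, and `sourceFormat` come from the BigQuery REST `Table` resource.

```python
from typing import Any, Dict, List, Optional


def build_table_patch(
    fields: List[Dict[str, Any]],
    source_uris: List[str],
    connection_id: Optional[str] = None,
) -> Dict[str, Any]:
    """Build an illustrative tables.patch body (hypothetical helper)."""
    external_config: Dict[str, Any] = {
        "sourceFormat": "PARQUET",  # assumed format for this sketch
        "sourceUris": source_uris,
    }
    body: Dict[str, Any] = {"externalDataConfiguration": external_config}
    schema = {"fields": fields}

    if connection_id:
        # Table with a connection ID: per the error message, the schema
        # may only be set on the top-level Table.Schema field.
        external_config["connectionId"] = connection_id
        body["schema"] = schema
    else:
        # Plain external table: the schema sits on the external data
        # configuration, which is where the failing update placed it.
        external_config["schema"] = schema
    return body


# Hypothetical values, not taken from the issue.
new_fields = [
    {"name": "id", "type": "STRING"},
    {"name": "new_col", "type": "INTEGER"},
]
patch = build_table_patch(
    new_fields,
    ["gs://example-bucket/hudi_table/*"],
    connection_id="projects/p/locations/us/connections/c",
)
```

With a connection ID, `patch` carries the schema only at the top level; without one, the schema stays under `externalDataConfiguration`, which matches the case where the sample ran without an exception.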

steve-xi-awx commented 4 months ago

@ad1happy2go Can you help review this PR?

danny0405 commented 4 months ago

> @ad1happy2go Can you help review this PR?

I will take it.

ad1happy2go commented 4 months ago

Thanks @steve-xi-awx for the fix. Thanks @danny0405. Tracking JIRA: https://issues.apache.org/jira/browse/HUDI-7488