apache / hudi

Upserts, Deletes And Incremental Processing on Big Data.
https://hudi.apache.org/
Apache License 2.0

[SUPPORT] Caused by: org.apache.hudi.exception.HoodieException: Executor executes action [commits the instant 20240618064120870] error #11466

Open ankit0811 opened 3 weeks ago

ankit0811 commented 3 weeks ago

We are trying to create a COW table using Kafka as our source and S3 as our sink. The source comprises a list of Kafka topics. Checkpoints currently run every 2 minutes, and when a checkpoint starts (i.e., the Hudi table files are committed), the Flink job throws the following exception:

2024-06-17 23:43:03
org.apache.flink.util.FlinkException: Global failure triggered by OperatorCoordinator for 'hoodie_append_write: <database_name>.<table_name>' (operator fa7c267a06b83c5e2dc3af13367ebe76).
    at org.apache.flink.runtime.operators.coordination.OperatorCoordinatorHolder$LazyInitializedCoordinatorContext.failJob(OperatorCoordinatorHolder.java:556)
    at org.apache.hudi.sink.StreamWriteOperatorCoordinator.lambda$start$0(StreamWriteOperatorCoordinator.java:196)
    at org.apache.hudi.sink.utils.NonThrownExecutor.handleException(NonThrownExecutor.java:142)
    at org.apache.hudi.sink.utils.NonThrownExecutor.lambda$wrapAction$0(NonThrownExecutor.java:133)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.base/java.lang.Thread.run(Unknown Source)
Caused by: org.apache.hudi.exception.HoodieException: Executor executes action [commits the instant 20240618064120870] error
    ... 6 more
Caused by: org.apache.hudi.exception.HoodieException: Commit instant [20240618064120870] failed and rolled back !
    at org.apache.hudi.sink.StreamWriteOperatorCoordinator.doCommit(StreamWriteOperatorCoordinator.java:587)
    at org.apache.hudi.sink.StreamWriteOperatorCoordinator.commitInstant(StreamWriteOperatorCoordinator.java:542)
    at org.apache.hudi.sink.StreamWriteOperatorCoordinator.lambda$notifyCheckpointComplete$2(StreamWriteOperatorCoordinator.java:258)
    at org.apache.hudi.sink.utils.NonThrownExecutor.lambda$wrapAction$0(NonThrownExecutor.java:130)
    ... 3 more
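
For context, the 2-minute cadence mentioned above is plain Flink checkpointing, which is what triggers the Hudi commit; a minimal sketch of that setup (variable names are ours, not from the thread):

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// Hudi's Flink writer commits the instant when a checkpoint completes,
// so this interval is effectively the table's commit cadence.
env.enableCheckpointing(2 * 60 * 1000); // every 2 minutes (milliseconds)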

This pipeline is the only one writing to the table, so we don't have multiple writers.

Below is the config used for this:

options.put(FlinkOptions.PATH.key(), hudiBasePath);
options.put(FlinkOptions.TABLE_NAME.key(), targetTable);
options.put(FlinkOptions.TABLE_TYPE.key(), HoodieTableType.COPY_ON_WRITE.name());
options.put(FlinkOptions.PRECOMBINE_FIELD.key(), "timestamp"); // need to fix this, as last_Update is null
options.put(FlinkOptions.IGNORE_FAILED.key(), "false");
options.put(HoodieIndexConfig.INDEX_TYPE.key(), HoodieIndex.IndexType.GLOBAL_BLOOM.name());
options.put(FlinkOptions.OPERATION.key(), WriteOperationType.INSERT.value());
options.put(FlinkOptions.PARTITION_PATH_FIELD.key(), "ts_date:string");
options.put("hoodie.parquet.small.file.limit", "104857600"); // 100 MiB
options.put("hoodie.parquet.max.file.size", "536870912");    // 512 MiB
options.put("clustering.schedule.enabled", "true");
options.put("clustering.async.enabled", "true");
options.put("hoodie.clustering.plan.strategy.max.bytes.per.group", "107374182400"); // 100 GiB
//options.put("hoodie.clustering.plan.strategy.max.num.groups", "1");
options.put("write.tasks", "1");

Environment Description

Based on some GitHub issue history, we did try deleting the .aux/ckp_meta/ dir, but still no luck.

Any pointers on how we go about fixing this would be much appreciated.

danny0405 commented 3 weeks ago

I see you set the option options.put(FlinkOptions.IGNORE_FAILED.key(), "false");. It looks like there is an error in the Parquet writers that is collected back to the coordinator, so the coordinator reports the error when committing.
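
In other words (our reading of FlinkOptions.IGNORE_FAILED, worth verifying against the javadoc for your Hudi version): with the flag at false, a write status carrying an error fails the checkpoint commit. Flipping it is at most a diagnostic step, since failed records would then be skipped rather than surfaced:

// Diagnostic only: tolerate per-record write failures instead of failing
// the whole commit; records that fail to write are dropped, not retried.
options.put(FlinkOptions.IGNORE_FAILED.key(), "true");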

ankit0811 commented 2 weeks ago

Hmm. I didn't find any relevant errors in the TaskManager logs.

Changed IGNORE_FAILED to true and it seems to be working, but I don't see any data in the Parquet files. They are all empty. Any idea how I should debug this further?
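
One way to check whether those commits actually recorded any written records, or recorded write errors, is to read the commit metadata off the table's timeline. A hedged sketch against Hudi's common client API (class name and argument handling are ours; the exact builder signature may differ by Hudi version):

import org.apache.hadoop.conf.Configuration;
import org.apache.hudi.common.model.HoodieCommitMetadata;
import org.apache.hudi.common.table.HoodieTableMetaClient;
import org.apache.hudi.common.table.timeline.HoodieInstant;
import org.apache.hudi.common.table.timeline.HoodieTimeline;
import org.apache.hudi.common.util.Option;

public class InspectLastCommit {
  public static void main(String[] args) throws Exception {
    String basePath = args[0]; // the table's base path, e.g. an s3a:// URI

    HoodieTableMetaClient metaClient = HoodieTableMetaClient.builder()
        .setConf(new Configuration())
        .setBasePath(basePath)
        .build();

    HoodieTimeline commits = metaClient.getActiveTimeline()
        .getCommitsTimeline()
        .filterCompletedInstants();

    Option<HoodieInstant> last = commits.lastInstant();
    if (!last.isPresent()) {
      System.out.println("No completed commits on the timeline.");
      return;
    }

    // Zero recordsWritten alongside non-zero writeErrors would explain
    // the empty Parquet files.
    HoodieCommitMetadata meta = HoodieCommitMetadata.fromBytes(
        commits.getInstantDetails(last.get()).get(), HoodieCommitMetadata.class);
    System.out.printf("instant=%s recordsWritten=%d writeErrors=%d%n",
        last.get().getTimestamp(),
        meta.fetchTotalRecordsWritten(),
        meta.fetchTotalWriteErrors());
  }
}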

ankit0811 commented 2 weeks ago

Also getting this exception in the same job:

22:51:20.093 [pool-286-thread-1] ERROR org.apache.hudi.common.table.view.PriorityBasedFileSystemView - Got error running preferred function. Trying secondary
org.apache.hudi.exception.HoodieRemoteException: Connect to 10.3.175.136:45105 [/10.3.175.136] failed: Connection refused (Connection refused)
    at org.apache.hudi.common.table.view.RemoteHoodieTableFileSystemView.getPendingCompactionOperations(RemoteHoodieTableFileSystemView.java:547) ~[blob_p-49658f01166115a275763af3064121d9588b4f90-c0a8066fe9067407b6e6321b83226118:0.0.1-SNAPSHOT]
    at org.apache.hudi.common.table.view.PriorityBasedFileSystemView.execute(PriorityBasedFileSystemView.java:69) ~[blob_p-49658f01166115a275763af3064121d9588b4f90-c0a8066fe9067407b6e6321b83226118:0.0.1-SNAPSHOT]
    at org.apache.hudi.common.table.view.PriorityBasedFileSystemView.getPendingCompactionOperations(PriorityBasedFileSystemView.java:257) ~[blob_p-49658f01166115a275763af3064121d9588b4f90-c0a8066fe9067407b6e6321b83226118:0.0.1-SNAPSHOT]
    at org.apache.hudi.table.action.clean.CleanPlanner.<init>(CleanPlanner.java:98) ~[blob_p-49658f01166115a275763af3064121d9588b4f90-c0a8066fe9067407b6e6321b83226118:0.0.1-SNAPSHOT]
    at org.apache.hudi.table.action.clean.CleanPlanActionExecutor.requestClean(CleanPlanActionExecutor.java:107) ~[blob_p-49658f01166115a275763af3064121d9588b4f90-c0a8066fe9067407b6e6321b83226118:0.0.1-SNAPSHOT]
    at org.apache.hudi.table.action.clean.CleanPlanActionExecutor.requestClean(CleanPlanActionExecutor.java:159) ~[blob_p-49658f01166115a275763af3064121d9588b4f90-c0a8066fe9067407b6e6321b83226118:0.0.1-SNAPSHOT]
    at org.apache.hudi.table.action.clean.CleanPlanActionExecutor.execute(CleanPlanActionExecutor.java:185) ~[blob_p-49658f01166115a275763af3064121d9588b4f90-c0a8066fe9067407b6e6321b83226118:0.0.1-SNAPSHOT]
    at org.apache.hudi.table.HoodieFlinkCopyOnWriteTable.scheduleCleaning(HoodieFlinkCopyOnWriteTable.java:359) ~[blob_p-49658f01166115a275763af3064121d9588b4f90-c0a8066fe9067407b6e6321b83226118:0.0.1-SNAPSHOT]
    at org.apache.hudi.client.BaseHoodieTableServiceClient.scheduleTableServiceInternal(BaseHoodieTableServiceClient.java:629) ~[blob_p-49658f01166115a275763af3064121d9588b4f90-c0a8066fe9067407b6e6321b83226118:0.0.1-SNAPSHOT]
    at org.apache.hudi.client.BaseHoodieTableServiceClient.clean(BaseHoodieTableServiceClient.java:752) ~[blob_p-49658f01166115a275763af3064121d9588b4f90-c0a8066fe9067407b6e6321b83226118:0.0.1-SNAPSHOT]
    at org.apache.hudi.client.BaseHoodieWriteClient.clean(BaseHoodieWriteClient.java:862) ~[blob_p-49658f01166115a275763af3064121d9588b4f90-c0a8066fe9067407b6e6321b83226118:0.0.1-SNAPSHOT]
    at org.apache.hudi.client.BaseHoodieWriteClient.clean(BaseHoodieWriteClient.java:835) ~[blob_p-49658f01166115a275763af3064121d9588b4f90-c0a8066fe9067407b6e6321b83226118:0.0.1-SNAPSHOT]
    at org.apache.hudi.sink.CleanFunction.lambda$open$0(CleanFunction.java:71) ~[blob_p-49658f01166115a275763af3064121d9588b4f90-c0a8066fe9067407b6e6321b83226118:0.0.1-SNAPSHOT]
    at org.apache.hudi.sink.utils.NonThrownExecutor.lambda$wrapAction$0(NonThrownExecutor.java:130) ~[blob_p-49658f01166115a275763af3064121d9588b4f90-c0a8066fe9067407b6e6321b83226118:0.0.1-SNAPSHOT]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) [?:?]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) [?:?]
    at java.lang.Thread.run(Unknown Source) [?:?]
Caused by: org.apache.hudi.org.apache.http.conn.HttpHostConnectException: Connect to 10.3.175.136:45105 [/10.3.175.136] failed: Connection refused (Connection refused)
    at org.apache.hudi.org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:151) ~[blob_p-49658f01166115a275763af3064121d9588b4f90-c0a8066fe9067407b6e6321b83226118:0.0.1-SNAPSHOT]
    at org.apache.hudi.org.apache.http.impl.conn.PoolingHttpClientConnectionManager.connect(PoolingHttpClientConnectionManager.java:353) ~[blob_p-49658f01166115a275763af3064121d9588b4f90-c0a8066fe9067407b6e6321b83226118:0.0.1-SNAPSHOT]
    at org.apache.hudi.org.apache.http.impl.execchain.MainClientExec.establishRoute(MainClientExec.java:380) ~[blob_p-49658f01166115a275763af3064121d9588b4f90-c0a8066fe9067407b6e6321b83226118:0.0.1-SNAPSHOT]
    at org.apache.hudi.org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:236) ~[blob_p-49658f01166115a275763af3064121d9588b4f90-c0a8066fe9067407b6e6321b83226118:0.0.1-SNAPSHOT]
    at org.apache.hudi.org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:184) ~[blob_p-49658f01166115a275763af3064121d9588b4f90-c0a8066fe9067407b6e6321b83226118:0.0.1-SNAPSHOT]
    at org.apache.hudi.org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:88) ~[blob_p-49658f01166115a275763af3064121d9588b4f90-c0a8066fe9067407b6e6321b83226118:0.0.1-SNAPSHOT]
    at org.apache.hudi.org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:110) ~[blob_p-49658f01166115a275763af3064121d9588b4f90-c0a8066fe9067407b6e6321b83226118:0.0.1-SNAPSHOT]
    at org.apache.hudi.org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:184) ~[blob_p-49658f01166115a275763af3064121d9588b4f90-c0a8066fe9067407b6e6321b83226118:0.0.1-SNAPSHOT]
    at org.apache.hudi.org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82) ~[blob_p-49658f01166115a275763af3064121d9588b4f90-c0a8066fe9067407b6e6321b83226118:0.0.1-SNAPSHOT]
    at org.apache.hudi.org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:107) ~[blob_p-49658f01166115a275763af3064121d9588b4f90-c0a8066fe9067407b6e6321b83226118:0.0.1-SNAPSHOT]
    at org.apache.hudi.org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:55) ~[blob_p-49658f01166115a275763af3064121d9588b4f90-c0a8066fe9067407b6e6321b83226118:0.0.1-SNAPSHOT]
    at org.apache.hudi.org.apache.http.client.fluent.Request.execute(Request.java:151) ~[blob_p-49658f01166115a275763af3064121d9588b4f90-c0a8066fe9067407b6e6321b83226118:0.0.1-SNAPSHOT]
    at org.apache.hudi.common.table.view.RemoteHoodieTableFileSystemView.get(RemoteHoodieTableFileSystemView.java:629) ~[blob_p-49658f01166115a275763af3064121d9588b4f90-c0a8066fe9067407b6e6321b83226118:0.0.1-SNAPSHOT]
    at org.apache.hudi.common.table.view.RemoteHoodieTableFileSystemView.executeRequest(RemoteHoodieTableFileSystemView.java:207) ~[blob_p-49658f01166115a275763af3064121d9588b4f90-c0a8066fe9067407b6e6321b83226118:0.0.1-SNAPSHOT]
    at org.apache.hudi.common.table.view.RemoteHoodieTableFileSystemView.getPendingCompactionOperations(RemoteHoodieTableFileSystemView.java:543) ~[blob_p-49658f01166115a275763af3064121d9588b4f90-c0a8066fe9067407b6e6321b83226118:0.0.1-SNAPSHOT]
    ... 16 more
Caused by: java.net.ConnectException: Connection refused (Connection refused)
    at java.net.PlainSocketImpl.socketConnect(Native Method) ~[?:?]
    at java.net.AbstractPlainSocketImpl.doConnect(Unknown Source) ~[?:?]
    at java.net.AbstractPlainSocketImpl.connectToAddress(Unknown Source) ~[?:?]
    at java.net.AbstractPlainSocketImpl.connect(Unknown Source) ~[?:?]
    at java.net.SocksSocketImpl.connect(Unknown Source) ~[?:?]
    at java.net.Socket.connect(Unknown Source) ~[?:?]
    at org.apache.hudi.org.apache.http.conn.socket.PlainConnectionSocketFactory.connectSocket(PlainConnectionSocketFactory.java:74) ~[blob_p-49658f01166115a275763af3064121d9588b4f90-c0a8066fe9067407b6e6321b83226118:0.0.1-SNAPSHOT]
    at org.apache.hudi.org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:134) ~[blob_p-49658f01166115a275763af3064121d9588b4f90-c0a8066fe9067407b6e6321b83226118:0.0.1-SNAPSHOT]
    at org.apache.hudi.org.apache.http.impl.conn.PoolingHttpClientConnectionManager.connect(PoolingHttpClientConnectionManager.java:353) ~[blob_p-49658f01166115a275763af3064121d9588b4f90-c0a8066fe9067407b6e6321b83226118:0.0.1-SNAPSHOT]
    at org.apache.hudi.org.apache.http.impl.execchain.MainClientExec.establishRoute(MainClientExec.java:380) ~[blob_p-49658f01166115a275763af3064121d9588b4f90-c0a8066fe9067407b6e6321b83226118:0.0.1-SNAPSHOT]
    at org.apache.hudi.org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:236) ~[blob_p-49658f01166115a275763af3064121d9588b4f90-c0a8066fe9067407b6e6321b83226118:0.0.1-SNAPSHOT]
    at org.apache.hudi.org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:184) ~[blob_p-49658f01166115a275763af3064121d9588b4f90-c0a8066fe9067407b6e6321b83226118:0.0.1-SNAPSHOT]
    at org.apache.hudi.org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:88) ~[blob_p-49658f01166115a275763af3064121d9588b4f90-c0a8066fe9067407b6e6321b83226118:0.0.1-SNAPSHOT]
    at org.apache.hudi.org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:110) ~[blob_p-49658f01166115a275763af3064121d9588b4f90-c0a8066fe9067407b6e6321b83226118:0.0.1-SNAPSHOT]
    at org.apache.hudi.org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:184) ~[blob_p-49658f01166115a275763af3064121d9588b4f90-c0a8066fe9067407b6e6321b83226118:0.0.1-SNAPSHOT]
    at org.apache.hudi.org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82) ~[blob_p-49658f01166115a275763af3064121d9588b4f90-c0a8066fe9067407b6e6321b83226118:0.0.1-SNAPSHOT]
    at org.apache.hudi.org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:107) ~[blob_p-49658f01166115a275763af3064121d9588b4f90-c0a8066fe9067407b6e6321b83226118:0.0.1-SNAPSHOT]
    at org.apache.hudi.org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:55) ~[blob_p-49658f01166115a275763af3064121d9588b4f90-c0a8066fe9067407b6e6321b83226118:0.0.1-SNAPSHOT]
    at org.apache.hudi.org.apache.http.client.fluent.Request.execute(Request.java:151) ~[blob_p-49658f01166115a275763af3064121d9588b4f90-c0a8066fe9067407b6e6321b83226118:0.0.1-SNAPSHOT]
    at org.apache.hudi.common.table.view.RemoteHoodieTableFileSystemView.get(RemoteHoodieTableFileSystemView.java:629) ~[blob_p-49658f01166115a275763af3064121d9588b4f90-c0a8066fe9067407b6e6321b83226118:0.0.1-SNAPSHOT]
    at org.apache.hudi.common.table.view.RemoteHoodieTableFileSystemView.executeRequest(RemoteHoodieTableFileSystemView.java:207) ~[blob_p-49658f01166115a275763af3064121d9588b4f90-c0a8066fe9067407b6e6321b83226118:0.0.1-SNAPSHOT]
    at org.apache.hudi.common.table.view.RemoteHoodieTableFileSystemView.getPendingCompactionOperations(RemoteHoodieTableFileSystemView.java:543) ~[blob_p-49658f01166115a275763af3064121d9588b4f90-c0a8066fe9067407b6e6321b83226118:0.0.1-SNAPSHOT]
    ... 16 more

danny0405 commented 2 weeks ago

Changed IGNORE_FAILED to true and it seems to be working, but I don't see any data in the Parquet files. They are all empty. Any idea how I should debug this further?

The error is caught by the writer, and when an error occurs, it logs an error message:

LOG.error("Error writing record " + record, t);
ankit0811 commented 2 weeks ago

The error is caught by the writer, and when an error occurs, it logs an error message:

LOG.error("Error writing record " + record, t);

I don't think it was logging this anywhere. I tried running locally as well as on our k8s Flink cluster.

I was finally able to resolve this: the culprit was a data type mismatch in one of the columns. Not sure why this was not explicitly logged anywhere, though.
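
A hedged note, not something suggested in the thread: mismatches like this can sometimes be surfaced earlier by enabling Hudi's Avro schema validation on the writer:

// Assumption: hoodie.avro.schema.validate checks incoming record schemas
// against the table schema and fails fast on incompatibility.
options.put("hoodie.avro.schema.validate", "true");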

danny0405 commented 2 weeks ago

Maybe you should check your logger configuration.
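
For reference, that check amounts to making sure Hudi loggers are not filtered out on the TaskManagers. In Flink's default conf/log4j.properties (log4j2 properties syntax), something like:

# Ensure Hudi writer errors such as "Error writing record ..." reach the TaskManager logs.
logger.hudi.name = org.apache.hudi
logger.hudi.level = INFO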