apache / hudi

Upserts, Deletes And Incremental Processing on Big Data.
https://hudi.apache.org/
Apache License 2.0

[SUPPORT] Unstable Flink-Hudi connector on scale #9808

Open fenil25 opened 1 year ago

fenil25 commented 1 year ago

Describe the problem you faced

Our pipeline ingests a changelog from Kafka into Flink and writes to a Hudi sink. We are observing a lot of instability with a few tables. One such table emits around 10K upserts per second (a very high update rate). We are running EMR 6.11.0 with Hudi 0.13.0 and Flink 1.16.0.

Initially, we tried a Copy On Write (COW) table but hit repeated checkpointing failures in Flink. The main culprit was the error "Checkpoint expired before completing". Increasing resources, decreasing the checkpoint interval, and increasing the checkpoint timeout did not help.

We then moved to a Merge On Read (MOR) table, but still saw errors such as:

java.lang.IllegalStateException: Receive an unexpected event for instant 20230912181658265 from task 7

Open Hudi issues like this suggest it is a multiple-writer problem. Setting the number of writers to 1 and then increasing the checkpoint timeout from 15 minutes to 60 minutes resolved it; checkpoints were taking around 30 minutes.

However, the main problem we now face is that whenever we change any config of the Flink pipeline, we have to re-bootstrap the table. This was not the case with the COW table. Bootstrapping is expensive for us and takes quite some time. If we do not bootstrap, we see the IllegalStateException again, or a FileAlreadyExistsException (for log files).

Our main questions are -
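For anyone reproducing this, the checkpoint tuning described above (timeout raised from 15 to 60 minutes) roughly corresponds to the following settings in the Flink SQL client. The `execution.checkpointing.*` keys are standard Flink configuration options; the interval value here is illustrative, not taken from the issue:

```sql
-- Standard Flink checkpointing keys (can also be set in flink-conf.yaml):
SET 'execution.checkpointing.interval' = '5min';  -- illustrative value
SET 'execution.checkpointing.timeout' = '60min';  -- raised from the 15min mentioned above
```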

To Reproduce

Steps to reproduce the behavior:

  1. On EMR, setup an ingestion from MySQL Debezium changelogs emitted to Kafka and finally consumed by Flink to create a Hudi table
  2. The scale of changelogs should be around 10K upserts/sec
  3. Hudi/Flink configs
    • MOR table
    • Single writer
    • Checkpoint configuration: (screenshot attached in the original issue)
    • hudi.clean.policy=KEEP_LATEST_BY_HOURS
    • --hudi.clean.retain_hours=168
    • All other configs are just default configurations
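The steps above can be sketched in Flink SQL, under the assumption that the changelog topic is in Debezium JSON format. Table names, the topic, the path, and the key column are placeholders; `table.type`, `write.tasks`, `clean.policy`, and `clean.retain_hours` are the Hudi Flink connector option spellings (the `hudi.`-prefixed names listed above appear to be prefixed equivalents of the same options):

```sql
-- Hypothetical Kafka source carrying the MySQL Debezium changelog.
CREATE TABLE orders_cdc (
  id BIGINT,
  amount DECIMAL(10, 2),
  updated_at TIMESTAMP(3),
  PRIMARY KEY (id) NOT ENFORCED
) WITH (
  'connector' = 'kafka',
  'topic' = 'orders',                        -- placeholder topic
  'properties.bootstrap.servers' = 'broker:9092',
  'format' = 'debezium-json'
);

-- Hudi MOR sink: single writer task plus the time-based clean policy.
CREATE TABLE orders_hudi (
  id BIGINT,
  amount DECIMAL(10, 2),
  updated_at TIMESTAMP(3),
  PRIMARY KEY (id) NOT ENFORCED
) WITH (
  'connector' = 'hudi',
  'path' = 's3://bucket/warehouse/orders_hudi',  -- placeholder path
  'table.type' = 'MERGE_ON_READ',
  'write.tasks' = '1',                       -- single writer, as described above
  'clean.policy' = 'KEEP_LATEST_BY_HOURS',
  'clean.retain_hours' = '168'
);

INSERT INTO orders_hudi SELECT * FROM orders_cdc;
```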

Environment Description

Stacktrace

Stack trace of the aforementioned errors -

danny0405 commented 1 year ago

Sorry for the issue @fenil25. The 0.13.0 release is a very buggy release; I'm wondering if you can try release 0.12.3 or 0.13.1 instead.

ad1happy2go commented 1 year ago

@fenil25 Were you able to try 0.12.3 or 0.13.1? Did you still face this issue?