sgcisco opened this issue 7 months ago
Thanks for raising this @sgcisco. I noticed you are using compaction num.delta.commits set to 1. Is there any reason for that? If we need to compact after every commit, then we are better off using a COW table itself. Another possible reason is that the ingestion job is starved of resources because the async compaction job is consuming them. Did you analyse the Spark UI? Which stage started taking more time?
@ad1happy2go thanks for your reply. We tried compaction num.delta.commits set to 1 in one of the tests; for the other runs, and for what we are trying now, it is the default value, which is 5.
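For reference, this is roughly where those knobs sit on the writer (a minimal sketch with Hudi 0.13.x option names; the values are illustrative, not our exact production config):

```python
# Illustrative MOR compaction options (Hudi 0.13.x key names); example values only,
# not the actual production configuration discussed in this issue.
hudi_compaction_options = {
    "hoodie.datasource.write.table.type": "MERGE_ON_READ",
    "hoodie.compact.inline": "false",                     # do not compact inline with each commit
    "hoodie.datasource.compaction.async.enable": "true",  # run compaction as an async job instead
    "hoodie.compact.inline.max.delta.commits": "5",       # trigger compaction after N delta commits (default 5)
}
```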
As another test we ran the pipeline over several days with a lower ingestion rate of 600 KB/s and the same Hudi and Spark configuration as above.
The most time-consuming stage is Building workload profile, which takes 2.5-12 min, with an average of around 7 min.
Over 3 days, partition latencies look as follows:
So in this case it is around 35-40 MB per minute (i.e. per Structured Streaming minibatch, since the batch interval is 1 min), and workers can go up to 35 GB of memory and 32 cores. Does that look like a sufficient resource configuration for Hudi to handle such a load?
@sgcisco What is the nature of your record key? Is it a random id? Building workload profile does the index lookup, which is basically a join between the existing data and the incremental data to identify which records need to be updated or inserted. Are you seeing disk spill during this operation? You can try increasing the executor memory to avoid it.
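For anyone landing here, a minimal sketch of the knobs usually relevant to this lookup; the memory sizes and index settings below are illustrative assumptions, not values confirmed for this setup:

```python
from pyspark.sql import SparkSession

# Example of bumping executor memory/overhead on the session used for the streaming job
# (sizes are placeholders; actual values depend on the cluster).
spark = (
    SparkSession.builder
    .appName("hudi-streaming-ingest")
    .config("spark.executor.memory", "16g")
    .config("spark.executor.memoryOverhead", "4g")
    .getOrCreate()
)

# Index options that drive the cost of the tagging/index lookup (defaults shown; illustrative only).
hudi_index_options = {
    "hoodie.index.type": "BLOOM",                  # default index type
    "hoodie.bloom.index.prune.by.ranges": "true",  # range pruning helps with mostly-ordered keys
}
```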
@ad1happy2go the record key looks like record_keys=["timestamp", "A", "B", "C"], where timestamp is monotonically increasing in ms, A is a string with a range of some 500k values, B is similar to A, and C has at most a hundred values.
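For clarity, a minimal sketch of how such a composite key is declared on the writer (A, B, C are the placeholder column names used above):

```python
# Sketch of declaring a composite record key on the Hudi writer
# (A, B, C are placeholder column names from the description above).
hudi_key_options = {
    "hoodie.datasource.write.recordkey.field": "timestamp,A,B,C",
    "hoodie.datasource.write.keygenerator.class": "org.apache.hudi.keygen.ComplexKeyGenerator",
    "hoodie.datasource.write.precombine.field": "timestamp",
}
```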
We use upsert, which is the default operation, but we don't expect any updates on the inserted values. We tried insert, but the observed latencies were worse.
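For reference, the operation is switched with a single writer option (a sketch; upsert is the default):

```python
# The write operation is selected via one option; "upsert" is the default and performs
# the index lookup, while "insert" skips tagging but handles small files/dedup differently.
write_operation_option = {
    "hoodie.datasource.write.operation": "upsert",  # alternatives: "insert", "bulk_insert"
}
```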
Increasing the partitioning granularity from daily to hourly seems to help decrease latencies but does not solve the problem completely. In this case the partition size goes down from 100 GB to 4.7 GB.
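A minimal sketch of how the hourly partition column can be derived (df and event_ts are placeholders for the incoming micro-batch DataFrame and its event-time column):

```python
from pyspark.sql import functions as F

# Derive an hourly partition column instead of a daily one. `event_ts` is assumed to be a
# timestamp-typed column; if the source field is epoch milliseconds, convert it first,
# e.g. (F.col("timestamp") / 1000).cast("timestamp").
df_hourly = df.withColumn("partition_date", F.date_format(F.col("event_ts"), "yyyy-MM-dd-HH"))

hudi_partition_option = {
    "hoodie.datasource.write.partitionpath.field": "partition_date",
}
```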
"Are you seeing disk spill during this operation? You can try increasing the executor memory to avoid it."
No, we did not see any disk spill over a 15h running job.
In this case, with the low ingestion rate of ~600 KB/s and hourly partitions, the Spark Structured Streaming Operation duration / Batch duration grows each time towards the end of the hour, which looks similar to the latencies in the written data.
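For completeness, this is roughly how we read those per-batch timings (a sketch; query is the running StreamingQuery handle and the field names follow the standard StreamingQueryProgress JSON):

```python
# Read the per-batch timings from the streaming query's progress report.
progress = query.lastProgress
if progress:
    # durationMs breaks the batch down into steps, e.g. "addBatch" is the write/commit step.
    print(progress["batchId"], progress["durationMs"])
```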
@sgcisco Sorry for the delayed response here. Were you able to find the root cause? What was your batch interval in this case?
@ad1happy2go no, we could not figure out a working configuration. The batch interval is 1 min, as mentioned in my original message: "It did not reveal any delays above 2 min, where a 1 min delay is always present due to Structured Streaming minibatch granularity."
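For reference, a minimal sketch of such a 1-minute trigger on the Hudi sink (df, hudi_options, and the S3 paths are placeholders, not the actual job definition):

```python
# Minimal sketch of a 1-minute micro-batch trigger writing to a Hudi table.
query = (
    df.writeStream.format("hudi")
    .options(**hudi_options)
    .option("checkpointLocation", "s3://bucket/checkpoints/my_table")
    .outputMode("append")
    .trigger(processingTime="1 minute")
    .start("s3://bucket/hudi/my_table")
)
```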
Describe the problem you faced
Running a streaming solution with Kafka - Structured Streaming (PySpark) - Hudi (MOR tables) + AWS Glue + S3, we observed periodically growing latencies in data availability in Hudi. Latencies were measured as the difference between the data generation timestamp and _hudi_commit_timestamp, and could go up to 30 min. Periodic manual checks of the timestamps of the latest available data points, by running queries as described in https://hudi.apache.org/docs/0.13.1/querying_data#spark-snap-query, confirmed such delays. When using Spark with Hudi, the data read-out rate from Kafka was unstable.
To exclude the impact of any component other than Hudi, we ran experiments with the same configuration and ingestion settings but without Hudi, writing directly to S3. These did not reveal any delays above 2 min, where a 1 min delay is always present due to Structured Streaming minibatch granularity. In this case the Kafka read-out rate was stable over time.
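For reference, a minimal PySpark sketch of this kind of latency measurement via a snapshot query; it assumes Hudi's standard metadata column _hoodie_commit_time and a placeholder event-time column event_ts, and the table path is illustrative:

```python
from pyspark.sql import functions as F

# Snapshot-read the table and compare the event-time column with Hudi's commit-time
# metadata column. Adjust the commit-time pattern to your Hudi version, which may
# append milliseconds to the yyyyMMddHHmmss prefix.
snapshot = spark.read.format("hudi").load("s3://bucket/hudi/my_table")

latency = (
    snapshot
    .withColumn("commit_ts",
                F.to_timestamp(F.col("_hoodie_commit_time").substr(1, 14), "yyyyMMddHHmmss"))
    .withColumn("latency_sec",
                F.col("commit_ts").cast("long") - F.col("event_ts").cast("long"))
)

latency.selectExpr("percentile_approx(latency_sec, 0.9) AS p90_latency_sec").show()
```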
Additional context
What we tried
We could get a target file size between 90-120 MB by lowering hoodie.copyonwrite.record.size.estimate from 1024 to 100 and using inline.compact=false, delta.commits=1, async.compact=true, and hoodie.merge.small.file.group.candidates.limit=20, but it did not have any impact on latency. Setting the compaction trigger strategy to NUM_OR_TIME, as suggested in https://github.com/apache/hudi/issues/8975#issuecomment-1593408753, with the parameters below did not help to resolve the problem either.
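A sketch of those attempted options gathered in one place (Hudi 0.13.x key names; illustrative only, not the full production configuration):

```python
# Options mentioned above, collected for readability; values mirror the description.
attempted_options = {
    "hoodie.copyonwrite.record.size.estimate": "100",         # lowered from the default 1024
    "hoodie.merge.small.file.group.candidates.limit": "20",
    "hoodie.compact.inline": "false",
    "hoodie.datasource.compaction.async.enable": "true",
    "hoodie.compact.inline.trigger.strategy": "NUM_OR_TIME",  # trigger on commits or elapsed time
    "hoodie.compact.inline.max.delta.commits": "1",
    "hoodie.compact.inline.max.delta.seconds": "300",         # example value for the time trigger
}
```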
Current settings
As a trade-off we came up with the configuration below, which allows us to have relatively low latencies at the 90th percentile and a file size of 40-90 MB, but some records could still take up to 30 min. This last config works relatively well for low ingestion rates up to 1.5 MB/s with daily partitioning (partition_date=yyyy-MM-dd/), but stops working for rates above 2.5 MB/s even with the more granular partitioning partition_date=yyyy-MM-dd-HH/.
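For context, the Parquet file-sizing knobs that interact with the 40-90 MB file sizes above look like this (example values in bytes; illustrative, not our exact settings):

```python
# Parquet file-sizing options that govern target file size and small-file bin-packing.
parquet_sizing_options = {
    "hoodie.parquet.max.file.size": str(120 * 1024 * 1024),    # target max base file size
    "hoodie.parquet.small.file.limit": str(100 * 1024 * 1024), # files below this get more records packed in
}
```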
Expected behavior
Since we use MOR tables:
Environment Description
Hudi version : 0.13.1
Spark version : 3.4.1
Hive version : 3.1
Hadoop version : EMR 6.13
Storage (HDFS/S3/GCS..) : S3
Running on Docker? (yes/no) : No
Hudi configuration
Spark configuration