Describe the problem you faced
When writing files to S3 after a batch read from Kafka, the "Tagging" step takes around 2 hours to finish while the EMR cluster appears to be almost idle.
Only two executors seem to be doing all the tasks (I don't know whether this is part of the problem).
This is running on AWS EMR with this setup:
It does not look like a memory problem.
spark-config
HUDI .write Options
These are the options I'm using for the HUDI write:
I've marked with "<.....>" the values I've manually replaced for privacy reasons.
Versions
Hudi version :
Spark version : 3.2.1
Hive version : hudi-spark3.2-bundle_2.12:0.11.0
Hadoop version : org.apache.hadoop:hadoop-aws:3.2.1
Storage (HDFS/S3/GCS..) : S3
Running on Docker? (yes/no) : no
Thank you for any insights, and let me know if you require any extra information 🙂