Open jiangxinqi1995 opened 1 year ago
Did you see any error stack trace in the logging then, what time interval the ckp triggers with?
This is my Flink checkpoint configuration. I don't see any other useful log information
Your ckp options seem good, what is your partition path field then? How many partitions do you estimate that can be touched for one ckp write operation?
The table is partitioned by day; each write only touches the current day's partition.
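For context, a day-partitioned Hudi table in Flink SQL typically looks like the sketch below. The table name, columns, and paths are made up for illustration and are not from this issue; only the option keys are standard Hudi Flink options.

```sql
-- Hypothetical table; names and paths are illustrative only.
CREATE TABLE hudi_orders (
  order_id BIGINT,
  amount   DECIMAL(10, 2),
  ts       TIMESTAMP(3),
  dt       STRING,  -- partition path field, e.g. '2023-05-01'
  PRIMARY KEY (order_id) NOT ENFORCED
) PARTITIONED BY (dt) WITH (
  'connector' = 'hudi',
  'path' = 's3a://bucket/path/hudi_orders',
  'table.type' = 'MERGE_ON_READ',
  -- compaction runs asynchronously, triggered every N delta commits
  'compaction.async.enabled' = 'true',
  'compaction.delta_commits' = '5'
);
```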
I don't know why. At first compaction works and Parquet files are produced, but after two Parquet files have been written, subsequent files are never merged, and I see errors such as org.apache.flink.runtime.checkpoint.CheckpointException: Checkpoint expired before completing
or Exceeded checkpoint tolerable failure threshold.
Every time the Flink job is restarted, two checkpoints complete and then the next checkpoint fails. This phenomenon is very strange.
Sounds like an env problem or a record payload issue.
It may be an environmental issue, because the job performs one compaction operation and then never compacts again.
Did the compaction succeed or fail?
It succeeded once and a parquet file was generated; after a period of time a checkpoint exception occurred. From then on, no parquet files were generated, only log files were written, accompanied by checkpoint failures.
What kind of env did you run the Flink job on?
The Flink job is deployed on AWS EKS, with one jobmanager and one taskmanager.
The Flink images are self-built; this is my Dockerfile:
```dockerfile
FROM flink:1.15.3-scala_2.12-java11

WORKDIR /opt/flink

RUN wget -q -O /opt/flink/lib/hudi-flink1.15-bundle-0.13.0.jar https://repo.maven.apache.org/maven2/org/apache/hudi/hudi-flink1.15-bundle/0.13.0/hudi-flink1.15-bundle-0.13.0.jar && \
    wget -q -O /opt/flink/lib/hadoop-aws-3.2.1.jar https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/3.2.1/hadoop-aws-3.2.1.jar && \
    wget -q -O /opt/flink/lib/aws-java-sdk-bundle-1.11.874.jar https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/1.11.874/aws-java-sdk-bundle-1.11.874.jar && \
    wget -q -O /opt/flink/lib/mysql-connector-java-8.0.27.jar https://repo1.maven.org/maven2/mysql/mysql-connector-java/8.0.27/mysql-connector-java-8.0.27.jar && \
    wget -q -O /opt/flink/lib/flink-connector-jdbc-1.15.3.jar https://repo1.maven.org/maven2/org/apache/flink/flink-connector-jdbc/1.15.3/flink-connector-jdbc-1.15.3.jar && \
    mkdir /opt/flink/plugins/s3-fs-hadoop && \
    cp /opt/flink/opt/flink-s3-fs-hadoop-1.15.3.jar /opt/flink/plugins/s3-fs-hadoop/ && \
    wget -q -O /opt/hadoop-3.2.1.tar.gz https://archive.apache.org/dist/hadoop/common/hadoop-3.2.1/hadoop-3.2.1.tar.gz && \
    tar -zxvf /opt/hadoop-3.2.1.tar.gz -C /opt/ && \
    rm /opt/hadoop-3.2.1.tar.gz

ENV HADOOP_HOME=/opt/hadoop-3.2.1
ENV HADOOP_COMMON_HOME=$HADOOP_HOME
ENV PATH=$PATH:$HADOOP_HOME/bin
ENV HADOOP_CLASSPATH=${HADOOP_HOME}/etc/hadoop:${HADOOP_HOME}/share/hadoop/common/lib/*:${HADOOP_HOME}/share/hadoop/common/*:${HADOOP_HOME}/share/hadoop/hdfs:${HADOOP_HOME}/share/hadoop/hdfs/lib/*:${HADOOP_HOME}/share/hadoop/hdfs/*:${HADOOP_HOME}/share/hadoop/mapreduce/lib/*:${HADOOP_HOME}/share/hadoop/mapreduce/*:${HADOOP_HOME}/share/hadoop/yarn:${HADOOP_HOME}/share/hadoop/yarn/lib/*:${HADOOP_HOME}/share/hadoop/yarn/*
ENV CLASSPATH=$CLASSPATH:$HADOOP_CLASSPATH
```
I see, I guess it is because of the resources. Did you try allocating more resources to the job: slots/memory/parallelism?
It sounds like the checkpoint interval; try decreasing the checkpoint interval!
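The checkpoint knobs under discussion live in flink-conf.yaml (or the equivalent CheckpointConfig calls). A hedged sketch, with values chosen purely for illustration rather than taken from the reporter's setup:

```yaml
# Illustrative values, not the reporter's actual settings.
execution.checkpointing.interval: 5min
# "Checkpoint expired before completing" fires once a checkpoint exceeds this:
execution.checkpointing.timeout: 60min
# "Exceeded checkpoint tolerable failure threshold" fires after this many
# consecutive checkpoint failures:
execution.checkpointing.tolerable-failed-checkpoints: 3
execution.checkpointing.max-concurrent-checkpoints: 1
execution.checkpointing.min-pause: 1min
```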
I think it's because I set up automatic triggering of Flink savepoints, which causes a checkpoint failure after each savepoint. I removed the automatic savepoint trigger, and now everything is normal. I believe this was the problem, but I don't understand why this situation occurs. Has it happened before? Thank you. @danny0405 @c-f-cooper
Haven't learned the details of auto savepoint; my guess is it's caused by preemption of resources, since both compaction and savepoints have rigorous resource requirements.
I did a test with a 10-minute interval between the savepoint and the checkpoint. After the savepoint was triggered, the checkpoint still failed, even though resources were sufficient at the time.
I am facing the same problem. Tuning up resources resolves the issue, but I am still curious about the reason behind it.
I have a pipeline that has been running for days ATM, with a savepoint triggered by the k8s operator every 6 hours.
Two 4c/8g task managers are in use with a total of 8 slots.
Every checkpoint takes around 10 minutes, processing 100k records and touching 20+ partitions. And every checkpoint after a savepoint fails with: Checkpoint expired before completing (timeout: 60 min).
Could someone elaborate on what exactly is different about the process after a savepoint?
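For reference, the periodic savepoints mentioned above are usually configured on the Flink Kubernetes Operator side, roughly like the fragment below. This is a sketch; the deployment name and paths are hypothetical, only the option keys are standard.

```yaml
# FlinkDeployment fragment; name and bucket are illustrative only.
apiVersion: flink.apache.org/v1beta1
kind: FlinkDeployment
metadata:
  name: hudi-pipeline          # hypothetical name
spec:
  flinkConfiguration:
    # operator triggers a savepoint on this cadence
    kubernetes.operator.periodic.savepoint.interval: "6h"
    state.savepoints.dir: "s3a://bucket/savepoints"
```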
Nice findings, can you help to dig into the root cause @gfunc ?
We are experiencing this issue as well; we're also on Flink 1.15. Just curious, have you tried a newer version of Flink?
I am on Flink 1.16 and Hudi 0.13. I tried looking into the code but was not able to identify the exact task responsible (my example job was always stuck at the sink part).
Since we were using S3 as the checkpoint backend, we turned off auto savepoint and moved on to test other features, hoping that async Hudi table services (cleaning, etc.) might help in pinpointing the issue.
my example job was always stuck at the sink part
@gfunc Is it stuck during regular writing or when the auto savepoint kicks in? One thing to note is that the data flushing is kind of a blocking operation.
@danny0405 Sorry for any confusion, I meant the above-mentioned specific scenario: the first checkpoint after a savepoint. Normal checkpoints are relatively slow but OK, since we had a bad partition key that made the number of partitions a bit too high for this small setup.
the first checkpoint after savepoint
Got it, I need to figure out what the savepoint's effect on the ensuing checkpoint is.
Describe the problem you faced
A clear and concise description of the problem. "I use Flink CDC to read MySQL data and then write it to S3 through Hudi. I often encounter org.apache.flink.util.FlinkRuntimeException: Exceeded checkpoint tolerable failure threshold during checkpointing." "Typically, a checkpoint failure occurs every 20 minutes. I have no problems running on a local machine, but when I move to an EKS cluster, this problem occurs."
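Since the failure appears only on EKS with S3 and not locally, the checkpoint storage configuration is worth double-checking. A hedged flink-conf.yaml sketch; the bucket name is hypothetical and the values are illustrative only:

```yaml
# Illustrative only; replace bucket/paths with real values.
state.backend: rocksdb
state.checkpoints.dir: s3://my-bucket/checkpoints   # hypothetical bucket
state.savepoints.dir: s3://my-bucket/savepoints
execution.checkpointing.interval: 5min
```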
To Reproduce
Steps to reproduce the behavior:
Expected behavior
A clear and concise description of what you expected to happen.
Environment Description
Hudi version : 0.13.0
Flink version : 1.15.3
Hadoop version : 3.2.1
Storage (HDFS/S3/GCS..) : S3
Running on Docker? (yes/no) : yes, EKS
Additional context
Add any other context about the problem here.
Stacktrace
Add the stacktrace of the error.