Closed ganga4reddy closed 6 months ago
Thanks for reporting the issue. From the error, the shard info for the latest committed batch is missing. Do you see the same error if `kinesis.metadataCommitterType` is set to `HDFS`?
Can you share the application log, the checkpoint path, and the data in DynamoDB (or HDFS) for troubleshooting?
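For reference, a minimal sketch of switching the metadata committer to HDFS. Only `kinesis.metadataCommitterType` comes from this thread; the other option names, the stream name, and the paths are placeholders/assumptions, so check them against the connector's README:

```python
# Hedged sketch: point the connector's metadata committer at HDFS-compatible
# storage instead of DynamoDB. Only kinesis.metadataCommitterType is taken
# from this thread; everything else below is an assumed placeholder.
kinesis_options = {
    "streamName": "my-stream",                                   # assumption
    "kinesis.region": "us-east-1",                               # assumption
    "kinesis.metadataCommitterType": "HDFS",
    "kinesis.metadataPath": "s3://some_bucket/kinesis-metadata", # assumption
}

# With a live SparkSession this would be wired up roughly as:
# df = spark.readStream.format("aws-kinesis").options(**kinesis_options).load()
```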
I got the same issue:
```
kinesisStream: <class 'pyspark.sql.dataframe.DataFrame'>
Traceback (most recent call last):
  File "/tmp/spark-6b195558-12b0-4563-b261-8731ff4ef459/play_events_sql.py", line 285, in
```
The code I'm using:
```python
(df.writeStream
   .format("parquet")
   .partitionBy("year", "month", "day", "hour")
   .option("checkpointLocation", "s3://some_bucket/folder")
   .option("path", "s3://another_bucket")
   .start())
```
A workaround is to remove the last commit at
I'll try to reproduce the issue.
@hwanghw if this helps, here is the scenario where I faced the issue:
checkpoint location: DynamoDB table
As a workaround, I switched to S3 as the checkpoint location, calling `batchDF.persist()` at the beginning of each batch read and `unpersist()` at the end of the loop. I am no longer facing the issue with S3.
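A sketch of that workaround as a `foreachBatch` handler. The persist/unpersist pattern is from the comment above; the handler name is illustrative and the sink path reuses the one from the snippet earlier in the thread:

```python
def process_batch(batch_df, batch_id):
    # Cache the micro-batch so every action below reads from the cache
    # instead of re-reading the Kinesis shards.
    batch_df.persist()
    try:
        # Run all actions on the cached DataFrame here.
        batch_df.write.mode("append").parquet("s3://another_bucket")
    finally:
        # Release the cache at the end of the batch.
        batch_df.unpersist()

# Wiring it into the query (sketch; checkpoint path from the thread):
# (df.writeStream
#    .foreachBatch(process_batch)
#    .option("checkpointLocation", "s3://some_bucket/folder")
#    .start())
```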
Thanks for the details. If there is more than one action running for the same KDS stream, e.g. `df.count` and `df.write`, the two actions can trigger two jobs which may update the metadata at the same time. This is described in the README: https://github.com/awslabs/spark-sql-kinesis-connector/blob/main/README.md#avoid-race-conditions
Hi, we are running a Kinesis stream read to read from Kinesis and write to S3 on an EMR instance. Intermittently, the job fails with the exception below when we start a new instance of the job after canceling the previous one. Any idea how to resolve the issue?
```
ERROR MicroBatchExecution: Query [id = ****, runId = ****] terminated with error
java.lang.IllegalStateException: Unable to fetch committed metadata from previous batch id 10. Some data may have been missed
    at org.apache.spark.sql.connector.kinesis.KinesisV2MicrobatchStream.latestOffset(KinesisV2MicrobatchStream.scala:231) ~[RawAggregation.jar:?]
    at org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$constructNextBatch$4(MicroBatchExecution.scala:489) ~[spark-sql_2.12-3.4.1-amzn-1.jar:3.4.1-amzn-1]
    at org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken(ProgressReporter.scala:411) ~[spark-sql_2.12-3.4.1-amzn-1.jar:3.4.1-amzn-1]
    at
```
regards GR