rubenssoto opened this issue 2 years ago
Are there multiple writers? I am wondering whether the commits after time 20221122225538334 went through.
If you could share your configs and steps to reproduce, that would be great. I understand it may not be easy to reproduce; in that case, the timeline would help.
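In case it helps, here is a minimal sketch of one way to dump the completed instants from PySpark by scanning the table's .hoodie folder. This is only a sketch: the table path in the usage line is a placeholder, and it assumes an active SparkSession named spark.

def list_completed_instants(spark, table_path):
    # Point at the table's .hoodie directory using the Hadoop FileSystem API via py4j.
    jvm = spark.sparkContext._jvm
    conf = spark.sparkContext._jsc.hadoopConfiguration()
    hoodie_dir = jvm.org.apache.hadoop.fs.Path(table_path + "/.hoodie")
    fs = hoodie_dir.getFileSystem(conf)
    instants = []
    for status in fs.listStatus(hoodie_dir):
        name = status.getPath().getName()
        # Completed instants end with .commit (COW) or .deltacommit (MOR).
        if name.endswith(".commit") or name.endswith(".deltacommit"):
            instants.append(name)
    return sorted(instants)

# Placeholder path; replace with the actual table base path.
for instant in list_completed_instants(spark, "s3://bucket/path/to/table"):
    print(instant)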
@codope no, only one writer for each table. We have 3 streaming jobs and all three are suffering from the same problem. I'll get my configs to share with you.
from pyspark.sql.avro.functions import from_avro
from pyspark.sql.functions import col

# Assumes password and schema are defined earlier in the job (omitted here).
df = (
    spark.readStream.format("kafka")
    .option(
        "kafka.bootstrap.servers",
        "broker1,broker2,broker3",
    )
    .option("subscribe", "topic_name")
    .option("kafka.sasl.mechanism", "SCRAM-SHA-512")
    .option("kafka.security.protocol", "SASL_SSL")
    .option(
        "kafka.sasl.jaas.config",
        f'org.apache.kafka.common.security.scram.ScramLoginModule required username="USERNAME" password="{password}";',
    )
    .option("startingOffsets", "latest")
    .load()
    # Drop the first 5 bytes of the Kafka value (typically the Confluent wire-format header) before Avro decoding.
    .selectExpr("substring(value, 6) as avro_value")
    .select(from_avro(col("avro_value"), schema).alias("data"))
    .select(col("data.*"))
)
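For illustration only, a sketch of one way the schema variable could be populated from a Confluent-style schema registry; the registry URL and subject name below are placeholders, not our real endpoint.

import requests

registry_url = "https://schema-registry.example.com"  # placeholder
subject = "topic_name-value"  # placeholder subject for the topic's value schema

# The Confluent Schema Registry REST API returns the latest registered schema as JSON.
resp = requests.get(f"{registry_url}/subjects/{subject}/versions/latest")
resp.raise_for_status()
schema = resp.json()["schema"]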
query = (
    df.writeStream.format("hudi")
    .option(
        "checkpointLocation",
        "Checkpoint path",
    )
    .option("path", "Destination Path")
    .outputMode("append")
    .trigger(processingTime="60 seconds")
    .option("hoodie.table.name", "Table Name")
    .option("hoodie.datasource.write.recordkey.field", "id")
    .option("hoodie.datasource.write.table.name", "Table Name")
    .option("hoodie.datasource.hive_sync.enable", True)
    .option("hoodie.datasource.hive_sync.mode", "hms")
    .option("hoodie.datasource.hive_sync.database", "Database Name")
    .option("hoodie.datasource.hive_sync.table", "Table Name")
    .option(
        "hoodie.datasource.hive_sync.partition_extractor_class",
        "org.apache.hudi.hive.NonPartitionedExtractor",
    )
    .option("hoodie.datasource.hive_sync.support_timestamp", "true")
    .option("hoodie.upsert.shuffle.parallelism", 50)
    .option(
        "hoodie.datasource.write.keygenerator.class",
        "org.apache.hudi.keygen.NonpartitionedKeyGenerator",
    )
    .option("hoodie.datasource.write.row.writer.enable", "false")
    .option("hoodie.parquet.small.file.limit", 536870912)  # 512 MB
    .option("hoodie.parquet.max.file.size", 1073741824)  # 1 GB
    .option("hoodie.parquet.block.size", 536870912)  # 512 MB
    .option("hoodie.datasource.write.operation", "upsert")
    .option("hoodie.datasource.write.precombine.field", "__ts_ms")
    .option("hoodie.datasource.compaction.async.enable", False)
    .start()
)
query.awaitTermination()
We are thinking that the problem is on our side, because we have a job that restarts the streaming whenever it stops for any reason. I will update you soon.
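To illustrate what I mean, here is just a sketch, not our exact job; the submit command and sleep interval are placeholders. If the new run is launched while the previous driver or executors are still shutting down, the two runs could briefly overlap on the same table.

import subprocess
import time

# Placeholder command; client mode so the call blocks until the application exits.
SUBMIT_CMD = ["spark-submit", "streaming_job.py"]

while True:
    # Run the streaming application and wait for it to exit.
    exit_code = subprocess.call(SUBMIT_CMD)
    print(f"streaming job exited with code {exit_code}, restarting in 60s")
    # If the previous run has not fully terminated yet, the next launch would
    # momentarily put two writers on the same Hudi table.
    time.sleep(60)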
@codope: can you check these Jiras? We already made some fixes around this: https://issues.apache.org/jira/browse/HUDI-3393 https://issues.apache.org/jira/browse/HUDI-739
@rubenssoto: any updates for us in this regard? If it's resolved, can you close it out?
Hello everyone,
We have one Spark streaming job running 24/7 writing to a Hudi COW table. We use Spot instances, so sometimes the machine crashes and the streaming job is started again. Our source is Kafka.
Hudi: 0.12.1
EMR on EKS: 6.7
Spark: 3.2.1
After two days of running, the job had this error:
Do you have any idea what could cause this?
Thank you