Closed — xuzifu666 closed this issue 6 months ago.
@jonvex @ad1happy2go could you look into this issue?
@xuzifu666 When I tried the code below, it archived properly. Can you check it, or share your table/writer configurations?
```python
from faker import Faker
import pandas as pd
from pyspark.sql.functions import expr

fake = Faker()
data = [{"transactionId": fake.uuid4(), "EventTime": "2014-01-01 23:00:01", "storeNbr": "1",
         "FullName": fake.name(), "Address": fake.address(),
         "CompanyName": fake.company(), "JobTitle": fake.job(),
         "EmailAddress": fake.email(), "PhoneNumber": fake.phone_number(),
         "RandomText": fake.sentence(), "City": "US",
         "State": "NYC", "Country": "US"} for _ in range(5)]

hudi_options = {
    "hoodie.datasource.write.keygenerator.class": "org.apache.hudi.keygen.ComplexKeyGenerator",
    "hoodie.datasource.write.recordkey.field": "transactionId,storeNbr,EventTime",
    "hoodie.datasource.write.precombine.field": "Country",
    "hoodie.table.name": "huditransaction",
    "hoodie.datasource.write.operation": "insert_overwrite",
    "hoodie.datasource.write.partitionpath.field": "city",
}

pandas_df = pd.DataFrame(data)
df = spark.createDataFrame(pandas_df).withColumn("EventTime", expr("cast(EventTime as timestamp)"))

# Write the same batch repeatedly so the timeline accumulates commits
for i in range(1, 100):
    df.write.format("hudi").options(**hudi_options).mode("append").save(PATH)
```
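To confirm whether archival actually ran, one can inspect the timeline under the table base path. Below is a minimal sketch, assuming the Hudi 0.x layout where active instants live directly in `.hoodie/` and archived ones under `.hoodie/archived/`; the helper name is ours, and a local filesystem is assumed (HDFS would need a FileSystem client):

```python
import os

def count_timeline_files(base_path):
    """Count active vs archived timeline files for a Hudi table on a local
    filesystem (illustrative sketch, not an official Hudi API)."""
    hoodie = os.path.join(base_path, ".hoodie")
    # Active completed instants: *.commit / *.replacecommit files
    active = [f for f in os.listdir(hoodie)
              if f.endswith((".commit", ".replacecommit"))]
    # Archived timeline segments live in .hoodie/archived/
    archived_dir = os.path.join(hoodie, "archived")
    archived = (os.listdir(archived_dir)
                if os.path.isdir(archived_dir) else [])
    return len(active), len(archived)
```

If the archived count stays at zero while active commits keep growing past the retention window, archival is not kicking in.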
@ad1happy2go Hi, if you set the partition key 'city' to a different value in each iteration of range(1, 100), it will not archive. Please try it again, thanks.
@xuzifu666 In case you mean updating the above code like this:

```python
from pyspark.sql.functions import lit

# Write each batch to a different partition so every commit touches new data
for i in range(1, 100):
    df = df.withColumn("city", lit(i))
    df.write.format("hudi").options(**hudi_options).mode("append").save(PATH)
```
With the above change, yes, no commit was archived. But I am wondering why they would even be archived: all partitions' data is still valid, so all the commits are valid and should remain active.
> yes there was no commit which was archiving.
Hi @ad1happy2go, is this case also archived?
No, it's not archiving. But why do you think they should be archived? All these commits are still valid and need to be read in this case, so they should remain active.
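To make that reasoning concrete, here is an illustrative sketch (not Hudi's actual archiver, and the function name is hypothetical) of why a simplified archiver finds nothing to archive when every commit's data is still live, as in the one-partition-per-commit reproduction above:

```python
# Sketch of a simplified archival eligibility rule: an instant can be
# archived only if it falls outside the retained window AND its data is no
# longer referenced by any live file group. When each of the 99 commits
# wrote a distinct partition (city = 1..99), every commit still backs the
# latest data in some partition, so nothing qualifies.
def eligible_for_archival(timeline, keep_last_n, live_instants):
    """timeline: ordered instant times; live_instants: instants whose files
    are still the latest version in some partition."""
    retained = set(timeline[-keep_last_n:])
    return [t for t in timeline
            if t not in retained and t not in live_instants]
```

Under this rule, if every commit is live (distinct partitions), the eligible list is empty; if repeated overwrites leave only the latest commit live, everything outside the retained window becomes archivable.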
Yes, it should not be archived.
Sorry @xuzifu666, I didn't understand. Are you saying it should be archived, or not?
No problem, I only wanted to confirm with you. Thanks for your reply; we can close the issue.
Describe the problem you faced
Insert overwrite with replacement instants does not execute archival.
To Reproduce
Steps to reproduce the behavior:
Expected behavior
Archival should execute on a regular schedule.
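For reference, the archival window in Hudi is governed by the timeline retention configs; a minimal sketch of the relevant writer options (the specific values here are illustrative, not recommendations):

```python
# Illustrative retention settings; tune for your workload.
archival_options = {
    # Archive once active commits exceed max; always keep at least min active.
    "hoodie.keep.min.commits": "20",
    "hoodie.keep.max.commits": "30",
    # Cleaner retention must stay below hoodie.keep.min.commits so cleaning
    # happens before commits leave the active timeline.
    "hoodie.cleaner.commits.retained": "10",
}
```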
Environment Description
Hudi version : 0.14.0
Spark version : 3.2.0
Storage (HDFS/S3/GCS..) : HDFS