apache / hudi

Upserts, Deletes And Incremental Processing on Big Data.
https://hudi.apache.org/
Apache License 2.0

[SUPPORT] using spark's observe feature on dataframes saved by hudi is stuck #11367

Open szingerpeter opened 4 weeks ago

szingerpeter commented 4 weeks ago

Describe the problem you faced

When I use Spark's observe function on a DataFrame that is then written with Hudi, the application hangs after the write, at the point where it retrieves the observed metrics. The same pattern works as expected with a CSV write.

To Reproduce

Steps to reproduce the behavior:

from pyspark.sql import DataFrame, Observation
from pyspark.sql import functions as F
observation = Observation()

df = spark.createDataFrame([[1, 1], [2, 2], [3, 3], [4, 4]])

df = df.observe(observation, F.count(F.lit(1)).alias('row_count'))

df.write.format('csv').mode('overwrite').save('file:/opt/spark/work-dir/test_csv')

observation.get # returns: {'row_count': 4}

observation2 = Observation()
df2 = spark.createDataFrame([[1, 1], [2, 2], [3, 3], [4, 4]])

hudi_options = {
    'hoodie.table.name': 'test',
    'hoodie.datasource.write.recordkey.field': '_1',
    'hoodie.datasource.write.partitionpath.field': '',
    'hoodie.datasource.write.table.name': 'test',
    'hoodie.datasource.write.operation': 'insert_overwrite',
    'hoodie.datasource.write.precombine.field': '_2',
}

df2 = df2.observe(observation2, F.count(F.lit(1)).alias('row_count'))

# Write df2, the DataFrame that observation2 is attached to (the original snippet wrote df here)
df2.write.format("hudi").\
    options(**hudi_options).\
    mode("overwrite").\
    save('file:/opt/spark/work-dir/test')

observation2.get # gets stuck
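
While debugging, it may help to keep the driver from blocking forever on the stuck call. In PySpark, Observation.get is a property that blocks until the metrics arrive, so the sketch below reads it on a daemon thread and gives up after a timeout. The helper name get_observation_with_timeout and its timeout_s parameter are hypothetical and not part of Spark or Hudi, and the sketch only detects the hang; it does not explain or fix why the metrics never arrive.

```python
import threading


def get_observation_with_timeout(observation, timeout_s=30.0):
    """Return observation.get, or None if it has not resolved after timeout_s seconds.

    Hypothetical helper for diagnosis only; not a Spark or Hudi API.
    """
    result = {}

    def _fetch():
        # Observation.get is a property and blocks until the metrics are available.
        result["metrics"] = observation.get

    # A daemon thread does not prevent the interpreter from exiting
    # if the call never returns.
    worker = threading.Thread(target=_fetch, daemon=True)
    worker.start()
    worker.join(timeout_s)
    return result.get("metrics")
```

Calling get_observation_with_timeout(observation2, 60) after the Hudi write would return None instead of hanging if no metrics are reported within a minute.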

Disclaimer: I know there are Hudi metrics and callbacks; however, I would like to add some more advanced quality checks to our applications.

Environment Description

szingerpeter commented 3 weeks ago

@codope is there anything else needed from my side at the moment?

ad1happy2go commented 3 weeks ago

@szingerpeter I will look into it. Sorry for the delay here.

szingerpeter commented 3 weeks ago

@ad1happy2go, thank you!

szingerpeter commented 1 week ago

@ad1happy2go, did you have a chance to take a look at the issue?

ad1happy2go commented 1 week ago

@szingerpeter Sorry again. I got swamped with some other urgent tasks. I will try to look into it by the end of this week.