awslabs / deequ

Deequ is a library built on top of Apache Spark for defining "unit tests for data", which measure data quality in large datasets.
Apache License 2.0

Performance impact when trying to generate profiling report for more than 200 columns #534

Open eapframework opened 9 months ago

eapframework commented 9 months ago

I am encountering performance issues when generating a profiling report for more than 200 columns across 5 million records. I am applying almost all of the available metrics, such as datatype, entropy, minimum, maximum, sum, standard deviation, mean, maxlength, minlength, histogram, completeness, distinctness, uniquevalueratio, uniqueness, countdistinct, and correlation. I am trying to generate a report similar to ydata-profiling (https://github.com/ydataai/ydata-profiling).
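For context, a trimmed-down sketch of the kind of setup I mean, using Deequ's `AnalysisRunner` (column names are placeholders and only a few of the listed metrics are shown; the real job covers all ~200 columns):

```scala
import com.amazon.deequ.analyzers._
import com.amazon.deequ.analyzers.runners.{AnalysisRunner, AnalyzerContext}
import com.amazon.deequ.analyzers.runners.AnalyzerContext.successMetricsAsDataFrame

// df is the 5M-row input DataFrame; "col_a" and "col_b" stand in for the real columns
val result: AnalyzerContext = AnalysisRunner
  .onData(df)
  .addAnalyzer(Completeness("col_a"))
  .addAnalyzer(Distinctness("col_a"))
  .addAnalyzer(Entropy("col_a"))
  .addAnalyzer(Histogram("col_a"))
  .addAnalyzer(CountDistinct("col_a"))
  .addAnalyzer(Minimum("col_b"))
  .addAnalyzer(Maximum("col_b"))
  .addAnalyzer(Mean("col_b"))
  .addAnalyzer(StandardDeviation("col_b"))
  .run()

// Collect all computed metrics into one DataFrame for the report
val metricsDf = successMetricsAsDataFrame(spark, result)
metricsDf.show(truncate = false)
```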

The job has been running for over 3 hours despite attempts to tune the Spark configuration. The logs show that each metric is calculated sequentially, and this sequential computation is what prolongs the runtime. Is it possible to parallelize this operation for improved efficiency?
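One idea would be to submit independent per-column runs from several driver threads, since Spark can schedule jobs from multiple threads within one SparkContext. A rough, unverified sketch (the analyzer choice, thread-pool size, and `df` are illustrative assumptions):

```scala
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration

import com.amazon.deequ.analyzers.{CountDistinct, Histogram}
import com.amazon.deequ.analyzers.runners.{AnalysisRunner, AnalyzerContext}

// Bounded thread pool on the driver; each thread submits its own Spark jobs
implicit val ec: ExecutionContext =
  ExecutionContext.fromExecutorService(Executors.newFixedThreadPool(8))

// Caching avoids re-reading the source data for every per-column run
df.cache()

// One Deequ run per column, submitted concurrently
val perColumnRuns: Seq[Future[(String, AnalyzerContext)]] = df.columns.toSeq.map { col =>
  Future {
    val ctx = AnalysisRunner
      .onData(df)
      .addAnalyzer(CountDistinct(col))
      .addAnalyzer(Histogram(col))
      .run()
    col -> ctx
  }
}

val results: Seq[(String, AnalyzerContext)] =
  Await.result(Future.sequence(perColumnRuns), Duration.Inf)
```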

rdsharma26 commented 8 months ago

Thanks for the feedback, @eapframework. We will investigate this issue and get back to you with an update.

eapframework commented 7 months ago

Hi @rdsharma26, I did some more testing. By analyzing the Spark execution tasks, I believe the performance issue is that for metrics such as CountDistinct and Histogram, the calculation is performed on each column sequentially, so the more columns the DataFrame has, the longer the job runs. Parallelizing these calculations would improve efficiency.
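In the meantime, one mitigation on my side might be to swap the exact CountDistinct for ApproxCountDistinct, which, as far as I understand, is computed as a regular aggregate and so should not need a separate grouping job per column. A rough sketch, assuming `df` is the input DataFrame:

```scala
import com.amazon.deequ.analyzers.ApproxCountDistinct
import com.amazon.deequ.analyzers.runners.AnalysisRunner

// Approximate distinct counts (HLL-based) for every column in a single run,
// trading a small approximation error for fewer per-column jobs
val builder = df.columns.foldLeft(AnalysisRunner.onData(df)) { (runner, col) =>
  runner.addAnalyzer(ApproxCountDistinct(col))
}
val context = builder.run()
```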