awslabs / deequ

Deequ is a library built on top of Apache Spark for defining "unit tests for data", which measure data quality in large datasets.
Apache License 2.0
3.25k stars · 532 forks

Improving performance of histogram analyzer on 150 columns #300

Open eapframework opened 3 years ago

eapframework commented 3 years ago

Hi,

I am trying to run the Histogram analyzer on 150 columns by adding one analyzer per column from a list. The code works, but it takes nearly an hour to run.

    val columnNames = List("col1", "col2", "col3", "col4", /* … */ "col149", "col150")
    var analysisResult = AnalysisRunner.onData(dataFrame)
    for (column <- columnNames) {
      analysisResult = analysisResult.addAnalyzer(Histogram(column))
    }
    val metricsResult = analysisResult.run()

Can you please help optimize the performance? I am running with 50 executors, 3 cores per executor, and spark.sql.shuffle.partitions = 150.

Is it possible to process each column on a separate executor to improve performance? I suspect the record shuffle among executors is hurting performance.

Thanks!

sscdotopen commented 3 years ago

The problem is that the histogram analyzer needs to compute the exact count for each bucket, so there is no way to avoid shuffling.
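To illustrate why exact bucket counts force a shuffle: each partition can only compute partial counts for the values it holds, and partials for the same value must be brought together to be summed; in Spark, that merge by key is exactly what the shuffle does. A minimal pure-Scala sketch (local collections standing in for Spark partitions; all names hypothetical):

```scala
// Hypothetical illustration: exact histogram bucket counts require merging
// per-partition partial counts; in Spark, that merge step is the shuffle.
val partitions = Seq(Seq("a", "b", "a"), Seq("b", "c"))

// Step 1 (map side): each "partition" counts its own values locally.
val partialCounts = partitions.map(_.groupBy(identity).map { case (k, v) => k -> v.size })

// Step 2 (shuffle side): partial counts for the same key must meet to be summed.
val merged = partialCounts
  .flatten
  .groupBy(_._1)
  .map { case (k, pairs) => k -> pairs.map(_._2).sum }
// merged == Map("a" -> 2, "b" -> 2, "c" -> 1)
```

No partition alone can produce the final count for "b", which is why the per-key merge across partitions cannot be skipped if the counts must be exact.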

utkarshshukla2912 commented 3 years ago

I am facing this issue too; as the number of columns goes beyond 100, performance deteriorates. Can we build a histogram analyzer that uses approximate counting for faster calculation? @sscdotopen
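One way an approximate analyzer could trade accuracy for speed is to count on a sample and scale by the sampling fraction (Deequ already exposes an `ApproxCountDistinct` analyzer for cardinality; a similar idea could apply per bucket). A hypothetical sketch in plain Scala collections, not Deequ's API:

```scala
// Hypothetical sketch: approximate bucket counts from a 10% sample,
// scaled back up by the sampling fraction.
val data = Seq.fill(1000)("a") ++ Seq.fill(500)("b")
val fraction = 0.1

// Deterministic "sample" for illustration: keep every 10th record.
val sample = data.zipWithIndex.collect { case (v, i) if i % 10 == 0 => v }

val approxCounts = sample
  .groupBy(identity)
  .map { case (k, v) => k -> math.round(v.size / fraction) }
// approxCounts == Map("a" -> 1000, "b" -> 500) for this evenly laid-out data
```

A real implementation would use Spark-side sampling or sketch-based aggregates rather than a client-side scan, but the accuracy/speed trade-off is the same.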

sscdotopen commented 3 years ago

That should be possible, would you like to work on that?

utkarshshukla2912 commented 3 years ago

@sscdotopen Yes, I would like to contribute to this.

academy-codex commented 3 years ago

What do you think about batching the columns, 100 per batch? That has worked well for us. @sscdotopen @utkarshshukla2912
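The batching idea can be sketched like this (hypothetical names; each batch would be submitted as its own analysis pass, so no single job carries all 150 histograms):

```scala
// Hypothetical sketch: split 150 columns into batches of at most 100,
// then run one analysis per batch instead of one giant analysis.
val columnNames = (1 to 150).map(i => s"col$i").toList
val batches = columnNames.grouped(100).toList
// batches.map(_.size) == List(100, 50)

// Each batch would then get its own run, e.g. (Spark/Deequ calls not executed here):
// for (batch <- batches) {
//   var builder = AnalysisRunner.onData(dataFrame)
//   batch.foreach(c => builder = builder.addAnalyzer(Histogram(c)))
//   val result = builder.run()
// }
```

Caching the input DataFrame before looping over batches would avoid re-reading the source data for every batch.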