awslabs / deequ

Deequ is a library built on top of Apache Spark for defining "unit tests for data", which measure data quality in large datasets.
Apache License 2.0
3.25k stars · 532 forks

Improving performance of histogram analyzer on 150 columns #300

Open eapframework opened 3 years ago

eapframework commented 3 years ago

Hi,

I am trying to run the Histogram analyzer on 150 columns by adding one analyzer per column from a list. The code works, but it takes nearly an hour to run.

    val columnNames = List("col1", "col2", "col3", "col4", /* … */ "col149", "col150")
    var analysisResult = AnalysisRunner.onData(dataFrame)
    for (column <- columnNames) {
      analysisResult = analysisResult.addAnalyzer(Histogram(column))
    }
    val metricsResult = analysisResult.run()

Can you please help optimize the performance? I am running with 50 executors, 3 cores per executor, and spark.sql.shuffle.partitions = 150.

Is it possible to process each column on a separate executor to improve performance? I suspect the record shuffle among executors is hurting performance.

Thanks!

sscdotopen commented 3 years ago

The problem is that the histogram analyzer needs to compute the exact count for each bucket, so there is no way to avoid shuffling.
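To illustrate why exact bucket counts force a shuffle: each partition can only compute partial counts for the values it holds, and partials for the same value must be brought together to be summed; in Spark, that merge by key is exactly what the shuffle does. A minimal pure-Scala sketch (local collections standing in for Spark partitions; all names hypothetical):

```scala
// Hypothetical illustration: exact histogram bucket counts require merging
// per-partition partial counts; in Spark, that merge step is the shuffle.
val partitions = Seq(Seq("a", "b", "a"), Seq("b", "c"))

// Step 1 (map side): each "partition" counts its own values locally.
val partialCounts = partitions.map(_.groupBy(identity).map { case (k, v) => k -> v.size })

// Step 2 (shuffle side): partial counts for the same key must meet to be summed.
val merged = partialCounts
  .flatten
  .groupBy(_._1)
  .map { case (k, pairs) => k -> pairs.map(_._2).sum }
// merged == Map("a" -> 2, "b" -> 2, "c" -> 1)
```

No partition alone can produce the final count for "b", which is why the per-key merge across partitions cannot be skipped if the counts must be exact.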

utkarshshukla2912 commented 3 years ago

I am facing this issue too; as the number of columns goes beyond 100, performance deteriorates. Can we build a histogram analyzer that uses approximate counting for faster calculation? @sscdotopen
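One way an approximate analyzer could trade accuracy for speed is to count on a sample and scale by the sampling fraction (Deequ already exposes an `ApproxCountDistinct` analyzer for cardinality; a similar idea could apply per bucket). A hypothetical sketch in plain Scala collections, not Deequ's API:

```scala
// Hypothetical sketch: approximate bucket counts from a 10% sample,
// scaled back up by the sampling fraction.
val data = Seq.fill(1000)("a") ++ Seq.fill(500)("b")
val fraction = 0.1

// Deterministic "sample" for illustration: keep every 10th record.
val sample = data.zipWithIndex.collect { case (v, i) if i % 10 == 0 => v }

val approxCounts = sample
  .groupBy(identity)
  .map { case (k, v) => k -> math.round(v.size / fraction) }
// approxCounts == Map("a" -> 1000, "b" -> 500) for this evenly laid-out data
```

A real implementation would use Spark-side sampling or sketch-based aggregates rather than a client-side scan, but the accuracy/speed trade-off is the same.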

sscdotopen commented 3 years ago

That should be possible, would you like to work on that?

utkarshshukla2912 commented 3 years ago

@sscdotopen Yes, I would like to contribute to this.

academy-codex commented 3 years ago

What do you think about batching the columns, 100 per batch? That has worked well for us. @sscdotopen @utkarshshukla2912
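The batching idea can be sketched like this (hypothetical names; each batch would be submitted as its own analysis pass, so no single job carries all 150 histograms):

```scala
// Hypothetical sketch: split 150 columns into batches of at most 100,
// then run one analysis per batch instead of one giant analysis.
val columnNames = (1 to 150).map(i => s"col$i").toList
val batches = columnNames.grouped(100).toList
// batches.map(_.size) == List(100, 50)

// Each batch would then get its own run, e.g. (Spark/Deequ calls not executed here):
// for (batch <- batches) {
//   var builder = AnalysisRunner.onData(dataFrame)
//   batch.foreach(c => builder = builder.addAnalyzer(Histogram(c)))
//   val result = builder.run()
// }
```

Caching the input DataFrame before looping over batches would avoid re-reading the source data for every batch.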