Open eapframework opened 3 years ago
The problem is that the histogram analyzer needs to compute the exact count for each bucket, so there is no way to avoid shuffling.
I am facing this issue too; as the number of columns goes beyond 100, the performance deteriorates. Could we build a histogram analyzer that uses an approximate distinct count for faster calculation? @sscdotopen
That should be possible, would you like to work on that?
@sscdotopen Yes, I would like to contribute to this.
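For anyone who wants to experiment with the idea before touching Deequ: the accuracy-for-speed trade-off behind an approximate distinct count can be illustrated with a toy HyperLogLog-style sketch in plain Python. This is not Deequ's implementation (a real analyzer would more likely delegate to Spark's `approx_count_distinct`, which is backed by HyperLogLog++); the function name `approx_distinct` and the precision parameter `p` are assumptions for illustration only.

```python
import hashlib
import math


def _hash64(value):
    # Stable 64-bit hash; Python's built-in hash() is salted per process,
    # so we derive one from SHA-1 instead.
    digest = hashlib.sha1(str(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def approx_distinct(values, p=12):
    """Toy HyperLogLog sketch: estimate the number of distinct values.

    Uses m = 2**p registers; memory stays constant no matter how many
    rows are scanned, which is the whole point versus an exact count.
    """
    m = 1 << p
    registers = [0] * m
    for v in values:
        h = _hash64(v)
        idx = h >> (64 - p)                  # first p bits pick a register
        rest = h & ((1 << (64 - p)) - 1)     # remaining bits feed the rank
        # rank = position of the leftmost 1-bit in the remaining bits
        rank = (64 - p) - rest.bit_length() + 1
        registers[idx] = max(registers[idx], rank)
    alpha = 0.7213 / (1 + 1.079 / m)
    raw = alpha * m * m / sum(2.0 ** -r for r in registers)
    zeros = registers.count(0)
    if raw <= 2.5 * m and zeros:
        # Small-range correction: fall back to linear counting.
        return m * math.log(m / zeros)
    return raw
```

Because each column's sketch is a small fixed-size array, sketches from different partitions can be merged by taking element-wise register maxima, which is what makes this approach cheap to aggregate compared with an exact per-bucket count.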
What do you think about batching the columns, processing 100 columns per batch? That has worked well for us. @sscdotopen @utkarshshukla2912
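As a sketch of that batching idea (the helper names here, such as `run_histograms`, are hypothetical and not Deequ API): split the column list into fixed-size chunks and run one analysis pass per chunk, so that no single Spark job has to carry hundreds of histogram aggregations at once.

```python
def batch_columns(columns, batch_size=100):
    """Split a wide column list into fixed-size batches."""
    return [columns[i:i + batch_size] for i in range(0, len(columns), batch_size)]


# Hypothetical usage: run_histograms stands in for whatever builds and
# executes an analysis run (e.g. a Deequ AnalysisRunner) over one batch.
#
# results = []
# for batch in batch_columns(all_columns, batch_size=100):
#     results.append(run_histograms(df, batch))
```

Caching the input DataFrame before the loop is worth considering, since each batch otherwise rescans the source data.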
Hi,
I am trying to run the Histogram analyzer on 150 columns by appending an analyzer per column from a list. The code works, but it takes nearly an hour to run.
Can you please help optimize the performance? I am running with 50 executors, 3 cores per executor, and spark.sql.shuffle.partitions = 150.
Is it possible to process each column on a separate executor to improve performance? I suspect the record shuffling between executors is hurting performance.
Thanks!