Currently, read_sql calculates partition boundaries using PERCENTILE_DISC, falling back to using min - max if needed. PERCENTILE_DISC is not scalable, as it involves expensive operations such as windowing and sorting.
Describe the solution you'd like
We should instead calculate percentiles by taking samples from the input table. This will allows trade off btw accuracy and computational complexity (both time and space complexities). The parameter to control this trade off is the sampling size. The large the sampling size, the more accurate, but it will be more computational complex. If smaller sampling size will be more likely to have uneven sized partition (skew).
Is your feature request related to a problem?
Currently,
read_sql
calculates partition boundaries usingPERCENTILE_DISC
, falling back to using min - max if needed.PERCENTILE_DISC
is not scalable, as it involves expensive operations such as windowing and sorting.Describe the solution you'd like
We should instead calculate percentiles by taking samples from the input table. This will allows trade off btw accuracy and computational complexity (both time and space complexities). The parameter to control this trade off is the sampling size. The large the sampling size, the more accurate, but it will be more computational complex. If smaller sampling size will be more likely to have uneven sized partition (skew).
Describe alternatives you've considered
No response
Additional Context
No response
Would you like to implement a fix?
No