NVIDIA / spark-rapids-tools

User tools for Spark RAPIDS
Apache License 2.0

[FEA] Fingerprint tables by reading the statistics from external file #1336

Open nartal1 opened 1 month ago

nartal1 commented 1 month ago

Is your feature request related to a problem? Please describe.
This issue is to add the ability to create representative datasets based on the statistics of the input datasets that are part of a join operation.
It also covers creating a prototype and tracking the exploratory work to get the statistics required to generate representative data for joins and HashAggregates (if there are spills).
Reference issue: https://github.com/NVIDIA/spark-rapids/issues/11239
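
To make the idea concrete, here is a minimal Python sketch of generating a representative join-key column from per-column statistics read from an external file. It is only an illustration. The JSON layout, the field names (`row_count`, `distinct_count`, `null_fraction`, `min`, `max`), and the `generate_column` helper are all hypothetical and are not a format or API defined by spark-rapids-tools. Values are drawn uniformly, so key skew, which matters for joins, is not modeled.

```python
import json
import random

# Hypothetical statistics file content. The field names are assumptions
# made for this sketch, not a format defined by spark-rapids-tools.
STATS_JSON = """
{
  "table": "orders",
  "row_count": 1000,
  "columns": {
    "customer_id": {"distinct_count": 50, "null_fraction": 0.1, "min": 1, "max": 500}
  }
}
"""


def generate_column(col_stats, row_count, seed=0):
    """Generate integer values that roughly match the given column statistics."""
    rng = random.Random(seed)
    lo, hi = col_stats["min"], col_stats["max"]
    # The number of distinct values cannot exceed the size of the range.
    distinct = max(1, min(col_stats["distinct_count"], hi - lo + 1))

    # Build the pool of distinct values, always keeping min and max
    # so that the generated range matches the statistics.
    if distinct == 1:
        pool = [lo]
    else:
        pool = [lo, hi] + rng.sample(range(lo + 1, hi), distinct - 2)

    null_count = round(col_stats["null_fraction"] * row_count)
    non_null_count = row_count - null_count

    if non_null_count >= len(pool):
        # Use every distinct value once, then fill the rest uniformly.
        values = list(pool) + rng.choices(pool, k=non_null_count - len(pool))
    else:
        values = rng.sample(pool, non_null_count)

    values += [None] * null_count
    rng.shuffle(values)
    return values


stats = json.loads(STATS_JSON)
customer_ids = generate_column(stats["columns"]["customer_id"], stats["row_count"])
```

The generated column keeps the row count, null count, distinct count, and min/max from the statistics. Modeling skew would need extra input such as a histogram or the most frequent values, which this sketch does not assume is available.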

amahussein commented 1 month ago

Currently, I am working on a few issues related to datasourceInfo, like #1172. I believe this would change the current implementation of extracting datasources and the details of SQLInfo.

nartal1 commented 1 month ago

Thanks @amahussein. I will discuss this with you. Currently, I am looking into getting the stats from tables (assuming we have the table name).