revans2 opened this issue 3 months ago
@Feng-Jiang28 you are 100% correct that this does not do anything about correlation between columns either in a data set or between data sets. It also assumes that all of the data has a normal distribution, which is not always accurate. I started out with features that would help us with testing JSON, and get_json_object specifically. I also wanted to gather statistics that could be used directly with the datagen tool.
If you feel that there isn't enough value in the current set of statistics, then we can expand the scope to make it more useful/representative, but we have to balance that against the time it takes to calculate those statistics. Also, the stats will never be perfect, so I really want to start with a minimum viable product and then expand on it from there.
Do we need to replicate all joins and aggregates accurately? What types of joins? Are we only concerned about equality joins, or do we need to support other join types too? Do we need a way to map scalar values in a query to corresponding scalar values in the generated data? Do we need to detect if/how the data is laid out on disk? Is it partitioned or sorted? What about strings with specific formats? We have some things for JSON now, but what about dates/timestamps?
Equality seems important to me, and I would be happy to have us add features to make sure that we can reproduce equality joins and aggregates at least somewhat accurately.
The datagen tool can deal with equality joins/aggregates by limiting the range and distribution of seeds that a column, or set of columns, use to generate data. https://github.com/NVIDIA/spark-rapids/blob/branch-24.10/datagen/README.md
Typically I will manually look at the columns/rows going into a join and the number of rows coming out. We can then set up a distribution, typically flat, and key ranges so that each of the keys on both sides of the join has a fairly common number of duplicates, matching the output we saw. But that is not really the best way to do it. It would be much better to take the input values and match their distribution to one that we can support, along with statistics on how to represent that distribution: is it flat, distinct, normal, exponential, or something else? I am no stats expert, but that can get rather expensive because it would require at least two passes over the data: one to calculate the mean and standard deviation, and another to test whether the distribution matches what we calculated. We could probably get a distinct distribution from what we have today, because we are already doing an approximate count distinct on the keys, and that would not be too difficult.
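Just to make the manual process concrete, here is a rough sketch of the kind of numbers I look at, using plain Spark APIs. This is not part of any tool; the paths, table, and `join_key` column name are made up.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().getOrCreate()

// Hypothetical input tables for the join being inspected.
val left = spark.read.parquet("/data/left")
val right = spark.read.parquet("/data/right")

// Rows and approximate distinct keys going into each side of the join.
left.agg(count(lit(1)).as("rows"),
  approx_count_distinct(col("join_key")).as("approx_distinct_keys")).show()
right.agg(count(lit(1)).as("rows"),
  approx_count_distinct(col("join_key")).as("approx_distinct_keys")).show()

// Rows coming out, which gives a rough average number of duplicates per key.
println(left.join(right, Seq("join_key")).count())
```

From those numbers we can pick a key range and a (typically flat) distribution in the datagen tool so that the duplicate counts roughly match what we observed.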
If we do have a mean + stddev + distribution it can work for numeric values, but I don't know how to do that with a string. What is the mean of a string? What is the stddev of a string? We could do it on a hash of the string, but that is far from a great solution. I also don't know if we need to obfuscate the mean and stddev of numeric values, in case they might be sensitive for some customers.
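As a hedged sketch of the hash-based workaround, using Spark's built-in hash functions (the DataFrame `df` and the column name are made up):

```scala
import org.apache.spark.sql.functions._

// One possible, if imperfect, way to get a "mean" and "stddev" for a string
// column: hash each string to a 64-bit integer first.
df.agg(
  mean(xxhash64(col("name"))).as("hashed_mean"),
  stddev(xxhash64(col("name"))).as("hashed_stddev")).show()
```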
(Reposting the deleted comment) This is a really interesting feature. With this tool we could probably reproduce tables that resemble those in a production environment and debug locally instead of running the plugin in production. I know there are tools like pandas-profiling that can be used to generate comprehensive profiling data about a DataFrame or a dataset, but I don't think they can generate dummy tables (not sure). If the tool is mainly about reproducing issues, it is worth mentioning that there may be a lot of features of the dataset to consider in order to reproduce something close to the production dataset. When joining, for example, payload columns, distribution type, Zipf factor, selectivity, and so on all play important roles in performance. As we develop the tool, we need to decide which features of the input we need.
This draft guideline outlines data collection procedures for EDA (Exploratory Data Analysis). The guideline aims to facilitate the development of a fingerprinting tool capable of collecting fingerprint data, which datagen can then use to generate data.
The Pandas Profiling tool supports 9 data types. I will describe 3 representative data types here.
**Numeric** (examples: Age, Balance, Duration)

| Data | Explanation |
|---|---|
| **Central Tendency** | |
| Mean | Average of the values. |
| Median | Middle value when sorted. |
| Mode | Most frequently occurring value. |
| **Range** | |
| Minimum | Smallest value. |
| Maximum | Largest value. |
| **Dispersion** | |
| Standard Deviation | Average distance from the mean. |
| Variance | Squared standard deviation. |
| **Quantiles** | |
| 5th Percentile | Value below which 5% of the data falls. |
| Q1 (25th Percentile) | Lower quartile. |
| Q3 (75th Percentile) | Upper quartile. |
| 95th Percentile | Value below which 95% of the data falls. |
| **Distribution Shape** | |
| Skewness | Asymmetry of the distribution. |
| Kurtosis | Peakedness of the distribution. |
| **Special Cases** | |
| Zeros Count | Number of zero values. |
| Negative Count | Number of negative values. |
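For reference, most of the numeric statistics above map onto built-in Spark SQL aggregates. A rough sketch of how a fingerprinting pass might collect them in a single aggregation (the DataFrame `df` and the column name `balance` are made up):

```scala
import org.apache.spark.sql.functions._

val numericStats = df.agg(
  mean(col("balance")).as("mean"),
  expr("percentile_approx(balance, 0.5)").as("median"),
  min(col("balance")).as("min"),
  max(col("balance")).as("max"),
  stddev(col("balance")).as("stddev"),
  variance(col("balance")).as("variance"),
  expr("percentile_approx(balance, array(0.05, 0.25, 0.75, 0.95))").as("quantiles"),
  skewness(col("balance")).as("skewness"),
  kurtosis(col("balance")).as("kurtosis"),
  count(when(col("balance") === 0, 1)).as("zeros_count"),
  count(when(col("balance") < 0, 1)).as("negative_count"))
// Mode is not a single built-in aggregate in older Spark versions; it would need a
// separate groupBy/count, or the mode() aggregate in newer Spark releases.
numericStats.show()
```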
**Categorical / String** (examples: Job, Marital Status, Education)

| Data | Explanation |
|---|---|
| **Length** | |
| Max length | The longest string in the dataset. |
| Median length | The middle value of the string lengths. |
| Mean length | The average length of the strings. |
| Min length | The shortest string in the dataset. |
| **Characters and Unicode** | |
| Total characters | Total number of characters. |
| Distinct characters | Unique characters used in the dataset. |
| Distinct categories | Number of distinct Unicode General_Category values. |
| Distinct scripts | Number of distinct Unicode scripts (a script is one writing system, like Latin). |
| Distinct blocks | Number of distinct Unicode blocks (Unicode_block). |
| **Unique** | |
| Unique | Number of unique entries in the dataset. |
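A similar sketch for the string statistics above, again with a made-up DataFrame `df` and column name `job`:

```scala
import org.apache.spark.sql.functions._

// Length and uniqueness statistics for a string column.
val withLen = df.select(col("job"), length(col("job")).as("len"))
val stringStats = withLen.agg(
  max(col("len")).as("max_length"),
  expr("percentile_approx(len, 0.5)").as("median_length"),
  avg(col("len")).as("mean_length"),
  min(col("len")).as("min_length"),
  sum(col("len")).as("total_characters"),
  approx_count_distinct(col("job")).as("approx_unique"))

// Distinct characters, by exploding each string into its characters.
val distinctChars = df
  .select(explode(split(col("job"), "")).as("ch"))
  .agg(approx_count_distinct(col("ch")).as("distinct_characters"))
```

The Unicode category, script, and block counts would probably need a small UDF built on `java.lang.Character.getType`, `Character.UnicodeScript.of`, and `Character.UnicodeBlock.of`.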
**Boolean**

| Name | Explanation |
|---|---|
| Distinct | Number of unique Boolean values. |
| Distinct (%) | Percentage of distinct values compared to the total number of entries. |
| Missing | Count of missing or null entries in the column. |
| Missing (%) | Percentage of entries that are missing. |
| Count | Total count of each Boolean value in the column. |
| Frequency (%) | Percentage of each Boolean value. |
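And a short sketch for the Boolean statistics (the DataFrame `df` and the column name are made up):

```scala
import org.apache.spark.sql.functions._

// Per-value counts and frequencies for a Boolean column.
val total = df.count()
df.groupBy(col("default"))
  .agg(count(lit(1)).as("count"))
  .withColumn("frequency_pct", col("count") * 100.0 / total)
  .show()

// Distinct and missing counts.
df.agg(
  approx_count_distinct(col("default")).as("distinct"),
  count(when(col("default").isNull, 1)).as("missing")).show()
```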
**Join characteristics**

| Name | Explanation |
|---|---|
| Relation Size | Sizes of the R and S relations (tables). |
| Payload | Number of payload columns in the relations. |
| Join Type | Primary key-foreign key (pkfk) or foreign key-foreign key (fkfk). |
| Distribution | Type of data distribution (Uniform or Zipf). |
| Selectivity | Selectivity of the join (only valid for pkfk). |
| Unique Keys | Number of unique keys (only valid for fkfk). |
| Zipf Factor | Zipf factor for the data distribution. |
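One possible shape for recording the join-related features above, purely as a sketch; none of these names exist in the tool today:

```scala
// Hypothetical record for join-related fingerprint data.
case class JoinFingerprint(
    leftRows: Long,               // size of relation R
    rightRows: Long,              // size of relation S
    payloadColumns: Int,          // number of payload columns
    joinType: String,             // "pkfk" or "fkfk"
    distribution: String,         // "uniform" or "zipf"
    selectivity: Option[Double],  // only meaningful for pkfk joins
    uniqueKeys: Option[Long],     // only meaningful for fkfk joins
    zipfFactor: Option[Double])   // only meaningful for the Zipf distribution
```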
**Is your feature request related to a problem? Please describe.**
https://github.com/NVIDIA/spark-rapids/pull/11060 added in an ability to do some basic fingerprinting of JSON columns. But it is rough and needs to be cleaned up. Also it would be great if we could extend these concepts to more than just JSON columns so we could fingerprint an entire table with a single command.
**Describe the solution you'd like**
First I want us to update/refactor the fingerprint code added in #11060 so it is in a separate file and it is not JSON specific.
The core ideas behind it are good, but it needs to be extended a little. We still need to calculate a few statistics about each column/child column.
Along with all of this we need a "path". The path is supposed to provide enough information that we can reproduce the original input schema along with the same kind of information used for JSON generation so we can generate JSON data with conflicting types, like the current JSON fingerprinting tool does today.
All of this would be stored in a parquet table so we can read it back in and generate representative data.
The base framework for the JSON processing is decent. It should be able to scale to very large numbers of columns and values, so I don't see a lot of need to completely replace it. But we do need to make it so it can handle multiple different columns with different data types.
The plan would be to look at the schema of the DataFrame passed in, along with hints that specific string columns should be introspected, for example as JSON. Ideally these hints should be generic enough that other introspection formats could be added in the future, like CSV, date, timestamp, or XML, and that we could provide multiple introspection hints for a single string column.
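As a sketch of what such generic hints could look like (all of these names are hypothetical; today only the JSON introspection from #11060 exists):

```scala
// Hypothetical hint types for column introspection.
sealed trait ColumnHint
case object JsonHint extends ColumnHint
case object CsvHint extends ColumnHint
case class TimestampHint(format: String) extends ColumnHint

// Multiple hints can be attached to a single top level string column.
val hints: Map[String, Seq[ColumnHint]] = Map(
  "payload"    -> Seq(JsonHint),
  "event_time" -> Seq(TimestampHint("yyyy-MM-dd HH:mm:ss")))
```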
The current code uses a UDF that takes a single string input and returns an array of tokenized paths, types, lengths and values. This is then exploded out and aggregations are calculated grouping by the paths.
To extend this we would still have a UDF that is called for each top level column, but it would take in an AnyRef for the column data itself, along with some representation of the data type for this column and any string hints that are needed for the child columns.
It would do the conversion to an array of `{path, data type, length, value}`, except here `value` would be a struct with 4 type-specific sub-columns.

Note that lists, maps, etc. do not fit into this. The leaf nodes would be converted to fit into one of these columns, while the non-leaf nodes would return a null for each value in the struct, but a non-null struct. However, any null input value, even if it is for a non-leaf node, would return a null struct instead of a struct with null values.
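A sketch of the per-entry record this UDF could emit. The description above does not pin down which 4 type-specific sub-columns go in the struct, so long/double/string/boolean is just a guess, and every name here is hypothetical:

```scala
// Hypothetical value struct with 4 type-specific sub-columns.
case class TypedValue(
    longValue: java.lang.Long,
    doubleValue: java.lang.Double,
    stringValue: String,
    boolValue: java.lang.Boolean)

// Hypothetical record produced for each node found in a top level column.
case class FingerprintEntry(
    path: String,               // tokenized path to the node
    dataType: String,           // data type observed at that path
    length: java.lang.Integer,  // length where it applies, null otherwise
    value: TypedValue)          // non-null with all-null fields for non-leaf nodes,
                                // null when the input value itself was null
```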
The rest of the aggregations should be very similar to what they are today, except that the count should be a count on `value` instead of a count of 1.
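A rough sketch of that explode-and-aggregate step, assuming the UDF output lands in a hypothetical `entries` column:

```scala
import org.apache.spark.sql.functions._

// Explode the array of fingerprint entries into one row per node.
val exploded = df
  .select(explode(col("entries")).as("e"))
  .select(col("e.path"), col("e.dataType"), col("e.length"), col("e.value"))

// Aggregate per path, counting value (non-null) rather than counting 1.
val perPath = exploded
  .groupBy(col("path"), col("dataType"))
  .agg(
    count(col("value")).as("non_null_count"),
    count(lit(1)).as("total_count"),
    max(col("length")).as("max_length"))
```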
I am not sure we want to have the version number as a column in the data file. It might be better to just have it be metadata in the file, or something in the name of the path it is stored at. Not sure, but it does not matter much.
Reading the fingerprint back in and using it to generate data should be very similar to how it is done for JSON, except that we would have to create the table itself instead of just taking in a pre-made table column.
Ideally, in the future we would have a way to fingerprint multiple tables and point out relationships between columns in those tables. Essentially, things like `a.foo` and `b.bar` are joined together, so we should calculate the overlap between the two columns. But that is something that should be left to later. The only reason I mention it is that it might be nice to have the APIs set up so that we store fingerprints for multiple tables under a single directory. That way we can have a table name stored with each fingerprint, possibly as a separate column.

Note that we want to keep the fingerprint anonymization and some kind of command line tool to make running this simpler. These should also be documented. I didn't document the original fingerprint tool because I knew it was not done and still very much a work in progress.