revans2 opened this issue 3 months ago
@Feng-Jiang28 you are 100% correct that this does not do anything about correlation between columns either in a data set or between data sets. It also assumes that all of the data has a normal distribution, which is not always accurate. I started out with features that would help us with testing JSON, and get_json_object specifically. I also wanted to gather statistics that could be used directly with the datagen tool.
If you feel that there isn't enough value in the current set of statistics, then we can expand the scope to make it more useful/representative, but we have to balance that against the time it takes to calculate those statistics. Also, the stats will never be perfect, so I really want to start with a minimum viable product and then expand on it from there.
Do we need to replicate all joins and aggregates accurately? What types of joins? Are we only concerned about equality joins, or do we need to support other join types too? Do we need a way to map scalar values in a query to corresponding scalar values in the generated data? Do we need to detect if/how the data is laid out on disk? Is it partitioned or sorted? What about strings with specific formats? We have some things for JSON now, but what about dates/timestamps?
Equality seems important to me, and I would be happy to have us add features to make sure that we can reproduce equality joins and aggregates at least somewhat accurately.
The datagen tool can deal with equality joins/aggregates by limiting the range and distribution of seeds that a column, or set of columns, use to generate data. https://github.com/NVIDIA/spark-rapids/blob/branch-24.10/datagen/README.md
Typically I will manually look at the columns/rows going into a join and the number of rows coming out. We can then set up a distribution, typically flat, and key ranges so that each of the keys on both sides of the join has a fairly common number of duplicates, matching the output we saw. But that is not really the best way to do it. It would be much better to take the input values and match their distribution to one that we can support, along with statistics on how to represent that distribution: is it flat, distinct, normal, exponential, or something else? I am no stats expert, but that can get rather expensive because it would require at least two passes over the data: one to calculate the mean and standard deviation, and another to test whether the distribution matches what we calculated. We could probably get a distinct distribution from what we have today, because we are already doing an approximate count distinct on the keys, and that would not be too difficult.
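Just to make the manual process concrete, here is a rough sketch of the kind of numbers I look at, using plain Spark APIs. This is not part of any tool; the paths, table, and `join_key` column name are made up.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().getOrCreate()

// Hypothetical input tables for the join being inspected.
val left = spark.read.parquet("/data/left")
val right = spark.read.parquet("/data/right")

// Rows and approximate distinct keys going into each side of the join.
left.agg(count(lit(1)).as("rows"),
  approx_count_distinct(col("join_key")).as("approx_distinct_keys")).show()
right.agg(count(lit(1)).as("rows"),
  approx_count_distinct(col("join_key")).as("approx_distinct_keys")).show()

// Rows coming out, which gives a rough average number of duplicates per key.
println(left.join(right, Seq("join_key")).count())
```

From those numbers we can pick a key range and a (typically flat) distribution in the datagen tool so that the duplicate counts roughly match what we observed.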
If we do have a mean + stddev + distribution it can work for numeric values, but I don't know how to do that with a string. What is the mean of a string? What is the stddev of a string? We could do it on a hash of the string, but that is far from a great solution. I also don't know if we need to obfuscate the mean and stddev of numeric values, in case they might be sensitive for some customers.
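As a hedged sketch of the hash-based workaround, using Spark's built-in hash functions (the DataFrame `df` and the column name are made up):

```scala
import org.apache.spark.sql.functions._

// One possible, if imperfect, way to get a "mean" and "stddev" for a string
// column: hash each string to a 64-bit integer first.
df.agg(
  mean(xxhash64(col("name"))).as("hashed_mean"),
  stddev(xxhash64(col("name"))).as("hashed_stddev")).show()
```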
(Reposting the deleted comment) This is a really interesting feature. With this tool we could probably reproduce tables that resemble those in a production environment and debug locally instead of running the plugin in production. I know there are tools like pandas-profiling that can be used to generate comprehensive profiling data about a DataFrame or a dataset, but I don't think they can generate dummy tables (not sure). If the tool is mainly about reproducing issues, it is worth mentioning that there may be a lot of features of the dataset to consider in order to reproduce something close to the production dataset. When joining, for example, payload columns, distribution type, Zipf factor, selectivity, and so on all play important roles in performance. As we develop the tool, we need to decide which features of the input we need.
This draft guideline outlines data collection procedures for EDA (Exploratory Data Analysis). The guideline aims to facilitate the development of a fingerprinting tool capable of collecting fingerprint data, which datagen can then use to generate data.
The Pandas Profiling tool supports 9 data types. I will describe 3 representative data types here.
**Numeric** (examples: Age, Balance, Duration)

| Data | Explanation |
|---|---|
| **Central Tendency** | |
| Mean | Average of the values. |
| Median | Middle value when sorted. |
| Mode | Most frequently occurring value. |
| **Range** | |
| Minimum | Smallest value. |
| Maximum | Largest value. |
| **Dispersion** | |
| Standard Deviation | Average distance from the mean. |
| Variance | Squared standard deviation. |
| **Quantiles** | |
| 5th Percentile | Value below which 5% of the data falls. |
| Q1 (25th Percentile) | Lower quartile. |
| Q3 (75th Percentile) | Upper quartile. |
| 95th Percentile | Value below which 95% of the data falls. |
| **Distribution Shape** | |
| Skewness | Asymmetry of the distribution. |
| Kurtosis | Peakedness of the distribution. |
| **Special Cases** | |
| Zeros Count | Number of zero values. |
| Negative Count | Number of negative values. |
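For reference, most of the numeric statistics above map onto built-in Spark SQL aggregates. A rough sketch of how a fingerprinting pass might collect them in a single aggregation (the DataFrame `df` and the column name `balance` are made up):

```scala
import org.apache.spark.sql.functions._

val numericStats = df.agg(
  mean(col("balance")).as("mean"),
  expr("percentile_approx(balance, 0.5)").as("median"),
  min(col("balance")).as("min"),
  max(col("balance")).as("max"),
  stddev(col("balance")).as("stddev"),
  variance(col("balance")).as("variance"),
  expr("percentile_approx(balance, array(0.05, 0.25, 0.75, 0.95))").as("quantiles"),
  skewness(col("balance")).as("skewness"),
  kurtosis(col("balance")).as("kurtosis"),
  count(when(col("balance") === 0, 1)).as("zeros_count"),
  count(when(col("balance") < 0, 1)).as("negative_count"))
// Mode is not a single built-in aggregate in older Spark versions; it would need a
// separate groupBy/count, or the mode() aggregate in newer Spark releases.
numericStats.show()
```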
**Categorical / String** (examples: Job, Marital Status, Education)

| Data | Explanation |
|---|---|
| **Length** | |
| Max length | The longest string in the dataset. |
| Median length | The middle value of the string lengths. |
| Mean length | The average length of the strings. |
| Min length | The shortest string in the dataset. |
| **Characters and Unicode** | |
| Total characters | Total number of characters. |
| Distinct characters | Unique characters used in the dataset. |
| Distinct categories | Number of distinct Unicode General_Category values. |
| Distinct scripts | Number of distinct Unicode scripts (a script is one writing system, like Latin). |
| Distinct blocks | Number of distinct Unicode blocks (Unicode_block). |
| **Unique** | |
| Unique | Number of unique entries in the dataset. |
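A similar sketch for the string statistics above, again with a made-up DataFrame `df` and column name `job`:

```scala
import org.apache.spark.sql.functions._

// Length and uniqueness statistics for a string column.
val withLen = df.select(col("job"), length(col("job")).as("len"))
val stringStats = withLen.agg(
  max(col("len")).as("max_length"),
  expr("percentile_approx(len, 0.5)").as("median_length"),
  avg(col("len")).as("mean_length"),
  min(col("len")).as("min_length"),
  sum(col("len")).as("total_characters"),
  approx_count_distinct(col("job")).as("approx_unique"))

// Distinct characters, by exploding each string into its characters.
val distinctChars = df
  .select(explode(split(col("job"), "")).as("ch"))
  .agg(approx_count_distinct(col("ch")).as("distinct_characters"))
```

The Unicode category, script, and block counts would probably need a small UDF built on `java.lang.Character.getType`, `Character.UnicodeScript.of`, and `Character.UnicodeBlock.of`.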
**Boolean**

| Name | Explanation |
|---|---|
| Distinct | Number of unique Boolean values. |
| Distinct (%) | Percentage of distinct values compared to the total number of entries. |
| Missing | Count of missing or null entries in the column. |
| Missing (%) | Percentage of entries that are missing. |
| Count | Total count of each Boolean value in the column. |
| Frequency (%) | Percentage of each Boolean value. |
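And a short sketch for the Boolean statistics (the DataFrame `df` and the column name are made up):

```scala
import org.apache.spark.sql.functions._

// Per-value counts and frequencies for a Boolean column.
val total = df.count()
df.groupBy(col("default"))
  .agg(count(lit(1)).as("count"))
  .withColumn("frequency_pct", col("count") * 100.0 / total)
  .show()

// Distinct and missing counts.
df.agg(
  approx_count_distinct(col("default")).as("distinct"),
  count(when(col("default").isNull, 1)).as("missing")).show()
```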
**Join characteristics**

| Name | Explanation |
|---|---|
| Relation Size | Sizes of the R and S relations (tables). |
| Payload | Number of payload columns in the relations. |
| Join Type | Primary key-foreign key (pkfk) or foreign key-foreign key (fkfk). |
| Distribution | Type of data distribution (Uniform or Zipf). |
| Selectivity | Selectivity of the join (only valid for pkfk). |
| Unique Keys | Number of unique keys (only valid for fkfk). |
| Zipf Factor | Zipf factor for the data distribution. |
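One possible shape for recording the join-related features above, purely as a sketch; none of these names exist in the tool today:

```scala
// Hypothetical record for join-related fingerprint data.
case class JoinFingerprint(
    leftRows: Long,               // size of relation R
    rightRows: Long,              // size of relation S
    payloadColumns: Int,          // number of payload columns
    joinType: String,             // "pkfk" or "fkfk"
    distribution: String,         // "uniform" or "zipf"
    selectivity: Option[Double],  // only meaningful for pkfk joins
    uniqueKeys: Option[Long],     // only meaningful for fkfk joins
    zipfFactor: Option[Double])   // only meaningful for the Zipf distribution
```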
**Is your feature request related to a problem? Please describe.**
https://github.com/NVIDIA/spark-rapids/pull/11060 added in an ability to do some basic fingerprinting of JSON columns. But it is rough and needs to be cleaned up. Also it would be great if we could extend these concepts to more than just JSON columns so we could fingerprint an entire table with a single command.
**Describe the solution you'd like**
First I want us to update/refactor the fingerprint code added in #11060 so it is in a separate file and it is not JSON specific.
The core ideas behind it are good, but it needs to be extended a little. We still need to calculate a few statistics about each column/child column.
Along with all of this we need a "path". The path is supposed to provide enough information that we can reproduce the original input schema along with the same kind of information used for JSON generation so we can generate JSON data with conflicting types, like the current JSON fingerprinting tool does today.
All of this would be stored in a parquet table so we can read it back in and generate representative data.
The base framework for the JSON processing is decent. It should be able to scale to very large numbers of columns and values, so I don't see a lot of need to completely replace it. But we do need to make it so it can handle multiple different columns with different data types.
The plan would be to look at the schema of the DataFrame passed in, along with hints that specific string columns should be introspected, for example as JSON. Ideally these hints should be generic enough that other introspection formats could be added in the future, like CSV, date, timestamp, or XML, and that we could provide multiple introspection hints for a single string column.
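As a sketch of what such generic hints could look like (all of these names are hypothetical; today only the JSON introspection from #11060 exists):

```scala
// Hypothetical hint types for column introspection.
sealed trait ColumnHint
case object JsonHint extends ColumnHint
case object CsvHint extends ColumnHint
case class TimestampHint(format: String) extends ColumnHint

// Multiple hints can be attached to a single top level string column.
val hints: Map[String, Seq[ColumnHint]] = Map(
  "payload"    -> Seq(JsonHint),
  "event_time" -> Seq(TimestampHint("yyyy-MM-dd HH:mm:ss")))
```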
The current code uses a UDF that takes a single string input and returns an array of tokenized paths, types, lengths and values. This is then exploded out and aggregations are calculated grouping by the paths.
To extend this we would still have a UDF that is called for each top level column, but it would take in an AnyRef for the column data itself, along with some representation of the data type for this column and any string hints that are needed for the child columns.
It would do the conversion to an array of `{path, data type, length, value}`, except here `value` would be a struct with 4 type-specific sub-columns.

Note that lists, maps, etc. do not fit into this. The leaf nodes would be converted to fit into one of these columns, while the non-leaf nodes would return a null for each value in the struct, but a non-null struct. However, any null input value, even if it is for a non-leaf node, would return a null struct instead of a struct with null values.
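A sketch of the per-entry record this UDF could emit. The description above does not pin down which 4 type-specific sub-columns go in the struct, so long/double/string/boolean is just a guess, and every name here is hypothetical:

```scala
// Hypothetical value struct with 4 type-specific sub-columns.
case class TypedValue(
    longValue: java.lang.Long,
    doubleValue: java.lang.Double,
    stringValue: String,
    boolValue: java.lang.Boolean)

// Hypothetical record produced for each node found in a top level column.
case class FingerprintEntry(
    path: String,               // tokenized path to the node
    dataType: String,           // data type observed at that path
    length: java.lang.Integer,  // length where it applies, null otherwise
    value: TypedValue)          // non-null with all-null fields for non-leaf nodes,
                                // null when the input value itself was null
```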
The rest of the aggregations should be very similar to what they are today, except that the count should be a count on `value` instead of a count of 1.
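A rough sketch of that explode-and-aggregate step, assuming the UDF output lands in a hypothetical `entries` column:

```scala
import org.apache.spark.sql.functions._

// Explode the array of fingerprint entries into one row per node.
val exploded = df
  .select(explode(col("entries")).as("e"))
  .select(col("e.path"), col("e.dataType"), col("e.length"), col("e.value"))

// Aggregate per path, counting value (non-null) rather than counting 1.
val perPath = exploded
  .groupBy(col("path"), col("dataType"))
  .agg(
    count(col("value")).as("non_null_count"),
    count(lit(1)).as("total_count"),
    max(col("length")).as("max_length"))
```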
I am not sure we want to have the version number as a column in the data file. It might be better to just have it be metadata in the file, or something in the name of the path it is stored at. Not sure, but it does not matter much.
Reading the fingerprint back in and using it to generate data should be very similar to how it is done for JSON, except that we would have to create the table itself instead of just taking in a pre-made table column.
Ideally, in the future we would have a way to fingerprint multiple tables and point out relationships between columns in those tables. Essentially, things like `a.foo` and `b.bar` are joined together, so we should calculate the overlap between the two columns. But that is something that should be left to later. The only reason I mention it is that it might be nice to have the APIs set up so that we store fingerprints for multiple tables under a single directory. That way we can have a table name stored with each fingerprint, possibly as a separate column.

Note that we want to keep the fingerprint anonymization and some kind of command line tool to make running this simpler. These should also be documented. I didn't document the original fingerprint tool because I knew it was not done and still very much a work in progress.