TatianaJin / whippet_sort

Apache License 2.0

[Feat] Data Generator #2

Open CaiusDai opened 9 months ago

CaiusDai commented 9 months ago

What is this issue about

A data generator that can generate configurable parquet files is needed for later benchmarks.

Goals

Ideally, the generator will provide two functionalities:

  1. Generate logical data. This part focuses on generating the actual data content for the parquet file, including the schema, data size, and value distribution.
  2. Generate the physical parquet file. This part focuses on transforming the logical data into a parquet file and deciding the data's physical layout, including encoding and compression methods.

For more flexibility, the generator is designed to also accept a JSON configuration file (for complex or repeatable data generation).
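
As a rough illustration of what such a JSON configuration might look like, here is a minimal sketch that parses and sanity-checks one. The field names (`num_rows`, `columns`, `null_fraction`, and so on) and the `load_config` helper are assumptions made for this example, not a format the project has settled on.

```python
import json

# Hypothetical config layout; every field name here is illustrative only.
EXAMPLE_CONFIG = """
{
  "num_rows": 1000,
  "columns": [
    {"name": "id", "type": "int64", "distribution": "uniform",
     "cardinality": 1000, "null_fraction": 0.0,
     "encoding": "PLAIN", "compression": "SNAPPY"},
    {"name": "city", "type": "string", "distribution": "zipf",
     "cardinality": 50, "null_fraction": 0.1,
     "encoding": "RLE_DICTIONARY", "compression": "ZSTD"}
  ]
}
"""

REQUIRED_COLUMN_KEYS = {"name", "type"}


def load_config(text):
    """Parse a JSON config string and check the fields this sketch relies on."""
    config = json.loads(text)
    if config.get("num_rows", 0) <= 0:
        raise ValueError("num_rows must be a positive integer")
    for column in config.get("columns", []):
        missing = REQUIRED_COLUMN_KEYS - column.keys()
        if missing:
            raise ValueError(f"column is missing keys: {sorted(missing)}")
        fraction = column.get("null_fraction", 0.0)
        if not 0.0 <= fraction <= 1.0:
            raise ValueError("null_fraction must be in [0, 1]")
    return config


config = load_config(EXAMPLE_CONFIG)
```

Keeping the logical fields (type, distribution, cardinality, nulls) and the physical fields (encoding, compression) in the same per-column entry is only one possible layout; splitting them into two sections would work equally well.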

Generator Configurations

Configurable factors for Logical Data Generator

  1. Number of columns and number of rows.
  2. Data type for each column.
  3. Data distribution for each column.
  4. Cardinality for each column.
  5. Null value frequency.
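
To make the factors above concrete, here is a minimal sketch of generating one logical column using only the Python standard library. The function name `generate_column`, its parameters, and the two supported distributions are assumptions for illustration; the actual generator may be structured quite differently.

```python
import random


def generate_column(num_rows, dtype, cardinality, distribution="uniform",
                    null_fraction=0.0, seed=0):
    """Return a list of num_rows values (None marks a null).

    Hypothetical helper: cardinality is the size of the pool of distinct
    values to draw from, so the output has at most that many distinct values.
    """
    rng = random.Random(seed)

    # Build the pool of distinct values for the requested type.
    if dtype == "int64":
        pool = list(range(cardinality))
    elif dtype == "string":
        pool = [f"v{i}" for i in range(cardinality)]
    else:
        raise ValueError(f"unsupported type: {dtype}")

    # Choose how often each distinct value occurs.
    if distribution == "uniform":
        weights = [1.0] * cardinality
    elif distribution == "zipf":
        # Weight proportional to 1/rank: a few values dominate.
        weights = [1.0 / (rank + 1) for rank in range(cardinality)]
    else:
        raise ValueError(f"unsupported distribution: {distribution}")

    values = rng.choices(pool, weights=weights, k=num_rows)

    # Replace each value with a null with probability null_fraction.
    return [None if rng.random() < null_fraction else v for v in values]


column = generate_column(1000, "string", cardinality=10,
                         distribution="zipf", null_fraction=0.2, seed=1)
```

Note that this sketch treats cardinality and the occurrence distribution as two separate knobs: the first fixes how many distinct values exist, the second fixes how often each one is drawn.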

Configurable factors for Physical Data Generator

  1. Encoding method for each column.
  2. Compression method for each column.
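
One possible way to carry the per-column physical settings through to a writer is sketched below. It assumes the file would be written with pyarrow, whose `pyarrow.parquet.write_table` accepts per-column `compression`, `use_dictionary`, and `column_encoding` arguments; the issue does not say which library will be used, and `build_write_options` along with the column-spec keys are hypothetical names.

```python
def build_write_options(columns):
    """Translate hypothetical per-column specs into writer keyword arguments.

    Each spec is a dict with "name" and optional "encoding" / "compression".
    Dictionary encoding is requested through use_dictionary, while other
    encodings go through column_encoding, because pyarrow only allows
    column_encoding for columns that do not use dictionary encoding.
    """
    compression = {}
    dictionary_columns = []
    column_encoding = {}

    for col in columns:
        name = col["name"]
        compression[name] = col.get("compression", "NONE")
        encoding = col.get("encoding", "PLAIN")
        if encoding == "RLE_DICTIONARY":
            dictionary_columns.append(name)
        else:
            column_encoding[name] = encoding

    options = {"compression": compression,
               "use_dictionary": dictionary_columns}
    if column_encoding:
        options["column_encoding"] = column_encoding
    return options


options = build_write_options([
    {"name": "id", "encoding": "PLAIN", "compression": "SNAPPY"},
    {"name": "city", "encoding": "RLE_DICTIONARY", "compression": "ZSTD"},
])

# With pyarrow installed, the options could be passed on roughly like this:
#   import pyarrow as pa
#   import pyarrow.parquet as pq
#   pq.write_table(pa.table(data), "out.parquet", **options)
```

Not every encoding is valid for every data type, so a real generator would likely need to validate the encoding against the column type before writing.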

CaiusDai commented 9 months ago

Please let me know if any configurable factor is not reasonable or more factors are needed. I will work on the logical data generator first.

TatianaJin commented 9 months ago

Data distribution for each column: number of distinct values & value occurrence distribution?