Closed billpwchan closed 2 years ago
Fixed in e8808cc0f8ae971042eeb44cb6dc7a80cad4bbde
A quick comparison between CSV and Parquet

| CSV | Parquet |
| -- | -- |
| Row-based storage format. | A hybrid of row-based and column-based storage formats. |
| Consumes a lot of space, as no default compression option is available. For example, a 1 TB file will occupy the same space when stored on Amazon S3 or any other cloud. | Compresses data while storing, thus consuming less space. A 1 TB file stored in Parquet format will take up only 130 GB of space. |
| Query run time is slow because of the row-based search. For each column, every row of data has to be retrieved. | Query time is about 34 times faster because of the column-based storage and the presence of metadata. |
| More data has to be scanned per query. | About 99% less data is scanned to execute the query, thus optimizing performance. |
| Most storage services charge based on storage space, so the CSV format means a high storage cost. | Lower storage cost, as data is stored in a compressed, encoded format. |
| File schema has to be either inferred (leading to errors) or supplied (tedious). | File schema is stored in the metadata. |
| Suitable for simple data types. | Suitable even for complex types like nested schemas, arrays, and dictionaries. |
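
To make the "more data scanned per query" row concrete, here is a toy sketch in plain Python. It is not how Parquet is implemented (Parquet groups rows into row groups and stores each column as a chunk inside them, which is why it is described as a hybrid); it only shows why a column-oriented layout touches fewer values when a query needs a single column. The sample data, column names, and function names are all made up for illustration.

```python
import csv
import io

# Hypothetical sample data; the column names are invented for this sketch.
CSV_TEXT = "symbol,price,volume\nAAA,10.5,100\nBBB,20.0,200\nCCC,30.25,300\n"


def sum_column_row_based(text, column):
    """Row-based scan: every field of every row is parsed to reach one column."""
    reader = csv.DictReader(io.StringIO(text))
    total, fields_scanned = 0.0, 0
    for row in reader:
        fields_scanned += len(row)  # the whole row is read, not just one field
        total += float(row[column])
    return total, fields_scanned


def to_columnar(text):
    """Regroup the same records so that each column is stored contiguously."""
    reader = csv.DictReader(io.StringIO(text))
    columns = {name: [] for name in reader.fieldnames}
    for row in reader:
        for name, value in row.items():
            columns[name].append(value)
    return columns


def sum_column_columnar(columns, column):
    """Column-based scan: only the requested column's values are touched."""
    values = columns[column]
    return sum(float(v) for v in values), len(values)


row_total, row_scanned = sum_column_row_based(CSV_TEXT, "price")
col_total, col_scanned = sum_column_columnar(to_columnar(CSV_TEXT), "price")
print(row_total, row_scanned)  # same total, 9 fields scanned
print(col_total, col_scanned)  # same total, 3 values scanned
```

Both scans return the same sum, but the row-based one reads all nine fields while the columnar one reads only the three `price` values. In practice, assuming pandas with a Parquet engine such as pyarrow is installed, `DataFrame.to_parquet(path)` and `pandas.read_parquet(path, columns=[...])` give the same column-selective read against a real Parquet file.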