datacoon / undatum

undatum: a command-line tool for data processing. Brings CSV simplicity to JSON lines and BSON
MIT License

Parquet compression #20

Open chapmanjacobd opened 2 years ago

chapmanjacobd commented 2 years ago

It would be nice to have options for compression. Looks like there is no compression by default?

parq RS_2008-04.parquet 

 # Metadata 
 <pyarrow._parquet.FileMetaData object at 0x7f5d6f635490>
  created_by: parquet-cpp-arrow version 7.0.0
  num_columns: 124
  num_rows: 167472
  num_row_groups: 1
  format_version: 1.0
  serialized_size: 53334

ivbeg commented 2 years ago

Actually, it uses snappy compression by default: the conversion goes through a pandas DataFrame, and pandas writes Parquet with snappy compression unless told otherwise. I don't know why the parq tool doesn't show it. I will add a compression option too.
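A likely reason the parq output does not show it: the file-level metadata printed there has no codec field, because Parquet records the codec per column chunk inside each row group. The following is a minimal sketch of how one might check it with pyarrow; the helper names `codecs_in_use` and `codecs_for_file` are hypothetical and not part of undatum or parq.

```python
def codecs_in_use(metadata):
    """Collect the compression codec names recorded in a Parquet file.

    `metadata` is expected to be a pyarrow.parquet.FileMetaData object.
    The codec is stored per column chunk, so every row group is visited.
    """
    codecs = set()
    for rg_index in range(metadata.num_row_groups):
        row_group = metadata.row_group(rg_index)
        for col_index in range(row_group.num_columns):
            codecs.add(row_group.column(col_index).compression)
    return codecs


def codecs_for_file(path):
    """Read only the footer of the file at `path` and report its codecs."""
    # Imported lazily so codecs_in_use stays importable without pyarrow.
    import pyarrow.parquet as pq

    return codecs_in_use(pq.read_metadata(path))
```

Calling `codecs_for_file("RS_2008-04.parquet")` on the file from the report should then list the codecs actually used, which would confirm or refute the snappy default.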

ivbeg commented 2 years ago

@chapmanjacobd I've added a compression option to the latest code in the main branch. Example usage:

It supports the following compression codecs: brotli, snappy, lzo, gzip, None
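For readers who want the same effect directly in pandas, here is a rough sketch of how such an option could be passed through to `DataFrame.to_parquet`, whose `compression` parameter defaults to `'snappy'`. This is an assumption about the mechanism, not undatum's actual code; `normalize_codec` and `write_parquet` are hypothetical names, and whether a given codec (lzo in particular) can be written depends on the Parquet engine installed.

```python
# Codec names as listed in this thread; the mapping below is an assumption.
THREAD_CODECS = ("brotli", "snappy", "lzo", "gzip")


def normalize_codec(name):
    """Map a user-supplied codec name to the value pandas expects.

    The string "None" (any case) and the Python value None both mean
    "write without compression".
    """
    if name is None or str(name).lower() == "none":
        return None
    codec = str(name).lower()
    if codec not in THREAD_CODECS:
        raise ValueError(f"unsupported compression codec: {name!r}")
    return codec


def write_parquet(df, path, codec="snappy"):
    """Write a pandas DataFrame to `path` using the requested codec."""
    df.to_parquet(path, compression=normalize_codec(codec))
```

For example, `write_parquet(df, "out.parquet", codec="gzip")` would request gzip, and `codec="None"` would request an uncompressed file.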