brimdata / zed

A novel data lake based on super-structured data
https://zed.brimdata.io/
BSD 3-Clause "New" or "Revised" License

Support compression options for Parquet output #4754

Open ethack opened 1 year ago

ethack commented 1 year ago

Similar to how the -zng.compress flag enables compression for the ZNG format, I'd like to request flags that enable and configure the compression for Parquet files.

Ideally, the compression algorithm and compression level would be exposed as flags.

pqarrow.NewFileWriter() is called here with nil for the props *parquet.WriterProperties parameter.

Supported codecs are here.

Relevant docs:

My Go is pretty rusty, but I'm thinking it could be something like this (untested, and reading the values from the CLI flags isn't implemented):

import (
    ...
    "github.com/apache/arrow/go/v12/parquet"
    "github.com/apache/arrow/go/v12/parquet/compress"
    ...
)

func NewWriter(wc io.WriteCloser) *Writer {
    w := arrowio.NewWriter(wc)
    w.NewWriterFunc = func(w io.Writer, s *arrow.Schema) (arrowio.WriteCloser, error) {

        // These should come from CLI flags and fall back to defaults if not set.
        codec := compress.Codecs.Zstd
        level := 9
        // The writer-property constructors live in the parquet package, not pqarrow.
        props := parquet.NewWriterProperties(
                parquet.WithCompression(codec),
                parquet.WithCompressionLevel(level))

        fw, err := pqarrow.NewFileWriter(s, zio.NopCloser(w), props, pqarrow.DefaultWriterProps())
        if err != nil {
            return nil, fmt.Errorf("%w: %s", arrowio.ErrUnsupportedType, err)
        }
        return fw, nil
    }
    return &Writer{w}
}

Side note: I guess these shouldn't be used because they are "internal", but it seems like they would have been useful otherwise.

philrz commented 1 year ago

@ethack: Thanks for the suggestion and the pointers to all the details. The timing is good because the Dev team may soon be circling back to fill in some of the gaps in our Parquet support, so this may be something they could take care of as part of that effort. I'll let you know more once the team has had a chance to discuss.