apache / arrow

Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing
https://arrow.apache.org/
Apache License 2.0
13.88k stars 3.38k forks source link

[Parquet][C++] More elaborate dictionary fallback for Parquet 2.0 #15165

Open mapleFU opened 1 year ago

mapleFU commented 1 year ago

Describe the enhancement requested

In our standard ( https://github.com/apache/parquet-format/blob/master/Encodings.md#dictionary-encoding-plain_dictionary--2-and-rle_dictionary--8 ) , the dict fallback will goes to PLAIN encoding.

But in many implementions, it says that Parquet 2.0 should support fallback to other types, in our code, there is some todo:

    // Only PLAIN encoding is supported for fallback in V1
    // TODO(majetideepak): Use user specified encoding for V2
    if (dictionary_fallback) {
      thrift_encodings.push_back(ToThrift(Encoding::PLAIN));
    }

And, in parquet-mr and arrow-rs, they both support fallback to other types:

  1. parquet-mr: https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/values/factory/DefaultV2ValuesWriterFactory.java
  2. arrow-rs: https://github.com/apache/arrow-rs/blob/master/parquet/src/column/writer/mod.rs#L1028-L1040

So, should we support that?

Component(s)

C++, Parquet

mapleFU commented 1 year ago

@pitrou mind take a look?

pitrou commented 1 year ago

Eek. I opened https://issues.apache.org/jira/browse/PARQUET-2221 for the defective spec.

pitrou commented 1 year ago

It's not obvious to me that this is used for dict fallback. Looking at the dict_fallback() function I don't see any place where the encoder is switched to a non-dict encoder. @sunchao

mapleFU commented 1 year ago

Maybe the code is

    fn try_new(descr: &ColumnDescPtr, props: &WriterProperties) -> Result<Self> {
        let dict_supported = props.dictionary_enabled(descr.path())
            && has_dictionary_support(T::get_physical_type(), props);
        let dict_encoder = dict_supported.then(|| DictEncoder::new(descr.clone()));

        // Set either main encoder or fallback encoder.
        let encoder = get_encoder(
            props
                .encoding(descr.path())
                .unwrap_or_else(|| fallback_encoding(T::get_physical_type(), props)),
        )?;

        let statistics_enabled = props.statistics_enabled(descr.path());

        let bloom_filter = props
            .bloom_filter_properties(descr.path())
            .map(|props| Sbbf::new_with_ndv_fpp(props.ndv, props.fpp))
            .transpose()?;

        Ok(Self {
            encoder,
            dict_encoder,
            descr: descr.clone(),
            num_values: 0,
            statistics_enabled,
            bloom_filter,
            min_value: None,
            max_value: None,
        })
    }

Seems that it creates a encoder when build the writer