apache / arrow

Apache Arrow is the universal columnar format and multi-language toolbox for fast data interchange and in-memory analytics
https://arrow.apache.org/
Apache License 2.0

[C++][Parquet] Deprecate column scanner? #44075

Open pitrou opened 2 months ago

pitrou commented 2 months ago

Describe the enhancement requested

The Scanner and TypedScanner classes are used nowhere in the codebase except in ParquetFilePrinter::DebugPrint. They also have not received any significant maintenance in the last 5 years.

Should we deprecate them? The printing functionality can be folded back into ParquetFilePrinter (or perhaps moved into a new ParquetColumnPrinter).

Component(s)

C++, Parquet

pitrou commented 2 months ago

@mapleFU @wgtmac WDYT?

mapleFU commented 2 months ago

I'm OK with deprecating it.

wgtmac commented 2 months ago

+1

mapleFU commented 2 months ago

IMO, the column scanner currently:

  1. Lacks the type mapping
  2. Is inefficient
  3. Cannot get the info in Arrow

Also, type coercion in the column scanner was never well-defined. As we support more and more LogicalTypes, it will become more and more useless.

arsnyder16 commented 1 month ago

@mapleFU @pitrou I am currently using the TypedScanner for reading specific columns from Parquet files into my own columnar in-memory format.

Would I just need to use TypedColumnReader directly with my own batch reading logic?

pitrou commented 1 month ago

@arsnyder16 Yes, you would.
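
For illustration, here is a minimal sketch of what "your own batch reading logic" might look like. It is not taken from the Arrow codebase. The helper `ReadAllValues` and the stand-in `FakeInt64Reader` are hypothetical names introduced here; the loop only assumes a reader exposing `HasNext()` and `ReadBatch(batch_size, def_levels, rep_levels, values, values_read)`, which is the shape `parquet::TypedColumnReader` provides. It also assumes a REQUIRED, non-repeated column of a fixed-width type, so no definition or repetition levels are requested.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical helper: drains a column reader in fixed-size batches.
// `Reader` is assumed to expose HasNext() and
// ReadBatch(batch_size, def_levels, rep_levels, values, values_read).
// Only REQUIRED, non-repeated columns are handled, so both level
// pointers are passed as nullptr.
template <typename T, typename Reader>
std::vector<T> ReadAllValues(Reader& reader, int64_t batch_size) {
  assert(batch_size > 0);
  std::vector<T> out;
  std::vector<T> buffer(static_cast<std::size_t>(batch_size));
  while (reader.HasNext()) {
    int64_t values_read = 0;
    reader.ReadBatch(batch_size, /*def_levels=*/nullptr,
                     /*rep_levels=*/nullptr, buffer.data(), &values_read);
    // Append only the values actually decoded in this batch.
    out.insert(out.end(), buffer.begin(), buffer.begin() + values_read);
  }
  return out;
}

// Hypothetical in-memory stand-in for a column reader, used only to
// exercise the loop above without linking against Parquet.
class FakeInt64Reader {
 public:
  explicit FakeInt64Reader(std::vector<int64_t> data)
      : data_(std::move(data)) {}

  bool HasNext() const { return pos_ < data_.size(); }

  int64_t ReadBatch(int64_t batch_size, int16_t* /*def_levels*/,
                    int16_t* /*rep_levels*/, int64_t* values,
                    int64_t* values_read) {
    const std::size_t n = std::min(static_cast<std::size_t>(batch_size),
                                   data_.size() - pos_);
    std::copy_n(data_.begin() + pos_, n, values);
    pos_ += n;
    *values_read = static_cast<int64_t>(n);
    return *values_read;
  }

 private:
  std::vector<int64_t> data_;
  std::size_t pos_ = 0;
};
```

With the real library, the reader would presumably come from something like `file_reader->RowGroup(i)->Column(j)` cast to `parquet::Int64Reader`, and the same loop would apply per row group. Nullable or repeated columns would additionally need definition and repetition level buffers, which this sketch leaves out.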