[DISCUSSION] Project Goal

wgtmac commented 5 hours ago

I'd like to create this very first issue to collect ideas from people who have an interest. Below is what I have in mind:

Platform: Linux, macOS, Windows.
Compilers: Clang, GCC, MSVC.
Build: CMake.
C++ standard: C++20
Dependencies: Arrow, Avro, ORC, simdjson, etc.
Coding style: Follow what Apache Arrow C++ does: https://arrow.apache.org/docs/developers/cpp/development.html#code-style-linting-and-ci
Features: I'd like to say all of them. But to be realistic, we need to break down the work items and define the API first. I think at least the following categories are required:
- data type and schema (arrow::Schema and its extension type?)
- metadata object (table, partition spec, manifest, file, etc)
- data representation: row-wise (Avro record?) and columnar (arrow::RecordBatch?)
- expression (arrow::compute::Expression?)
- I/O (leverage arrow::fs::FileSystem?)
- reader/writer: a common abstraction over parquet/orc/avro
- catalog
- ...

wgtmac commented 5 hours ago

I have made a bold suggestion: have the type system directly leverage Arrow C++, to avoid reinventing the wheel and to benefit from RecordBatch, Expression, and other facilities. I saw that iceberg-rust and iceberg-go have implemented their own data types. Is there anything in the Iceberg type system that the Arrow type system is unable to handle? @Xuanwo @zeroshade

zeroshade commented 5 hours ago

The biggest drawback to just using the Arrow C++ type system directly is that the mappings aren't perfect for Iceberg.

Iceberg only has 32-bit and 64-bit signed integers (int and long), while Arrow has Int8/16/32/64 and UInt8/16/32/64. The same goes for all of the other types that exist in Arrow but don't exist in Iceberg (such as the Large* variants, run-end encoded (REE) arrays, and so on). Another issue is how the Time and Timestamp types are handled: Iceberg fixes the unit at microseconds, while Arrow parameterizes these types by unit. For the most part you can see the logic needed for converting between the Iceberg and Arrow type systems here
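
One way to picture the integer mismatch described above is a small lookup from Arrow-style integer kinds to Iceberg's two integer types. This is only a minimal sketch, not Arrow or iceberg-go code: ArrowKind, IcebergKind and ToIceberg are hypothetical names, and the widening policy (unsigned 32-bit promoted to long, unsigned 64-bit rejected) is an assumption about what a conversion layer might choose.

```cpp
#include <optional>

// Hypothetical stand-in for the integer subset of Arrow's type ids.
enum class ArrowKind { Int8, Int16, Int32, Int64, UInt8, UInt16, UInt32, UInt64 };

// Iceberg's two integer types: `int` (32-bit signed) and `long` (64-bit signed).
enum class IcebergKind { Int, Long };

// Pick the narrowest Iceberg integer that can hold every value of the Arrow
// kind. Returns std::nullopt when no Iceberg integer is wide enough.
// The policy below is an assumption made for illustration.
constexpr std::optional<IcebergKind> ToIceberg(ArrowKind kind) {
  switch (kind) {
    case ArrowKind::Int8:
    case ArrowKind::Int16:
    case ArrowKind::Int32:
    case ArrowKind::UInt8:
    case ArrowKind::UInt16:
      return IcebergKind::Int;  // every value fits in a signed 32-bit int
    case ArrowKind::Int64:
    case ArrowKind::UInt32:
      return IcebergKind::Long;  // UInt32 can exceed the signed 32-bit range
    case ArrowKind::UInt64:
      return std::nullopt;  // can exceed the signed 64-bit range
  }
  return std::nullopt;
}
```

The mapping is many-to-one, so going back from Iceberg to Arrow cannot recover the original width: a column written from Int8 would come back as a 32-bit integer under this policy.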

The differences in the types mean that even if you reuse the types from Arrow, you're still going to eventually have to perform a conversion and implement this logic when it comes to reading/writing data and converting it to Arrow. This is why I provided functions to convert an Arrow Schema to Iceberg and vice versa in the iceberg-go library. Reading data still returns a stream of Arrow record batches, and when I implement writing, it'll accept a stream of Arrow record batches to write.
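
To illustrate the kind of value-level conversion that remains even when Arrow types are reused, here is a rough sketch of normalizing a timestamp in an arbitrary unit to microseconds, assuming the microsecond precision that the Iceberg spec gives its timestamp type. TimeUnit, ScaleUp and ToMicros are hypothetical names written for this example (TimeUnit only imitates Arrow's unit parameter); the overflow and rounding behavior are assumptions, not what any Iceberg library is known to do.

```cpp
#include <cstdint>
#include <limits>
#include <optional>

// Hypothetical stand-in for the unit parameter of an Arrow timestamp type.
enum class TimeUnit { Second, Milli, Micro, Nano };

// Multiply by a positive factor, returning std::nullopt instead of overflowing.
std::optional<std::int64_t> ScaleUp(std::int64_t value, std::int64_t factor) {
  constexpr auto kMax = std::numeric_limits<std::int64_t>::max();
  constexpr auto kMin = std::numeric_limits<std::int64_t>::min();
  if (value > kMax / factor || value < kMin / factor) return std::nullopt;
  return value * factor;
}

// Convert a timestamp expressed in `unit` to microseconds.
// Nanosecond input is rounded toward negative infinity, which is lossy.
std::optional<std::int64_t> ToMicros(std::int64_t value, TimeUnit unit) {
  switch (unit) {
    case TimeUnit::Second:
      return ScaleUp(value, 1'000'000);
    case TimeUnit::Milli:
      return ScaleUp(value, 1'000);
    case TimeUnit::Micro:
      return value;
    case TimeUnit::Nano: {
      std::int64_t q = value / 1'000;  // C++ division truncates toward zero
      if (value % 1'000 < 0) --q;      // adjust to floor for negative values
      return q;
    }
  }
  return std::nullopt;
}
```

Coarser units can overflow when scaled up and nanoseconds lose precision when scaled down, so a real conversion layer would have to decide whether to reject, truncate, or report such cases.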

It's not that there are specific issues the Arrow type system can't deal with; it's more that the Arrow type system has significantly more types and flexibility than the Iceberg type system does.

apache / iceberg-cpp
