Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing
https://arrow.apache.org/
Apache License 2.0

[Java] Port Row Set abstraction from Drill to Arrow #19512

Open. asfimport opened this issue 6 years ago

asfimport commented 6 years ago

Arrow is a great way to exchange data between systems. Somewhere in the process, however, data must be loaded into, and read back out of, the Arrow vectors.

Arrow's vector code started with similar code in Apache Drill. The Drill project created a "Row Set" abstraction that:

Reporter: Paul Rogers

Note: This issue was originally created as ARROW-3164. Please see the migration documentation for further details.
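To make the idea concrete, here is a minimal sketch of the Row Set writer concept in plain Java: callers think in rows, while the writer appends each field to a typed column. This is a hypothetical illustration only; the real Drill/Arrow abstractions write into value vectors and manage buffer allocation and offsets, which Java lists here merely stand in for.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified stand-in for a Row Set writer: row-oriented API on top of
// column-oriented storage. (Illustrative only; not the Arrow vector API.)
class RowSetWriterSketch {
    private final List<Integer> idColumn = new ArrayList<>();
    private final List<String> nameColumn = new ArrayList<>();

    // Append one row; each field lands in its own column.
    void writeRow(int id, String name) {
        idColumn.add(id);
        nameColumn.add(name);
    }

    int rowCount() {
        return idColumn.size();
    }

    // Read one row back out of the columns.
    String rowToString(int index) {
        return idColumn.get(index) + "\t" + nameColumn.get(index);
    }

    public static void main(String[] args) {
        RowSetWriterSketch writer = new RowSetWriterSketch();
        writer.writeRow(1, "alpha");
        writer.writeRow(2, "beta");
        System.out.println(writer.rowCount());     // 2
        System.out.println(writer.rowToString(1)); // 2	beta
    }
}
```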

asfimport commented 6 years ago

Wes McKinney / @wesm: Sounds like a useful initiative. We've already developed some rows-to-columns functionality in C++, and it would be great to expand beyond what we have now, particularly around creating neatly-sized record batches. It would be useful to be able to quickly convert to Protobuf- or Avro-encoded row data, and back.
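The "neatly-sized record batches" point can be sketched as simple chunking: accumulate a column and emit batches of at most a target row count. This is a hypothetical, size-by-row-count illustration; a real implementation would also bound batch memory, not just row count.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative batching: split one column into slices of at most
// batchSize rows, so each downstream "record batch" has a predictable size.
class BatchSliceSketch {
    static List<int[]> toBatches(int[] column, int batchSize) {
        List<int[]> batches = new ArrayList<>();
        for (int start = 0; start < column.length; start += batchSize) {
            int end = Math.min(start + batchSize, column.length);
            batches.add(Arrays.copyOfRange(column, start, end));
        }
        return batches;
    }

    public static void main(String[] args) {
        List<int[]> batches = toBatches(new int[]{1, 2, 3, 4, 5}, 2);
        System.out.println(batches.size());        // 3
        System.out.println(batches.get(2).length); // 1 (last batch is a remainder)
    }
}
```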

One minor point though:

Arrow evolved from Apache Drill.

This isn't quite accurate. Java code from Apache Drill formed the basis for the initial Java codebase in Apache Arrow. I wouldn't say that the project evolved from Apache Drill itself. The project was created by a confluence of open source projects wishing to define an open standard for in-memory columnar data as its first project, with the broader goal of creating reusable libraries for creating database-like systems ("the deconstructed database" we have been calling it). It happened to be that Drill's ValueVectors were already very close to the fully-shredded columnar model that the community desired, and provided a good starting point. The scope of the project has evolved significantly in the meantime.

asfimport commented 6 years ago

Paul Rogers: Hi @wesm, I reworded the passage to say, "Arrow's vector code started with similar code in Apache Drill." The key point is that the vector data structures, and hence the read and write challenges, are similar.

If the goal of Arrow is to provide a toolkit for creating databases, then row-to-column "rotation" is a key ingredient, as is solid memory management. As Drill has found, some operations in a DB are inherently row-based (because databases are designed to deal with rows/objects and their attributes). I'm sure that Dremio (which started with Apache Drill code) has wrestled with similar issues.
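The row-to-column "rotation" mentioned above can be sketched as transposing a list of row tuples into one array per field. The `Row` record and field names here are hypothetical; real code would write each field directly into an Arrow value vector.

```java
import java.util.List;

// Illustrative row-to-column rotation (requires Java 16+ for records).
class RotationSketch {
    // Hypothetical row type with two fields.
    record Row(int id, double score) {}

    // Gather the "id" field of every row into one column.
    static int[] idColumn(List<Row> rows) {
        int[] out = new int[rows.size()];
        for (int i = 0; i < rows.size(); i++) {
            out[i] = rows.get(i).id();
        }
        return out;
    }

    // Gather the "score" field of every row into one column.
    static double[] scoreColumn(List<Row> rows) {
        double[] out = new double[rows.size()];
        for (int i = 0; i < rows.size(); i++) {
            out[i] = rows.get(i).score();
        }
        return out;
    }

    public static void main(String[] args) {
        List<Row> rows = List.of(new Row(1, 0.5), new Row(2, 0.75));
        System.out.println(idColumn(rows)[1]);    // 2
        System.out.println(scoreColumn(rows)[0]); // 0.5
    }
}
```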

Thanks for the heads-up on the evolved project goals. I'm currently in "drinking from the firehose" mode, catching up on the great progress that has been made, such as the continued evolution of the metadata structures from what I saw six months ago.

All that said, it seems reasonable that row-based reading and writing is essential; though we'll want to work out the right set of details for the Arrow context.

For example, one topic sure to come up is the existing Arrow (and Drill) "complex readers and writers." For now, let's simply acknowledge that these abstractions exist, and that they were one of the inspirations for the Row Set abstraction.