Netflix / iceberg

Iceberg is a table format for large, slow-moving tabular data
Apache License 2.0

Vectorized Parquet Read In Spark DataSource #90

Closed mccheah closed 5 years ago

mccheah commented 5 years ago

The Parquet file format reader that is available in core Spark includes a number of optimizations, the main one being vectorized columnar reading. In considering a potential migration from the old Spark readers to Iceberg, one concern is the performance gap that comes from lacking Spark's numerous optimizations in this space.

It is not clear what the best way is to incorporate these optimizations into Iceberg. One option would be to propose moving this code from Spark to parquet-mr. Another would be to invoke Spark's Parquet reader directly here, but that is an internal API. We could also implement vectorized reading directly in Iceberg, but that would largely be reinventing the wheel.

felixcheungu commented 5 years ago

Isn't this more an issue in the Spark codebase? I think a key prerequisite is to have the Parquet Data Source in Spark on the Data Source v2 API first.

mccheah commented 5 years ago

That's one way to do it; right now, though, Iceberg implements its own Parquet I/O and it doesn't do vectorized reads. It also depends on whether Iceberg continues to handle the file I/O layer or just serves as the metastore / catalog API and delegates to the Spark data source.

felixcheungu commented 5 years ago

Good point, I'm seeing a lot of problems now with not using the "native" Parquet data source.

mccheah commented 5 years ago

I think there's some merit to pushing all the low-level optimizations down into parquet-mr - that way they're available everywhere: Spark, Iceberg, and elsewhere. Much of that work isn't specific to Spark, from my understanding.

If we did it that way, the delta between using Spark's Parquet data source and implementing one's own isn't that large. It becomes mostly a matter of how to bootstrap the parquet-mr modules.

rdblue commented 5 years ago

There are a few problems with the vectorized read path in Spark: it only supports flat schemas and has no support for schema evolution.

What we've been planning is to use the Arrow integration in Spark to get vectorized reads. There's already an implementation of Spark's ColumnarBatch that translates from an Arrow RowBatch for the PySpark/Pandas integration. Having deserialization directly to Arrow would benefit multiple engines. We're considering using it in the Presto connector as well, and it would work well in a record service (possibly using Flight).
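To make that wrapper path concrete, here is a minimal sketch (not code from this repo) of exposing Arrow data to Spark through the public ArrowColumnVector and ColumnarBatch classes. The hand-filled IntVector is a stand-in for data that a real read path would decode from Parquet pages.

```java
import java.util.Iterator;

import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.IntVector;
import org.apache.spark.sql.catalyst.InternalRow;
import org.apache.spark.sql.vectorized.ArrowColumnVector;
import org.apache.spark.sql.vectorized.ColumnVector;
import org.apache.spark.sql.vectorized.ColumnarBatch;

public class ArrowToColumnarBatchExample {
  public static void main(String[] args) {
    try (RootAllocator allocator = new RootAllocator(Long.MAX_VALUE);
         IntVector ids = new IntVector("id", allocator)) {
      // Populate an Arrow vector by hand; in a real read path these values
      // would come from decoding Parquet pages.
      ids.allocateNew(3);
      for (int i = 0; i < 3; i++) {
        ids.setSafe(i, i * 10);
      }
      ids.setValueCount(3);

      // Wrap the Arrow vector in Spark's ArrowColumnVector and expose it as a
      // ColumnarBatch, the representation Spark's vectorized scans consume.
      ColumnVector[] columns = new ColumnVector[] { new ArrowColumnVector(ids) };
      ColumnarBatch batch = new ColumnarBatch(columns);
      batch.setNumRows(3);

      // Operators that are not columnar can still view the batch row by row.
      Iterator<InternalRow> rows = batch.rowIterator();
      while (rows.hasNext()) {
        System.out.println(rows.next().getInt(0));
      }
    }
  }
}
```

Under this approach the Parquet decoder would populate the Arrow vectors directly, and the same Arrow batch could be handed to engines other than Spark.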

mccheah commented 5 years ago

How would that integrate with the existing Parquet file format? This would be a blocker for many places that rely on Parquet in existing data sources and want to port over to Iceberg (which is what we're thinking of doing). We can't afford to lose the read optimizations there; that would be a performance regression for our Spark applications.

Are there ways to patch the vectorized read path in Spark to address those issues? Or can we support vectorized reads for the limited subset of cases where the current path does work?

rdblue commented 5 years ago

We could create vectorized readers in Iceberg that do the same thing as Spark, but I think it is a better use of time to deserialize to Arrow, which has support for memory optimizations like dictionary-encoded row batches.

All of this work comes down to the in-memory representation. Iceberg already has an API for deserializing pages to rows; it would add one for deserializing to a column batch.

Agreed that this is a performance blocker if you have flat tables and don't need schema evolution. But I don't think that producing Spark's ColumnarBatch would be much easier than producing Arrow.
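As a rough illustration of that split, here is a hypothetical sketch of a column-batch read path sitting alongside a row-oriented one. The interface names below are made up for this example and are not Iceberg APIs.

```java
// Hypothetical sketch only: these interfaces are illustrative, not Iceberg APIs.
import java.io.Closeable;
import java.util.Iterator;

import org.apache.spark.sql.vectorized.ColumnarBatch;

// Row-oriented path: Parquet pages are decoded into one record at a time.
interface RowReader<T> extends Closeable, Iterator<T> {
}

// Column-batch path: pages are decoded straight into batches of column
// vectors (e.g. Arrow vectors wrapped for Spark), so decoding cost is
// amortized across a batch instead of paid per record.
interface BatchReader extends Closeable, Iterator<ColumnarBatch> {
}
```

The point of the sketch is only that the two paths differ in the unit they produce; the page decoding underneath could be shared.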

mccheah commented 5 years ago

Just wanted to clarify (I'm also not too familiar with Arrow): would supporting the Arrow path give us vectorized reads for Parquet files? And would writing the Arrow integration give performance comparable to what existing Spark-on-Parquet users get today?

rdblue commented 5 years ago

Iceberg would deserialize to Arrow's RowBatch, which is a more useful in-memory representation because it can support non-Spark engines. Spark can use an Arrow RowBatch because there's a wrapper that translates it to Spark's internal columnar representation, ColumnarBatch. The vectorization part is the materialization to Arrow.

Performance should be comparable and would also support projection and nested data.

rdblue commented 5 years ago

I should also note a few things about performance:

mccheah commented 5 years ago

Cool, we can experiment with that, though we always use the vectorized reader today. Is there an expected timeline for the Arrow vectorized read path to land? We'd also be happy to help move that forward.

rdblue commented 5 years ago

I don't have a timeline for Parquet to Arrow reads. @julienledem has some experience with that and can probably comment on the amount of effort it would require.

I built the record-oriented read path here in a couple of weekends, so I think it would go fairly quickly. There's a lot of work we could do here, though, including getting the APIs right to take advantage of page skipping.

mccheah commented 5 years ago

It looks like Parquet-to-Arrow conversion is already available for Python at least: https://arrow.apache.org/docs/python/parquet.html. I think there's something similar for C++, but I'm still exploring the tech: https://github.com/apache/arrow/tree/master/cpp

I wonder what it would take to port this work over to Java? I would hope the algorithms in the implementations would translate between languages nicely.

rdblue commented 5 years ago

I've opened a new issue for this in the ASF project: https://github.com/apache/incubator-iceberg/issues/9

Let's move further discussion there. Thanks!