apache / parquet-java

Apache Parquet Java
https://parquet.apache.org/
Apache License 2.0

filter2 API performance regression #1583

Open asfimport opened 9 years ago

asfimport commented 9 years ago

The new filter API seems to be much slower (or perhaps I'm using it wrong :)

Code using an UnboundRecordFilter:

ColumnRecordFilter.column(column,
    ColumnPredicates.applyFunctionToBinary(
        input -> Binary.fromString(value).equals(input)));

vs. code using FilterPredicate:

eq(binaryColumn(column), Binary.fromString(value));

The latter runs about twice as slowly on the same Parquet file (built using 1.6.0rc2).

Note: the reader is constructed using

ParquetReader.builder(new ProtoReadSupport()).withFilter(filter).build()

The new filter API based approach seems to create a whole lot more garbage (perhaps due to reconstructing all the rows?).
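The garbage hypothesis above can be sketched in isolation. The code below is hypothetical illustration code (the class and method names are invented, not Parquet internals): it contrasts evaluating a predicate directly against raw column bytes with assembling a full record (allocating a String per column per row) before testing the predicate. The second path allocates several short-lived objects for every row, matching or not.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: why testing a predicate against raw column bytes
// can create far less garbage than assembling every record first.
public class FilterGarbageSketch {
    // One "row" as raw column bytes, as a columnar reader might hold them.
    record RawRow(byte[] a, byte[] b, byte[] c) {}

    // Predicate-first: compare the target column's bytes directly;
    // no per-row String or record objects are created for non-matches.
    static int countMatchesOnRawBytes(List<RawRow> rows, byte[] target) {
        int matches = 0;
        for (RawRow row : rows) {
            if (Arrays.equals(row.c(), target)) matches++;
        }
        return matches;
    }

    // A fully assembled record with materialized String columns.
    record AssembledRow(String a, String b, String c) {}

    // Assemble-first: build the full record (allocating a String for every
    // column of every row) and only then test the predicate.
    static int countMatchesAfterAssembly(List<RawRow> rows, String target) {
        int matches = 0;
        for (RawRow row : rows) {
            AssembledRow assembled = new AssembledRow(
                new String(row.a(), StandardCharsets.UTF_8),
                new String(row.b(), StandardCharsets.UTF_8),
                new String(row.c(), StandardCharsets.UTF_8));
            if (assembled.c().equals(target)) matches++;
        }
        return matches;
    }
}
```

Both methods return the same match count; the difference is purely in per-row allocation, which is what a profiler would surface as extra garbage.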

Reporter: Viktor Szathmáry / @phraktle

Related issues:

Note: This issue was originally created as PARQUET-98. Please see the migration documentation for further details.

asfimport commented 9 years ago

Viktor Szathmáry / @phraktle: Any thoughts on this? No change with 1.6.0rc3.

asfimport commented 9 years ago

Viktor Szathmáry / @phraktle: anyone out there? :)

asfimport commented 9 years ago

Ryan Blue / @rdblue: Could you try out the problem data with VisualVM or another profiling tool to see if there's anything funny going on? It doesn't look like your code is the problem. If anything, I would have guessed the first version runs more slowly because it converts a String to Binary each time it runs.

asfimport commented 9 years ago

Viktor Szathmáry / @phraktle: My initial impression based on a quick profiling run is that it was creating a whole lot more garbage, seemingly instantiating all the protobufs (or just reading all columns of every row?), rather than just the ones matching the expression.

asfimport commented 9 years ago

Viktor Szathmáry / @phraktle: possibly related to https://issues.apache.org/jira/browse/PARQUET-128 ?

asfimport commented 9 years ago

Ryan Blue / @rdblue: PARQUET-128 does look like it might be related. Sorry I haven't had time to track this down lately. Maybe @isnotinvain can take a look.

asfimport commented 9 years ago

Alex Levenson / @isnotinvain: I will try to take a look, but it might be a while before I get a chance. The filter2 API still skips assembly for records that are filtered out (e.g. it does not instantiate a protobuf), and it only visits the columns you have asked for via the column projection API. However, it doesn't look at the columns used in the filter and project down to only those columns (I don't think the unbound record filter approach does either), and it still visits every one of the columns you want in the final output record; it does not short circuit. The unbound record filter might short circuit here, but how much that matters depends on the order the columns are visited in.
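Alex's description can be modeled with a small counter. This is a hypothetical simulation (the `scan` and `decode` names are invented, not Parquet code): assembly is skipped for filtered-out records, but every projected column of every row is still decoded, so the decode count grows with rows × columns rather than with the number of matches.

```java
import java.util.List;
import java.util.function.Predicate;

// Hypothetical model of filter2 scanning behavior: assembly happens only
// for matches, but all projected columns are decoded for every row.
public class ColumnVisitSketch {
    static int decodes = 0; // counts simulated column decodes

    static String decode(String raw) {
        decodes++;
        return raw;
    }

    // Each row holds projected columns [a, b, c]; the predicate applies to c.
    static int scan(List<String[]> rows, Predicate<String> predicateOnC) {
        int assembled = 0;
        for (String[] row : rows) {
            boolean match = predicateOnC.test(decode(row[2])); // filter column
            // Even for non-matches, the reader must advance past the other
            // projected columns; that advance is modeled here as a decode.
            decode(row[0]);
            decode(row[1]);
            if (match) assembled++; // assembly only happens for matches
        }
        return assembled;
    }
}
```

Scanning three 3-column rows performs nine decodes regardless of how many rows match, which is the behavior Alex describes: filtering saves assembly, not column visits.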

asfimport commented 9 years ago

Viktor Szathmáry / @phraktle:

Tested with 1.6.0rc6; the problem still exists: the filter2 API is more than 2x slower...

asfimport commented 9 years ago

Alex Levenson / @isnotinvain: I don't think anything has changed yet that would affect this, so no surprises there.

I've not had a chance to look into the performance difference yet. From your testing, is there anything specific you suspect is causing the issue, or is more investigation needed?

Thanks, Alex

asfimport commented 9 years ago

Viktor Szathmáry / @phraktle:

While I'm not familiar with the internals of the implementation, the main difference seems to be that with filter2, all columns are deserialized for all rows. Whereas with the classic approach, only the selector column is involved.

asfimport commented 9 years ago

Alex Levenson / @isnotinvain: Yes, I think that is the main difference. However, I'm a bit surprised it has much performance impact – for parquet to skip a value, it essentially has to deserialize it, so even when we aren't reading some values, we do have to skip them.

asfimport commented 9 years ago

Ryan Blue / @rdblue:

for parquet to skip a value, it essentially has to deserialize it

Maybe the problem is a defensive copy? Converting a byte array to a String would do it, too, which probably happens when the value is set in the object model.
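The cost Ryan suspects can be illustrated standalone. In this hypothetical sketch (invented names, not Parquet code), one path compares bytes in place while the other allocates a new String per call; the String constructor also copies the byte array, so over millions of rows the second path produces millions of short-lived objects.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical illustration of per-row conversion cost: a direct byte
// comparison allocates nothing, while byte[] -> String conversion does
// an allocation plus a defensive copy on every call.
public class DefensiveCopySketch {
    // Compares without allocating: no copy, no String.
    static boolean matchesBytes(byte[] value, byte[] target) {
        return Arrays.equals(value, target);
    }

    // Allocates a new String (which itself copies the bytes) on every
    // call, even when the row will not match.
    static boolean matchesViaString(byte[] value, String target) {
        return new String(value, StandardCharsets.UTF_8).equals(target);
    }
}
```

Both give the same answer; the difference only shows up as allocation pressure, which fits the "whole lot more garbage" observed while profiling.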

asfimport commented 9 years ago

Viktor Szathmáry / @phraktle: Not knowing much about the implementation, this is just conjecture – but based on the performance results and the extra amount of garbage during profiling, it seems to also read columns that are not needed for the query (e.g. if you have columns A, B, and C and you try to find rows where C='x', there's no need to read A and B – only when you have a match on C).

But in any case, this slowness is easy to reproduce from the example I have provided above; I'm sure someone familiar with the internals can figure this out without my guesses ;)

asfimport commented 9 years ago

Alex Levenson / @isnotinvain: @phraktle well that's my point, even though you don't need to read A and B, you've got to skip their current values so you can move on to the next record. That often involves essentially reading the value (think delta encoding for example). But as @rdblue said, we might be doing more work than needed to do the skip. Definitely worth investigating, and definitely room for improvement.
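The delta-encoding point can be made concrete with a minimal example. This is a hypothetical toy decoder, not Parquet's actual DELTA_BINARY_PACKED implementation: each stored value is a difference from the previous one, so "skipping" to value N still requires applying every delta before it, which is exactly the work of decoding.

```java
// Minimal delta codec: values are stored as differences from the
// previous value. Skipping ahead has no shortcut, so skip costs
// roughly the same as read.
public class DeltaSkipSketch {
    // Encode: each value is stored as a delta from the previous one
    // (the first value is a delta from an implicit 0).
    static long[] encode(long[] values) {
        long[] deltas = new long[values.length];
        long prev = 0;
        for (int i = 0; i < values.length; i++) {
            deltas[i] = values[i] - prev;
            prev = values[i];
        }
        return deltas;
    }

    // "Skip" to index i: every delta up to i must still be applied,
    // which is exactly the work of decoding those values.
    static long valueAt(long[] deltas, int i) {
        long value = 0;
        for (int j = 0; j <= i; j++) {
            value += deltas[j];
        }
        return value;
    }
}
```

Reaching the last value requires summing all the deltas before it, which is why skipping columns A and B is not free even when only column C is needed for the predicate.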

asfimport commented 9 years ago

Viktor Szathmáry / @phraktle:

Not sure how the "classic" filter implementation gets by, but it seems to be doing a much better job (2-3x faster).

asfimport commented 9 years ago

Viktor Szathmáry / @phraktle: Tested with 1.6.0 release, still has this issue.

asfimport commented 9 years ago

Ryan Blue / @rdblue: @phraktle, to save you some time, the 1.7.0 release will also have this problem. I'll find some time to look into it further.