apache / parquet-java

Apache Parquet Java
https://parquet.apache.org/
Apache License 2.0
2.49k stars 1.37k forks source link

PARQUET-34: Add #contains FilterPredicate for Array columns #1328

Closed clairemcginty closed 4 weeks ago

clairemcginty commented 2 months ago

Proposal to add a new FilterPredicate, Contains, that can be applied to List types, and check if the specified element is present among the repeated values. It can be composed using And or Or:

FilterPredicate predicate = contains(
  eq(binaryColumn("phoneNumbers.phone.kind"), Binary.fromString("cell"))
)

FilterPredicate predicate = or(
  contains(eq(binaryColumn("phoneNumbers.phone.kind"), Binary.fromString("cell"))),
  contains(eq(binaryColumn("phoneNumbers.phone.kind"), Binary.fromString("home")))
)

FilterPredicate predicate = and(
  contains(eq(binaryColumn("phoneNumbers.phone.kind"), Binary.fromString("cell"))),
  contains(eq(binaryColumn("phoneNumbers.phone.kind"), Binary.fromString("home")))
)

The filtering logic is largely based on existing Eq predicates to apply filtering at the page/rowgroup level using statistics/dictionaries, with a specialized implementation in IncrementallyUpdatedFilterPredicateBuilder to do individual record-level filtering.

Jira

Tests

Commits

Style

Documentation

wgtmac commented 2 months ago

Thanks for the new feature! I will try to take a look soon and make it to the 1.14.0 release.

wgtmac commented 2 months ago

As requested on the dev@parquet ML, I'll wait for this PR before starting the releasing process of 1.14.0.

wgtmac commented 2 months ago

BTW, should we postpone this feature to 1.15.0 release? We can always release new version as needed and I can volunteer to be the release manager once this is ready.

gszadovszky commented 2 months ago

BTW, should we postpone this feature to 1.15.0 release? We can always release new version as needed and I can volunteer to be the release manager once this is ready.

I agree with @wgtmac. Let's create smaller PRs and make improvements then release when we feel it stable. During the development, we may advertise this feature so others may start experimenting on it and give feedback before actually releasing it. A too early release might make later improvements harder because we need to be backward compatible.

clairemcginty commented 2 months ago

BTW, should we postpone this feature to 1.15.0 release? We can always release new version as needed and I can volunteer to be the release manager once this is ready.

I agree with @wgtmac. Let's create smaller PRs and make improvements then release when we feel it stable. During the development, we may advertise this feature so others may start experimenting on it and give feedback before actually releasing it. A too early release might make later improvements harder because we need to be backward compatible.

Sounds good to me! this PR might take another week or two to get right. It would also be nice to release support for operations like Array#size at the same time, so pushing it to 0.15 would give us time to do that 👍

wgtmac commented 1 month ago

Sorry for the delay. I will try to finish another pass by the end of this week.

clairemcginty commented 1 month ago

This overall LGTM! Thanks @clairemcginty for working on this and adding exhaustive test!

great, I'm glad this implementation looks ok! I have a few more tests that I'd like to add around null handling + behavior of the Contains predicate on map types (I think that they should just work, but I haven't tried it out yet...). Will try to add those + address PR comments on Monday or Tuesday next week 👍

clairemcginty commented 1 month ago

I tried adding a test case to TestRecordLevelFilters to test contains(eq(null)). My expectation was that if you have an array schema with an optional element type, this should return true if the array contains one or more null elements. However, I don't think this is possible to make work--I set a debugger on ValueInspector#update and ValueInspector#updateNull. and ValueInspector#update is only invoked for non-null elements, and ValueInspector#updateNull is only invoked if the entire array is null, which isn't exactly what we want, either.

So based on my current understanding, I don't think we can support a contains(eq(null)) predicate, and we can probably add a precondition check to the Contains constructor against a null predicate value. Wdyt @wgtmac ?

gszadovszky commented 1 month ago

@wgtmac, completely agree to have more people reviewing this. Thanks for pinging. I'll try to take a look this week.

clairemcginty commented 1 month ago

hey @gszadovszky! all your requested changes have been addressed - anything else that's missing?

wgtmac commented 4 weeks ago

Thanks @gszadovszky for the detail review! I'll take another pass shortly to be familiar with the latest change and then merge it if no concern.