Closed clairemcginty closed 4 weeks ago
Thanks for the new feature! I will try to take a look soon and make it to the 1.14.0 release.
As requested on the dev@parquet ML, I'll wait for this PR before starting the release process for 1.14.0.
BTW, should we postpone this feature to the 1.15.0 release? We can always release a new version as needed, and I can volunteer to be the release manager once this is ready.
I agree with @wgtmac. Let's create smaller PRs and make improvements, then release when we feel it is stable. During development, we may advertise this feature so others can start experimenting with it and give feedback before we actually release it. Releasing too early might make later improvements harder because we need to stay backward compatible.
Sounds good to me! This PR might take another week or two to get right. It would also be nice to release support for operations like Array#size at the same time, so pushing it to 1.15 would give us time to do that 👍
Sorry for the delay. I will try to finish another pass by the end of this week.
This overall LGTM! Thanks @clairemcginty for working on this and adding exhaustive tests!
great, I'm glad this implementation looks ok! I have a few more tests that I'd like to add around null handling + behavior of the Contains predicate on map types (I think that they should just work, but I haven't tried it out yet...). Will try to add those + address PR comments on Monday or Tuesday next week 👍
I tried adding a test case to TestRecordLevelFilters to test contains(eq(null)). My expectation was that if you have an array schema with an optional element type, this should return true if the array contains one or more null elements. However, I don't think this is possible to make work: I set a debugger on ValueInspector#update and ValueInspector#updateNull, and ValueInspector#update is only invoked for non-null elements, while ValueInspector#updateNull is only invoked if the entire array is null, which isn't exactly what we want, either.
So based on my current understanding, I don't think we can support a contains(eq(null)) predicate, and we can probably add a precondition check to the Contains constructor against a null predicate value. Wdyt @wgtmac ?
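For illustration, the precondition check suggested above could look roughly like the sketch below. EqSketch and ContainsSketch are hypothetical stand-ins written for this example, not the actual parquet-java classes, and the exception type and message are assumptions.

```java
import java.util.Objects;

// Hypothetical stand-in for an eq(column, value) predicate; NOT the parquet-java class.
final class EqSketch<T extends Comparable<T>> {
    final String column;
    final T value; // a plain eq(...) may legitimately carry a null value

    EqSketch(String column, T value) {
        this.column = Objects.requireNonNull(column, "column");
        this.value = value;
    }
}

// Hypothetical stand-in for Contains; only the constructor check matters here.
final class ContainsSketch<T extends Comparable<T>> {
    final EqSketch<T> underlying;

    ContainsSketch(EqSketch<T> underlying) {
        Objects.requireNonNull(underlying, "underlying predicate");
        // Assumed precondition: reject contains(eq(null)), since null elements
        // inside a list cannot be observed at the record level.
        if (underlying.value == null) {
            throw new IllegalArgumentException(
                "Contains does not support a null predicate value");
        }
        this.underlying = underlying;
    }
}

final class ContainsPreconditionDemo {
    // Returns true if constructing contains(eq(null)) is rejected.
    static boolean rejectsNull() {
        try {
            new ContainsSketch<Integer>(new EqSketch<Integer>("ids.element", null));
            return false;
        } catch (IllegalArgumentException expected) {
            return true;
        }
    }

    public static void main(String[] args) {
        // A non-null value is accepted.
        new ContainsSketch<Integer>(new EqSketch<Integer>("ids.element", 3));
        System.out.println("null rejected: " + rejectsNull());
    }
}
```

Failing fast in the constructor would surface the unsupported case when the filter is built rather than silently returning no matches at read time.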
@wgtmac, completely agree to have more people reviewing this. Thanks for pinging. I'll try to take a look this week.
hey @gszadovszky! all your requested changes have been addressed - anything else that's missing?
Thanks @gszadovszky for the detailed review! I'll take another pass shortly to get familiar with the latest changes and then merge it if there are no concerns.
Proposal to add a new FilterPredicate, Contains, that can be applied to List types and checks whether the specified element is present among the repeated values. It can be composed using And or Or.
The filtering logic is largely based on existing Eq predicates to apply filtering at the page/rowgroup level using statistics/dictionaries, with a specialized implementation in IncrementallyUpdatedFilterPredicateBuilder to do individual record-level filtering.
Jira
Tests
Commits
Style
mvn spotless:apply -Pvector-plugins
Documentation
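As a rough model of the record-level semantics described in the proposal, the sketch below treats a Contains predicate as "at least one repeated value equals the target" and composes two of them with And and Or. ListContainsModel and containsEq are hypothetical names for this example only; the sketch uses plain java.util.function.Predicate and does not reflect the parquet-java FilterApi or its statistics/dictionary filtering.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Objects;
import java.util.function.Predicate;

// Hypothetical model of the semantics only; NOT the parquet-java implementation.
final class ListContainsModel {

    // Models contains(eq(value)): true if any element of the list equals `value`.
    // A null list (the whole field missing) never matches in this model.
    static <T> Predicate<List<T>> containsEq(T value) {
        Objects.requireNonNull(value, "value");
        return list -> list != null && list.stream().anyMatch(value::equals);
    }

    public static void main(String[] args) {
        List<Integer> record = Arrays.asList(1, 2, 3);

        // And: the list must contain both values (possibly as different elements).
        Predicate<List<Integer>> both =
            ListContainsModel.<Integer>containsEq(1).and(ListContainsModel.<Integer>containsEq(4));

        // Or: the list must contain at least one of the values.
        Predicate<List<Integer>> either =
            ListContainsModel.<Integer>containsEq(1).or(ListContainsModel.<Integer>containsEq(4));

        System.out.println(both.test(record));   // false: 4 is not present
        System.out.println(either.test(record)); // true: 1 is present
    }
}
```

In this model each Contains is evaluated against the whole list independently, so an And of two Contains predicates can be satisfied by two different elements.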