Open lr4d opened 3 years ago
For the benchmark, you should evaluate larger datasets: with datasets this small, the overhead of the operations dominates, not the predicate evaluation.
The most performance-critical part is probably not the index filtering but the filtering on the partitions themselves after the data is loaded. In any case, you should be able to construct benchmarks using the filter_array_like function of kartothek.serialization, since this is where it matters most.
I expect the rewrite to show a significant drawback when the value contains many elements, not just four. What's the motivation for rewriting this?
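The concern about values with many elements can be checked with a rough timing comparison. The sketch below is a pure-Python stand-in and does not call kartothek's filter_array_like; the helper names mask_in and mask_eq_disjunction, the column size, and the value counts are all made up for illustration. It only shows the shape of the scaling: one membership pass per "in" term versus one pass per value for the rewritten form.

```python
import timeit


def mask_in(column, values):
    """Evaluate one "in" term: a single hashed membership test per row."""
    lookup = frozenset(values)
    return [cell in lookup for cell in column]


def mask_eq_disjunction(column, values):
    """Evaluate the rewritten form: one "==" pass per value, OR-ed together."""
    mask = [False] * len(column)
    for v in values:
        mask = [m or cell == v for m, cell in zip(mask, column)]
    return mask


# Hypothetical data: 20,000 rows with 1,000 distinct values.
column = [i % 1000 for i in range(20_000)]

for n_values in (4, 64):
    values = list(range(n_values))
    # Both forms must select exactly the same rows.
    assert mask_in(column, values) == mask_eq_disjunction(column, values)
    t_in = timeit.timeit(lambda: mask_in(column, values), number=3)
    t_eq = timeit.timeit(lambda: mask_eq_disjunction(column, values), number=3)
    print(f"{n_values:>3} values: in={t_in:.4f}s  ==-disjunction={t_eq:.4f}s")
```

Absolute timings depend on the machine, and an array-based implementation may behave quite differently, so a real benchmark should go through the library's own filtering code on realistically sized partitions.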
> For the benchmark you should evaluate larger Datasets since for these small Datasets the overhead of the operations are dominant and not the predicate evaluation.
Sure.
> What's the motivation for rewriting this?
Less performance-critical code to maintain. I also wonder how this would affect performance. We'd also have simpler predicate handling in internal code, but I'm not sure how important that is.
We are building predicates automatically from a dataframe of partitions. The naive approach resulted in predicates that are disjunctions of more than 1,000 (sometimes 10,000) conjunctions. Think of
[
    [
        ("a", "in", [f"value_{x}" for x in range(8)]),
        ("b", "in", [2012, 2013, 2014, 2015, 2016, 2017, 2018]),
        ("c", "in", [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]),
    ]
]
which would, if translated, result in 728 conjunctions of "==" statements. We failed to load any data this way and thus had to add functionality that simplifies predicates (combining disjunctions into "in" statements). So please, when implementing or benchmarking this, consider combined predicates like the one above and large datasets.
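The figure of 728 follows from the Cartesian product of the three value lists (8 * 7 * 13). A minimal sketch of such an expansion, assuming predicates are plain lists of (column, operator, value) tuples; expand_in_terms is a hypothetical helper, not a kartothek function:

```python
from itertools import product


def expand_in_terms(conjunction):
    """Expand one conjunction containing "in" terms into "=="-only conjunctions."""
    # For each term, collect the alternatives it can be replaced with.
    alternatives = []
    for column, op, value in conjunction:
        if op == "in":
            alternatives.append([(column, "==", v) for v in value])
        else:
            alternatives.append([(column, op, value)])
    # Each combination of alternatives becomes one conjunction of the disjunction.
    return [list(combo) for combo in product(*alternatives)]


combined = [
    ("a", "in", [f"value_{x}" for x in range(8)]),
    ("b", "in", [2012, 2013, 2014, 2015, 2016, 2017, 2018]),
    ("c", "in", [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]),
]

expanded = expand_in_terms(combined)
print(len(expanded))  # 8 * 7 * 13 = 728
```

The number of conjunctions grows multiplicatively with each additional "in" term, which is why benchmarks with a single four-element "in" term say little about this case.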
Problem description
We use the "in" operator internally in predicate parsing, but we can just re-write the predicates to use a disjunction of "==" terms, e.g. [[('A', 'in', [1, 4, 9, 13])]] -> [[('A', '==', 1)], [('A', '==', 4)], [('A', '==', 9)], [('A', '==', 13)]]
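To illustrate that the two forms are equivalent, here is a minimal sketch that evaluates both against a few rows. It assumes the outer list is a disjunction and each inner list a conjunction; the matches helper and the OPS table are invented for this example and support only the two operators involved, so they do not reflect how kartothek evaluates predicates internally.

```python
import operator

# Only the two operators used in the example are supported here.
OPS = {
    "==": operator.eq,
    "in": lambda cell, values: cell in values,
}


def matches(row, predicates):
    """Return True if `row` (a dict) satisfies predicates in disjunctive normal form."""
    return any(
        all(OPS[op](row[column], value) for column, op, value in conjunction)
        for conjunction in predicates
    )


with_in = [[("A", "in", [1, 4, 9, 13])]]
with_eq = [[("A", "==", 1)], [("A", "==", 4)], [("A", "==", 9)], [("A", "==", 13)]]

rows = [{"A": a} for a in range(15)]
selected_in = [row["A"] for row in rows if matches(row, with_in)]
selected_eq = [row["A"] for row in rows if matches(row, with_eq)]
print(selected_in == selected_eq)  # True: both forms select the same rows
```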
We could implement this re-write when a user passes predicates involving "in", before the predicates are evaluated. This seems to be as fast as or faster than our current evaluation of predicates in micro-benchmarks (see below).
Example code (ideally copy-pastable)