The main change here concerns slicing constraints.
EnumerateCaveats needs to know which subset of a dataframe it is
responsible for enumerating caveats on. To track this, it maintains
something it calls a Slice: a set of column/expression pairs, each
naming a column (that we're interested in caveats on) and a
boolean-valued expression (selecting the rows for which we're
interested in caveats on that column). The same technique is used to
indicate for which rows we want row-level caveats.
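To make the shape of a Slice concrete, here is a minimal sketch in plain
Python. It is not the actual implementation: the class name, field names,
and helper methods are hypothetical, and the boolean-valued expressions
are modeled as Python callables over a row dict rather than Spark
expressions.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict

Row = Dict[str, object]
Predicate = Callable[[Row], bool]


@dataclass
class Slice:
    """Hypothetical stand-in for the Slice described above."""
    # Rows for which we want row-level caveats.
    row: Predicate
    # Column name -> rows for which we want caveats on that column.
    attributes: Dict[str, Predicate] = field(default_factory=dict)

    def wants_row(self, r: Row) -> bool:
        return self.row(r)

    def wants_attribute(self, column: str, r: Row) -> bool:
        # Columns absent from the slice are never of interest.
        pred = self.attributes.get(column)
        return pred is not None and pred(r)


# Example: row-level caveats for rows with id > 1, and caveats on
# "price" only for rows where price is missing.
example = Slice(
    row=lambda r: r["id"] > 1,
    attributes={"price": lambda r: r["price"] is None},
)
```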
The problem is that with joins, it's possible for a single slice
condition to refer to expressions on both sides of the join. Previously
this was handled for row-level conditions by setting the condition to
true (i.e., return ALL errors... a problem I think we might have a
Vizier issue for). However, the same check was not happening for
attribute-level predicates.
This commit applies the correct behavior across the entire join, and
also refines it to use an Exists subquery to return a far smaller set of
caveats.
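The difference between the two strategies can be sketched in plain
Python. This is one plausible reading of the change, not the code in this
commit: the function names are hypothetical, predicates are callables
over (left row, right row) pairs, and it assumes the Exists subquery
tests both the join condition and the slice condition against the other
side of the join.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]
PairPredicate = Callable[[Row, Row], bool]


def left_rows_widened(left: List[Row]) -> List[Row]:
    # Old row-level behavior: the condition becomes "true", so every
    # left-hand row is enumerated.
    return list(left)


def left_rows_exists(
    left: List[Row],
    right: List[Row],
    join_cond: PairPredicate,
    slice_cond: PairPredicate,
) -> List[Row]:
    # Exists-style behavior: keep a left-hand row only if some
    # right-hand row joins with it AND satisfies the slice condition.
    return [
        l for l in left
        if any(join_cond(l, r) and slice_cond(l, r) for r in right)
    ]


left = [{"id": 1, "a": 5}, {"id": 2, "a": 50}, {"id": 3, "a": 7}]
right = [{"id": 1, "b": 1}, {"id": 2, "b": 2}]

# The slice condition refers to attributes from both sides of the join.
selected = left_rows_exists(
    left,
    right,
    join_cond=lambda l, r: l["id"] == r["id"],
    slice_cond=lambda l, r: l["a"] + r["b"] > 10,
)
```

With this toy data the widened condition enumerates all three left-hand
rows, while the Exists-style check keeps only the row with id 2.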
A few other adjustments:
- Version bump to 0.2.5
- Added a bunch of comments
- Propagated constraint parameter to DataFrameImplicits.listCaveats
- spark.expressionLogic.attributesOfExpression now only returns
  correlated attributes for a nested subquery expression (as opposed to
  all attributes in the correlating expression)
- spark.expressionLogic.inline now ignores undefined attributes rather
  than replacing them. This is a little less safe, but not doing this
  will require a TON of work to safely handle correlated subqueries.