alamb opened this issue 4 months ago
take
FYI @Lordworms I think this will be a pretty challenging task
Also, @peter-toth has an outstanding substantial change to TreeNode APIs here: https://github.com/apache/arrow-datafusion/pull/8891 which you should be aware of.
Yes. My basic plan is to change the recursion in TreeNode into iteration. I will test my implementation first and read PR #8891.
Starting on this one; the current plan is to rewrite the infer_placeholder_types function to use iteration.
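The general pattern for this kind of rewrite is to replace the call stack with an explicit stack. A minimal, hedged sketch with a toy tree type (this is not DataFusion's actual TreeNode or infer_placeholder_types code, just an illustration of the technique):

```rust
// Toy binary tree standing in for an expression tree; the real
// DataFusion types are Expr / TreeNode, used here only for illustration.
enum Node {
    Leaf(i64),
    Branch(Box<Node>, Box<Node>),
}

// Count leaves without recursion: an explicit Vec serves as the call
// stack, so arbitrarily deep trees cannot overflow the thread stack.
fn count_leaves(root: &Node) -> usize {
    let mut stack = vec![root];
    let mut leaves = 0;
    while let Some(node) = stack.pop() {
        match node {
            Node::Leaf(_) => leaves += 1,
            Node::Branch(l, r) => {
                stack.push(l);
                stack.push(r);
            }
        }
    }
    leaves
}

fn main() {
    // Build a deep left-leaning chain, far deeper than naive recursion
    // would comfortably handle on a small stack.
    let mut tree = Node::Leaf(0);
    for i in 1..10_000 {
        tree = Node::Branch(Box::new(tree), Box::new(Node::Leaf(i)));
    }
    assert_eq!(count_leaves(&tree), 10_000);
}
```

The same worklist idea applies to any TreeNode visitor whose work per node does not depend on unfinished parent state.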
@Lordworms Are you still working on this?
Sorry, I forgot about this one; you can go for it.
@alamb I'm thinking of introducing CommutativeExpr that specializes this kind of query to make it possible to transform iteratively. What do you think about this?
```rust
#[derive(Clone, PartialEq, Eq, Hash, Debug)]
pub struct CommutativeExpr {
    /// Expressions
    pub exprs: Vec<Expr>,
    /// The operator that is commutative (order-insensitive), like OR, AND, StringConcat, bitwise ops
    pub op: Operator,
}
```
`CommutativeExpr { exprs: vec![a, b, c], op: Operator::Or }` is equivalent to `Binary(Binary(a, b, OR), c, OR)`.
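Building that flat exprs list from a nested binary chain can itself be done iteratively. A hedged sketch with simplified stand-in types (Expr and Op below are toys, not DataFusion's actual types):

```rust
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Column(String),
    Binary { left: Box<Expr>, op: Op, right: Box<Expr> },
}

#[derive(Debug, Clone, Copy, PartialEq)]
enum Op {
    Or,
}

// Flatten e.g. Binary(Binary(a, Or, b), Or, c) into [a, b, c] using an
// explicit stack, so the depth of the chain does not matter.
fn flatten(expr: Expr, target: Op) -> Vec<Expr> {
    let mut out = Vec::new();
    let mut stack = vec![expr];
    while let Some(e) = stack.pop() {
        match e {
            Expr::Binary { left, op, right } if op == target => {
                // Push right first so operands come out left-to-right.
                stack.push(*right);
                stack.push(*left);
            }
            other => out.push(other),
        }
    }
    out
}

fn main() {
    let a = Expr::Column("a".into());
    let b = Expr::Column("b".into());
    let c = Expr::Column("c".into());
    let chain = Expr::Binary {
        left: Box::new(Expr::Binary {
            left: Box::new(a.clone()),
            op: Op::Or,
            right: Box::new(b.clone()),
        }),
        op: Op::Or,
        right: Box::new(c.clone()),
    };
    assert_eq!(flatten(chain, Op::Or), vec![a, b, c]);
}
```

Note that mixed-operator chains (A OR B AND C) would stop flattening at the operator boundary, which is why the proposal only covers same-operator chains.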
I think that sounds like a good idea to me
One thing that would be nice would be to somehow avoid having two potential representations for the same expression. So would A OR B be represented as `BinaryExpr { left: A, op: Or, right: B }` or `CommutativeExpr { exprs: [A, B], op: Or }`?
🤔
But that is starting to sound like a large change 🤔
We could restrict the minimum number of elements of CommutativeExpr to 3.
That could make sense
My biggest concern with this proposal is its potential impact on backwards compatibility / causing API churn to solve a very narrow use case.
I wonder if you have considered the approach to turning a stack overflow into an error?
So maybe add a configuration flag like max_expression_depth = 10 or something, and then if that depth is exceeded in SqlToRel, raise an error? That would protect against crashes / stack overflows but still allow people who wanted more complex expressions (and were willing to raise their stack sizes) to run them.
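A depth cap along these lines could be checked while walking the parsed expression tree. A sketch of the idea (max_expression_depth and the error message are hypothetical names, not an existing DataFusion configuration option):

```rust
// Toy expression tree; in SqlToRel the input would be sqlparser's ast::Expr.
enum Expr {
    Leaf,
    Binary(Box<Expr>, Box<Expr>),
}

// Walk iteratively, tracking each node's depth, and bail out with an
// error instead of overflowing the stack. `max_expression_depth` is a
// hypothetical configuration knob.
fn check_depth(root: &Expr, max_expression_depth: usize) -> Result<(), String> {
    let mut stack = vec![(root, 1usize)];
    while let Some((node, depth)) = stack.pop() {
        if depth > max_expression_depth {
            return Err(format!(
                "expression nesting exceeds max_expression_depth ({max_expression_depth})"
            ));
        }
        if let Expr::Binary(l, r) = node {
            stack.push((l.as_ref(), depth + 1));
            stack.push((r.as_ref(), depth + 1));
        }
    }
    Ok(())
}

fn main() {
    // Build a chain nested 100 levels deep, like a OR b OR c OR ...
    let mut e = Expr::Leaf;
    for _ in 0..100 {
        e = Expr::Binary(Box::new(e), Box::new(Expr::Leaf));
    }
    assert!(check_depth(&e, 1000).is_ok());
    assert!(check_depth(&e, 10).is_err());
}
```

Because the check itself is iterative, it is safe to run on inputs far deeper than the limit before rejecting them.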
I'm able to run the large OR list without a stack overflow. As far as I can tell, we don't need to worry about backwards compatibility; this does not need to break any existing API.

causing API churn to solve a very narrow use case

I think we can handle the simple case where the query uses a single operator, like a large OR list or a large AND list, but we can't handle mixed cases like A OR B AND C OR D AND E.... For those cases, we still need to raise a runtime error.
If fully handling large OR / AND list queries with the same operator is worth adding a new Expr, I think we could go for it. Otherwise we should count the transformations and raise the error.
I recommend we start with the error (to avoid a stack overflow), and then if someone comes with a use case where they need a super deep OR tree we can figure out if it is worth adding additional code.
I think one common case of large OR chains is generated SQL like col = 1 OR col = 2 OR col = 3 OR ..., and it is typically better to change the application to generate SQL like col IN (1, 2, 3, ...) instead, which is both less SQL (and less nested) and faster.
It seems supporting a configurable max recursion depth involves a huge breaking change to the tree traversal transform_* functions, given they are already widely used. I would like to avoid such a huge change just for an early-return error. 😢
Given that stack size is not always the same, I'm not sure a constant max recursion depth is a good idea either.
Here is how the recursion guard is implemented in sqlparser: https://github.com/sqlparser-rs/sqlparser-rs/blob/f9ab8dcc27fd2d55030b9c5fa71e41d5c08dd601/src/parser/mod.rs#L67-L127
And then at the start of each major statement, this gets called:
So I don't think we would have to change the transform_* functions, but simply update the closures being called by those functions.
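The sqlparser-style guard boils down to a shared counter that is decremented when a recursive step begins and restored when its scope ends. A simplified, std-only sketch of the pattern (the names below are illustrative, not sqlparser's exact API):

```rust
use std::cell::Cell;
use std::rc::Rc;

// Remaining recursion budget, shared between the guard objects.
struct RecursionCounter {
    remaining: Rc<Cell<usize>>,
}

// Restores one unit of budget when the guarded scope ends.
struct DepthGuard {
    remaining: Rc<Cell<usize>>,
}

impl Drop for DepthGuard {
    fn drop(&mut self) {
        self.remaining.set(self.remaining.get() + 1);
    }
}

impl RecursionCounter {
    fn new(limit: usize) -> Self {
        RecursionCounter { remaining: Rc::new(Cell::new(limit)) }
    }

    // Called at the start of each recursive step: errors out once the
    // budget is exhausted instead of letting the stack overflow.
    fn try_decrease(&self) -> Result<DepthGuard, String> {
        let left = self.remaining.get();
        if left == 0 {
            return Err("recursion limit exceeded".to_string());
        }
        self.remaining.set(left - 1);
        Ok(DepthGuard { remaining: Rc::clone(&self.remaining) })
    }
}

// A recursive function instrumented with the guard; the closure passed
// to transform_* could carry a counter like this the same way.
fn descend(counter: &RecursionCounter, depth: usize) -> Result<(), String> {
    let _guard = counter.try_decrease()?;
    if depth == 0 {
        return Ok(());
    }
    descend(counter, depth - 1)
}

fn main() {
    assert!(descend(&RecursionCounter::new(50), 10).is_ok());
    assert!(descend(&RecursionCounter::new(5), 10).is_err());
}
```

The Drop-based guard means the budget is restored automatically on every exit path, including early returns via `?`.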
I came across stacker (https://docs.rs/stacker/latest/stacker/index.html) and a convenience crate (https://docs.rs/recursive/latest/recursive/index.html) built on top of stacker, which dynamically grow the stack during deep recursion. According to their documentation they aren't zero cost, but it could be worth trying them out and seeing how they do in some benchmarks.
I evaluated stacker as part of sqlparser and I thought it was doing some stuff that was too crazy, which made it hard to use in embedded / wasm environments. Maybe that is better now.
Describe the bug
In InfluxDB we saw people issue queries with many OR chains that caused a stack overflow

To Reproduce
blowout2.zip
Download: blowout.zip
And run
This results in
The query looks like this
Expected behavior
A runtime error rather than a stack overflow. Bonus points if the query actually completed.
Additional context
Here is the stack trace in a release build: