muescha commented 1 week ago

Setup

dolt as mysql server MySQL in a Docker container

Error on Dolt

On Dolt:

SELECT SUM(found)
FROM (
    (SELECT 2 as found FROM dual)
) as all_found;

Error:

restdb> SELECT SUM(found)
        FROM (
            (SELECT 2 as found FROM dual)
        ) as all_found
[2024-06-24 19:43:04] [HY000][1105] Every derived table must have its own alias at position 108 near 'dual'

MySQL

no error

restdb> SELECT SUM(found)
        FROM (
                 (SELECT 2 as found FROM dual)
             ) as all_found
[2024-06-24 19:41:21] 1 row retrieved starting from 1 in 51 ms (execution: 15 ms, fetching: 36 ms)

SUM(found)
2

muescha commented 1 week ago

it seems dolt don't like the second () around the subquery while MySQL accecpt it:

working on dolt:

SELECT SUM(found)
FROM (
    SELECT 2 as found FROM dual
) as all_found;

fulghum commented 1 week ago

Thanks for reporting this one. I'm able to reproduce the error and I see that it's coming from our parser layer. Simply adding a new rule for this syntax results in a shift/reduce conflict, so we'll need to find a different way to support this. That typically either means rewriting existing rules to get rid of the conflict, or customizing the tokenizer to try and work around the conflict.

I'm going to try out a few ideas, but heads up that parser conflicts can sometimes be tricky to untangle. In the meantime, do you have a workaround for this issue? Just wanted to make sure this wasn't a blocking issue.

muescha commented 1 week ago

no blocking issue.

While playing with the query from #8052 I removed the second union and find this special error.

muescha commented 1 week ago

Fun fact:

With UNION there is no error at dolt:

SELECT SUM(found)
FROM (
   (SELECT 1 as found FROM dual)
   UNION
   (SELECT 1 as found FROM dual)
) as all_found;

fulghum commented 1 week ago

I tried a few ideas today and I was able to resolve the grammar conflict, but... only if I removed part of a rule that is providing some helpful error checking.

I added a new production rule to the table_factor rule for the syntax with extra parens wrapping the subquery, but it's ambiguous with this production rule that does some error checking for an alias. Note that this also doesn't fully match MySQL's syntax, since it only supports one extra pair of parens, while MySQL seems to not have a limit. It's also unfortunate that the helpful error message ("Every derived table must have its own alias") simply becomes a "syntax error" message and provides less direction to customers on how to resolve the issue. So, based on those issues, I'm not feeling like this is a good solution.

Nice find on the UNION example above. This one works because the grammar allows a UNION statement to have parens around the left-hand side and the right-hand side and that statement is a subquery that has parens around it. It triggers the same error from the parser if you add another set of parens around the entire statement, like this:

SELECT SUM(found) FROM ( (  (SELECT 1 as found FROM dual)    UNION    (SELECT 1 as found FROM dual) ) ) as all_found;
Error parsing SQL: 
Every derived table must have its own alias at position 104 near 'dual'

This is a good issue for us to track as we think about how we keep evolving our SQL parser to match less commonly used MySQL syntax like this. I don't see a quick fix for enabling this syntax yet, but I've made a new tag (parser) for issues like this and we'll keep this in mind as we schedule larger projects around improving the parser.

Until then, if this does become a blocking issue, please comment and let us know so we can help figure out a path forward or reprioritize this one.

muescha commented 1 week ago

Can it just resoved if there are two (or more) (( have matching )) then reduce it to just one pair of ( and )

Then it should be equal to the MySQL syntax which allows more than two

fulghum commented 1 week ago

Thanks for the suggestion. I think we could make this syntax work with some preprocessing on the string like you mention. It may be just slightly tricky to parse, since there are cases where we can't safely reduce, but I think that's easily doable by simply checking for the correct nesting levels of the parens.

My only hesitations with this approach would be that 1) it adds an additional upfront parsing step that would add extra latency for all queries, even though 99% of queries probably don't need this processing, and 2) the repeated parens are only valid in MySQL in parenthesized query expressions, and by pre-processing, we wouldn't have enough state to understand the semantics of the expression, so we'd apply this paren reduction globally in the statement, which would mean we would end up allowing these repeated parens anywhere parens are used, which would diverge from MySQL's behavior.

It's not a bad idea to explore though, and I think if this does end up becoming a blocking issue for customers, then it might be our best way to work around this until we can get better modeling for parenthesized query expressions into the grammar.

dolthub / dolt

[HY000][1105] Every derived table must have its own alias #8058

Setup

Error on Dolt

MySQL