DeepLcom / sql-mock

A Python library to test your SQL models using mocked input data
MIT License
33 stars 5 forks source link

ARRAY JOIN operations are not recognised in Clickhouse queries #48

Closed smoothml closed 5 months ago

smoothml commented 6 months ago

Consider the following example:

import os

import pytest
from sql_mock.clickhouse import column_mocks as col
from sql_mock.clickhouse.table_mocks import ClickHouseTableMock
from sql_mock.table_mocks import table_meta

QUERY = """SELECT
    sum(1) AS impressions,
    city,
    browser
FROM
(
    SELECT
        ['Istanbul', 'Berlin', 'Bobruisk'] AS cities,
        ['Firefox', 'Chrome', 'Chrome'] AS browsers
)
ARRAY JOIN
    cities AS city,
    browsers AS browser
GROUP BY
    city,
    browser
"""

@table_meta(query=QUERY)
class QueryMock(ClickHouseTableMock):
    impressions = col.Int(0)
    city = col.String("Istanbul")
    browser = col.String("Firefox")

@pytest.fixture(autouse=True)
def clickhouse_credentials() -> None:
    os.environ["SQL_MOCK_CLICKHOUSE_HOST"] = "127.0.0.1"
    os.environ["SQL_MOCK_CLICKHOUSE_PORT"] = "8123"
    os.environ["SQL_MOCK_CLICKHOUSE_USER"] = ""
    os.environ["SQL_MOCK_CLICKHOUSE_PASSWORD"] = ""

def test_query() -> None:
    expected = [
        {
            "impressions": 1,
            "city": "Istanbul",
            "browser": "Firefox",
        },
        {
            "impressions": 1,
            "city": "Berlin",
            "browser": "Chrome",
        },
        {
            "impressions": 1,
            "city": "Bobruisk",
            "browser": "Chrome",
        },
    ]
    result = QueryMock.from_mocks(input_data=[])
    result.assert_equal(expected)

When testing this, sql-mock will fail the test with the following error:

FAILED tests/test_example.py::test_query - sql_mock.exceptions.ValidationError: You need to provide the following input mocks to run your query: ['browsers', 'cities']

The arguments to the ARRAY JOIN operation should not need to be mocked as they are columns in the sub-query.

smoothml commented 6 months ago

Ah, that's annoying. Thanks for looking into it @DaveDeepl. I wonder whether we could catch this specific exception and ignore it. I'll raise an issue in the sqlglot library about it.

smoothml commented 6 months ago

I have raised an issue with sqlglot: tobymao/sqlglot#3188

smoothml commented 6 months ago

When parsing the query with sqlglot, the joins object has a kind=ARRAY field. For example, for the query in my original when parsed with sqlglot.parse_one you get

Select(
  expressions=[
    Alias(
      this=Sum(
        this=Literal(this=1, is_string=False)),
      alias=Identifier(this=impressions, quoted=False)),
    Column(
      this=Identifier(this=city, quoted=False)),
    Column(
      this=Identifier(this=browser, quoted=False))],
  from=From(
    this=Subquery(
      this=Select(
        expressions=[
          Alias(
            this=Array(
              expressions=[
                Literal(this=Istanbul, is_string=True),
                Literal(this=Berlin, is_string=True),
                Literal(this=Bobruisk, is_string=True)]),
            alias=Identifier(this=cities, quoted=False)),
          Alias(
            this=Array(
              expressions=[
                Literal(this=Firefox, is_string=True),
                Literal(this=Chrome, is_string=True),
                Literal(this=Chrome, is_string=True)]),
            alias=Identifier(this=browsers, quoted=False))]))),
  joins=[
    Join(
      this=Table(
        this=Identifier(this=cities, quoted=False),
        alias=TableAlias(
          this=Identifier(this=city, quoted=False))),
      kind=ARRAY),  # <-- THIS LINE
    Join(
      this=Table(
        this=Identifier(this=browsers, quoted=False),
        alias=TableAlias(
          this=Identifier(this=browser, quoted=False))))],
  group=Group(
    expressions=[
      Column(
        this=Identifier(this=city, quoted=False)),
      Column(
        this=Identifier(this=browser, quoted=False))]))

To fix this bug I think we just need to exclude Join objects of this type in the helpers.validate_all_input_mocks_for_query_provided function.

Somtom commented 6 months ago

I haven't had time to check yet, but it could be that we just need to make the helpers.get_source_tables function more robust:

Right now it uses sqlglot's root.traverse(). There we loop over the sources. We could try to check for that source not to be of kind array. Maybe that already solves the problem.

smoothml commented 6 months ago

I'm not sure that assumption holds as it's certainly possible to combine joins:

SELECT
    sum(1) AS impressions,
    city,
    browser
FROM
(
    SELECT
        ['Istanbul', 'Berlin', 'Bobruisk'] AS cities,
        ['Firefox', 'Chrome', 'Chrome'] AS browsers
)
ARRAY JOIN
    cities AS city,
    browsers AS browser
JOIN (SELECT 'Istanbul' AS city, 'foo' AS bar) cte USING city
GROUP BY
    city,
    browser
SETTINGS joined_subquery_requires_alias=0

Note, in this example trying to add the column cte.bar to the output resulted in an error even though the result is filtered correctly, so perhaps this is just a bad thing to do in Clickhouse. I'm also not sure I've actually seen this done. I think there are two options:

  1. Be opinionated and raise an error for mixed join types as in my example above, but accept an ARRAY JOIN in isolation.
  2. Assume the user knows what they're doing and accept the above example.

I would be in favour of option 1. If it becomes a problem we can always change it.

As an aside, it seems you can now achieve the same result using Tuple as described at the bottom of this page.