[SIP-115] Parsing SQL expression block for virtual datasets into separate CTE and SELECT statements

TechAuditBI commented 10 months ago

[SIP-115] Proposal for Pre SQL block for virtual datasets

Motivation

Currently huge piece of SQL functionality couldn't be used in virtual datasets due to the fact that dataset query is being put into from section in a pattern "from ("SQL query") as virtual_table. This limits plenty of SQL dialects from using non SELECT statements in a dataset query including for example temp tables, variables or WITH statements. For example Postgre SQL does support WITH statements inside of a SELECT statement but NS SQL does not. Also it would be very useful to be able to actually create temporary tables in a dataset query but as far as I know there is no SQL dialect that supports CREATE statements inside of SELECT ones.

P.S. Actually the whole idea is to make it so that any possible query that can be completed in sql lab should be a valid query for a virtual dataset.

Proposed Change

The change that is being proposed is to parse SQL statement under the hood to distinguish actual final SELECT statement from all the preceding auxillary statements and put the auxillary part in the beginning of every query being built by viz plugins later.

So user should be able to use the following query as a virtual dataset query:

DROP TABLE IF EXISTS #tmp;

DECLARE @t1 TABLE (itmID INT, itmIDComp INT);
INSERT @t1 VALUES (1,10), (2,10);

DECLARE @t2 TABLE (itmID INT, itmIDComp INT);
INSERT @t2 VALUES (3,10), (4,10);

WITH vw AS
(
    SELECT itmIDComp, itmID
    FROM @t1

    UNION ALL

    SELECT itmIDComp, itmID
    FROM @t2
)
, r AS
(
    SELECT t.itmID AS itmIDComp
           , NULL AS itmID
           , CAST(0 AS BIGINT) AS N
           , 1 AS Lvl
    FROM (SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL SELECT 4) AS t (itmID)

UNION ALL

SELECT t.itmIDComp
    , t.itmID
    , ROW_NUMBER() OVER(PARTITION BY t.itmIDComp ORDER BY t.itmIDComp, t.itmID) AS N
    , Lvl + 1
FROM r
    JOIN vw AS t ON t.itmID = r.itmIDComp
)

CREATE TEMP TABLE #tmp
(Lvl INT, N INT);

INSERT INTO #tmp
SELECT Lvl, N FROM r;

----------------SPLIT POINT------------------

SELECT 
   Lvl,
   MAX(N) as max_n
FROM #tmp
GROUP BY
   Lvl;

And everything that is above the split point should be at least automaticcaly put at the beginning of every query generated by viz plugins.

New or Changed Public Interfaces

None should be needed.

New dependencies

To be filled.

Migration Plan and Compatibility

To be filled.

Rejected Alternatives

Rejected due to overcomplicating the UI while also making it less intuitive and userfriendly.

Proposed change is to include a new "Pre SQL" field that would contain SQL that would be put before all SELECT statements for every query on that dataset.

john-bodley commented 10 months ago

@TechAuditBI ideally from a UI/UX perspective it would be great if there was only one SQL field which the backend would parse—be that via sqlparse or sqlglot—to extract the pre-SQL.

cc @betodealmeida

betodealmeida commented 10 months ago

Yeah, I agree that automatically splitting the CTE from the query is a better UX.

In general, I think it would be good if Superset had an understanding of the virtual dataset when exploring it. For example, if I have a virtual dataset like this:

SELECT country, COUNT(*)
FROM t
GROUP BY 1

It would be nice when creating a chart Superset parsed the query and understood that there are metrics and dimensions in it already. Maybe there's a mode that would automatically populate the UI so that COUNT(*) shows as a metric, and country shows as a selected dimension. Today, Superset would try to build another aggregation on top of this aggregation, which maybe is what the user want, but in my experience going from SQL to viz you really want to explore the unaggregated query, and be able to modify it in the UI.

Using a parser to move the CTE around when exploring the virtual dataset sounds like a good first step in that direction.

To me, the bigger problem of requiring a pre_sql is that it would make it cumbersome to go directly from SQL Lab to visualization, requiring the user to manually split their query.

TechAuditBI commented 9 months ago

It would be nice when creating a chart Superset parsed the query and understood that there are metrics and dimensions in it already. Maybe there's a mode that would automatically populate the UI so that COUNT(*) shows as a metric, and country shows as a selected dimension. Today, Superset would try to build another aggregation on top of this aggregation, which maybe is what the user want, but in my experience going from SQL to viz you really want to explore the unaggregated query, and be able to modify it in the UI.

@betodealmeida Speaking of parsing a query into metrics and dimensions. I think that such level of parsing would actually reduce resulting functionality. Because if I want to calculate COUNT(*) grouped by country I could simply use the table "t" as a dataset. But if my goal was to actually aggregate my data before creating a viz I would not be able to do that anymore.

TechAuditBI commented 9 months ago

@rusackas @john-bodley I've updated the description. Please check whether I've skipped any important parts.

betodealmeida commented 9 months ago

@betodealmeida Speaking of parsing a query into metrics and dimensions. I think that such level of parsing would actually reduce resulting functionality. Because if I want to calculate COUNT(*) grouped by country I could simply use the table "t" as a dataset. But if my goal was to actually aggregate my data before creating a viz I would not be able to do that anymore.

Right, but sometimes you aggregate your data too much before going to viz, or you realize only then that a dimension that was not included is also important, so it would be nice to be able to click a button and change that. I'm not sure how the UI/UX would work, or even if it's something users really want, but it's something we'd be able to do.

ETselikov commented 9 months ago

We've just realised that this parsing functionality would actually allow users to create temp tables in their virtual datasets and those temp table could be reused by charts running queries later. So this would be some kind of "dataset caching".

P.S. I'm the author of this PR this is just my personal account

TechAuditBI commented 7 months ago

@supersetbot orglabel

rusackas commented 6 months ago

This has now been brought up for official discussion via the dev@ list.

mistercrunch commented 6 months ago

Any solution that requires parsing SQL is bound to intricately fail at times. Given that another idea would be to have the user be explicit about what goes above the SELECT statement. You could imagine having two TEXTAREAs in the dataset editor, say one label pre-query and one labeled sub-query (we'd probably need a better explanation for this, but that's one option).

Another option would be to use hints and have the parse look for hints

--__PRE_QUERY_BLOCK__
DECLARE stuff
WITH crazy_sub_query AS (
  SELECT * from bananas
)
--__END_BLOCK__
SELECT * FROM crazy_sub_query

Then even today you could have the config.SQL_QUERY_MUTATOR do something like (this was GPT-generated, didn't take time to validate)

import re

def extract_and_move_block(sql):
    # Define the block markers
    start_marker = "--__PRE_QUERY_BLOCK__"
    end_marker = "--__END_BLOCK__"

    # Find the start and end indices of the block
    start_index = sql.find(start_marker)
    end_index = sql.find(end_marker)

    # If both markers are found
    if start_index != -1 and end_index != -1:
        # Extract the block content
        block_content = sql[start_index + len(start_marker):end_index].strip()

        # Remove the block from the original SQL
        sql_without_block = sql[:start_index] + sql[end_index + len(end_marker):]

        # Move the block to the top of the SQL
        new_sql = start_marker + "\n" + block_content + "\n" + end_marker + "\n" + sql_without_block

        return new_sql
    else:
        return sql

# Example usage
sql_input = """
--__PRE_QUERY_BLOCK__
DECLARE stuff
WITH crazy_sub_query AS (
  SELECT * from bananas
)
--__END_BLOCK__
SELECT * FROM crazy_sub_query
"""

new_sql = extract_and_move_block(sql_input)
print(new_sql)

rusackas commented 3 months ago

Does this need to be put up for a VOTE? Let me know if you think it's ready, and I'll be happy to do so.

mistercrunch commented 1 month ago

From my perspective the SIP is lacking important details. Allowing people to run whatever is a pre-query block might not be something that administrators want to allow.

CREATE TABLE IF NOT EXISTS super_massive_unefficient_analyst_query_that_scans_a_petabyte_table AS 
SELECT * FROM massive_petabyte_table;

For me to vote positive, there should be a database configuration flag (off by default) that's called something like ENABLE_PRE_QUERY_BLOCK_FOR_VIRTUAL_DATASETS: false (could be in extra_metdata json blob).

Or maybe it fits in here:

Oh, that an a better separator, maybe --__QUERY_START__

mcdogg17 commented 1 month ago

We have implemented this SIP in our PR. https://github.com/apache/superset/pull/30518

rusackas commented 4 weeks ago

On either the SIP or the PR, could you add the details requested by others? The PR is quite helpful in moving the discussion forward, but we need more PR description (examples, edge cases, etc) and on the SIP we need more discussion addressing @mistercrunch 's comments and considering alternatives (e.g. running a Saved Query as a pre-flight step for a virtual dataset).

Hopefully with the above taken care of, we can consider this ready for a Vote and get the PR merged.

apache / superset