Open TechAuditBI opened 10 months ago
@TechAuditBI ideally from a UI/UX perspective it would be great if there was only one SQL field which the backend would parse—be that via `sqlparse` or `sqlglot`—to extract the pre-SQL.
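As a rough sketch of what that backend split could look like, assuming `sqlglot` (the helper name and the "last statement is the SELECT" heuristic are illustrative, not an existing API):

```python
import sqlglot


def split_pre_sql(sql: str, dialect: str = "postgres") -> tuple[str, str]:
    # Parse the multi-statement query; treat everything before the last
    # statement as pre-SQL and assume the last statement is the SELECT
    # that Superset wraps as the virtual dataset. Dialect-specific
    # statements (e.g. DECLARE) may need extra handling.
    statements = sqlglot.parse(sql, read=dialect)
    if not statements:
        return "", sql
    *pre, final = statements
    pre_sql = ";\n".join(stmt.sql(dialect=dialect) for stmt in pre if stmt is not None)
    return pre_sql, final.sql(dialect=dialect)


# e.g. split_pre_sql("CREATE TEMP TABLE tmp AS SELECT 1 AS x; SELECT * FROM tmp")
# returns the CREATE statement as pre-SQL and the trailing SELECT separately.
```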
cc @betodealmeida
Yeah, I agree that automatically splitting the CTE from the query is a better UX.
In general, I think it would be good if Superset had an understanding of the virtual dataset when exploring it. For example, if I have a virtual dataset like this:
```sql
SELECT country, COUNT(*)
FROM t
GROUP BY 1
```
It would be nice if, when creating a chart, Superset parsed the query and understood that there are metrics and dimensions in it already. Maybe there's a mode that would automatically populate the UI so that `COUNT(*)` shows as a metric and `country` shows as a selected dimension. Today, Superset would try to build another aggregation on top of this aggregation, which maybe is what the user wants, but in my experience when going from SQL to viz you really want to explore the unaggregated query and be able to modify it in the UI.
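As a rough illustration of that idea, a minimal sketch assuming `sqlglot`; the helper and its heuristic (aggregate expressions become metric candidates, the rest become dimension candidates) are illustrative, not an existing Superset API:

```python
import sqlglot
from sqlglot import exp


def infer_metrics_and_dimensions(sql: str) -> tuple[list[str], list[str]]:
    # Aggregate expressions in the SELECT list become metric candidates,
    # everything else becomes a dimension candidate.
    select = sqlglot.parse_one(sql)
    metrics, dimensions = [], []
    for projection in select.expressions:  # items in the SELECT list
        if projection.find(exp.AggFunc):
            metrics.append(projection.sql())
        else:
            dimensions.append(projection.sql())
    return metrics, dimensions


# e.g. for "SELECT country, COUNT(*) FROM t GROUP BY 1" this yields
# (["COUNT(*)"], ["country"])
```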
Using a parser to move the CTE around when exploring the virtual dataset sounds like a good first step in that direction.
To me, the bigger problem of requiring a `pre_sql` is that it would make it cumbersome to go directly from SQL Lab to visualization, requiring the user to manually split their query.
> It would be nice if, when creating a chart, Superset parsed the query and understood that there are metrics and dimensions in it already. Maybe there's a mode that would automatically populate the UI so that `COUNT(*)` shows as a metric and `country` shows as a selected dimension. Today, Superset would try to build another aggregation on top of this aggregation, which maybe is what the user wants, but in my experience when going from SQL to viz you really want to explore the unaggregated query and be able to modify it in the UI.
@betodealmeida Speaking of parsing a query into metrics and dimensions: I think that level of parsing would actually reduce the resulting functionality. If I want to calculate `COUNT(*)` grouped by `country`, I could simply use the table `t` as a dataset. But if my goal was to actually aggregate my data before creating a viz, I would not be able to do that anymore.
@rusackas @john-bodley I've updated the description. Please check whether I've skipped any important parts.
> Speaking of parsing a query into metrics and dimensions: I think that level of parsing would actually reduce the resulting functionality. If I want to calculate `COUNT(*)` grouped by `country`, I could simply use the table `t` as a dataset. But if my goal was to actually aggregate my data before creating a viz, I would not be able to do that anymore.
Right, but sometimes you aggregate your data too much before going to viz, or you realize only then that a dimension that was not included is also important, so it would be nice to be able to click a button and change that. I'm not sure how the UI/UX would work, or even if it's something users really want, but it's something we'd be able to do.
We've just realised that this parsing functionality would actually allow users to create temp tables in their virtual datasets, and those temp tables could be reused later by charts running queries. So this would be some kind of "dataset caching".
P.S. I'm the author of this PR; this is just my personal account.
@supersetbot orglabel
This has now been brought up for official discussion via the dev@ list.
Any solution that requires parsing SQL is bound to fail in intricate ways at times. Given that, another idea would be to have the user be explicit about what goes above the SELECT statement. You could imagine having two TEXTAREAs in the dataset editor, say one labeled `pre-query` and one labeled `sub-query` (we'd probably need a better explanation for this, but that's one option).
Another option would be to use hints and have the parser look for them:
```sql
--__PRE_QUERY_BLOCK__
DECLARE stuff
WITH crazy_sub_query AS (
    SELECT * FROM bananas
)
--__END_BLOCK__
SELECT * FROM crazy_sub_query
```
Then even today you could have `config.SQL_QUERY_MUTATOR` do something like this (this was GPT-generated, I didn't take time to validate it):
```python
def extract_and_move_block(sql):
    # Define the block markers
    start_marker = "--__PRE_QUERY_BLOCK__"
    end_marker = "--__END_BLOCK__"

    # Find the start and end indices of the block
    start_index = sql.find(start_marker)
    end_index = sql.find(end_marker)

    # If both markers are found
    if start_index != -1 and end_index != -1:
        # Extract the block content
        block_content = sql[start_index + len(start_marker):end_index].strip()
        # Remove the block from the original SQL
        sql_without_block = sql[:start_index] + sql[end_index + len(end_marker):]
        # Move the block (markers included) to the top of the SQL
        new_sql = start_marker + "\n" + block_content + "\n" + end_marker + "\n" + sql_without_block
        return new_sql
    else:
        return sql


# Example usage
sql_input = """
--__PRE_QUERY_BLOCK__
DECLARE stuff
WITH crazy_sub_query AS (
    SELECT * from bananas
)
--__END_BLOCK__
SELECT * FROM crazy_sub_query
"""

new_sql = extract_and_move_block(sql_input)
print(new_sql)
```
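For completeness, a minimal sketch of wiring that helper into `superset_config.py`, assuming the `**kwargs` form of `SQL_QUERY_MUTATOR` (the exact signature has varied across Superset versions):

```python
# superset_config.py
# Assumes the **kwargs form of SQL_QUERY_MUTATOR; older Superset versions
# passed explicit arguments such as user_name and database instead.
def SQL_QUERY_MUTATOR(sql, **kwargs):
    return extract_and_move_block(sql)
```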
Does this need to be put up for a VOTE? Let me know if you think it's ready, and I'll be happy to do so.
From my perspective the SIP is lacking important details. Allowing people to run whatever they want in a pre-query block might not be something that administrators want to allow. For example:
```sql
CREATE TABLE IF NOT EXISTS super_massive_unefficient_analyst_query_that_scans_a_petabyte_table AS
SELECT * FROM massive_petabyte_table;
```
For me to vote positive, there should be a database configuration flag (off by default) called something like `ENABLE_PRE_QUERY_BLOCK_FOR_VIRTUAL_DATASETS: false` (could be in the extra metadata JSON blob).
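If that route were taken, a hedged sketch of how such a per-database opt-in could be read; the flag name and its home in the database's extra JSON blob are assumptions, not existing Superset options:

```python
import json


# Hypothetical guard: only apply the pre-query block when the database's
# extra JSON opts in. The flag name is illustrative, not an existing option.
def pre_query_block_enabled(database_extra_json=None):
    extra = json.loads(database_extra_json or "{}")
    return bool(extra.get("ENABLE_PRE_QUERY_BLOCK_FOR_VIRTUAL_DATASETS", False))


# e.g. pre_query_block_enabled('{"ENABLE_PRE_QUERY_BLOCK_FOR_VIRTUAL_DATASETS": true}') -> True
```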
Or maybe it fits in here:
Oh, that and a better separator, maybe `--__QUERY_START__`.
We have implemented this SIP in our PR. https://github.com/apache/superset/pull/30518
On either the SIP or the PR, could you add the details requested by others? The PR is quite helpful in moving the discussion forward, but we need more PR description (examples, edge cases, etc.), and on the SIP we need more discussion addressing @mistercrunch's comments and considering alternatives (e.g. running a Saved Query as a pre-flight step for a virtual dataset).
Hopefully with the above taken care of, we can consider this ready for a Vote and get the PR merged.
[SIP-115] Proposal for Pre SQL block for virtual datasets
Motivation
Currently, a huge piece of SQL functionality can't be used in virtual datasets because the dataset query is placed into the FROM section, following the pattern `FROM ("SQL query") AS virtual_table`. This prevents plenty of SQL dialects from using non-SELECT statements in a dataset query, including, for example, temp tables, variables, or WITH statements. For example, PostgreSQL supports WITH statements inside a SELECT statement, but MS SQL does not. It would also be very useful to be able to actually create temporary tables in a dataset query, but as far as I know there is no SQL dialect that supports CREATE statements inside SELECT ones.
P.S. Actually, the whole idea is to make it so that any query that can be completed in SQL Lab is also a valid query for a virtual dataset.
Proposed Change
The proposed change is to parse the SQL statement under the hood to distinguish the actual final SELECT statement from all the preceding auxiliary statements, and to put the auxiliary part at the beginning of every query built by viz plugins later.
So the user should be able to use a query that mixes auxiliary statements with a final SELECT as a virtual dataset query, and everything above the split point should at least be automatically put at the beginning of every query generated by viz plugins.
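A minimal sketch of that prepending step (names are illustrative, not taken from the linked PR):

```python
def build_chart_query(pre_sql: str, viz_query: str) -> str:
    # Prepend the auxiliary statements to the query a viz plugin generated
    # against the virtual dataset's final SELECT.
    if not pre_sql.strip():
        return viz_query
    return f"{pre_sql.rstrip().rstrip(';')};\n{viz_query}"


# e.g. build_chart_query("CREATE TEMP TABLE tmp AS SELECT 1 AS x",
#                        "SELECT COUNT(*) FROM (SELECT * FROM tmp) AS virtual_table")
```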
New or Changed Public Interfaces
None should be needed.
New dependencies
To be filled.
Migration Plan and Compatibility
To be filled.
Rejected Alternatives
Rejected due to overcomplicating the UI while also making it less intuitive and user-friendly.