Open tchell opened 3 years ago
@tchell Hey Tanner! Could you please elaborate on your use case? It seems the example is oversimplified here.
OK. I'm not totally sure what more there is to expand on. I'm pretty sure my example follows almost exactly what the docs indicate, and it seems like the SQL generation is incorrect, since the filter is applied twice. Although the example provided is not exactly my use case, I think it is sufficient to explain the error, is it not?
My use case is that I'm trying to filter a cube by first joining with another that identifies if access is allowed to a given item in the cube:
cube('Order', {
  sql: `SELECT order.*, user.email as user_email
        FROM order JOIN user ON order.user_id = user.id
        WHERE ${FILTER_PARAMS.Order.user_email.filter('user.email')}`,
  ...
  dimensions: {
    user_email: {
      type: `string`,
      sql: `user_email`,
    },
    ...
  },
});
// query
{
  measures: ['Order.amount'],
  filters: [
    {
      member: 'Order.user_email',
      operator: 'equals',
      values: ['example@email.com'],
    },
  ],
}
In this case the SQL generated looks like:
SELECT sum(order_filtered.amount)
FROM
(
  SELECT order.*, user.email as user_email
  FROM order JOIN user ON order.user_id = user.id
  WHERE user.email = $1
) as order_filtered
WHERE order_filtered.user_email = $2
So the filter is being applied to the query twice when it should ideally be applied only once. Is this clear enough? Sorry if not; I'm honestly not sure what more information you need here.
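To make the behaviour concrete for anyone reading along, here is a simplified model of how `FILTER_PARAMS.<cube>.<member>.filter()` renders. This is a hypothetical stand-in, not the real Cube.js implementation: when the incoming query has a matching filter, `.filter(column)` renders a predicate on that column; when it doesn't, it renders the tautology `1 = 1`.

```javascript
// Simplified stand-in for FILTER_PARAMS rendering (illustrative only,
// not the actual Cube.js implementation).
function makeFilterParam(values) {
  return {
    filter(column) {
      if (!values || values.length === 0) {
        return '1 = 1'; // no matching filter in the query: no-op predicate
      }
      // Real Cube.js binds the values as query parameters; we just
      // inline a placeholder here to show the shape of the output.
      return `${column} = $1`;
    },
  };
}

console.log(makeFilterParam(['example@email.com']).filter('user.email')); // user.email = $1
console.log(makeFilterParam([]).filter('user.email')); // 1 = 1
```

The duplication in question is that Cube.js renders this predicate inside the cube's `sql` (via `FILTER_PARAMS`) and then also appends its usual outer `WHERE` on the dimension, so the same condition appears twice in the generated query.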
@tchell Hey Tanner! This join detail makes much more sense. Why does this outer filter bother you? Is it because of performance concerns?
Yeah, realistically it shouldn't produce any different results. Likely the only problem would be performance-related (if that). But to me it just seems like an odd thing that would ideally be removed, since it's unnecessary and confusing when you see it.
@tchell That makes sense. It's for sure redundant, but most databases just push it down and execute it only once. For those that don't, you should most probably use pre-aggregations if performance is a concern. We also have an experimental rewriteQueries feature that parses SQL and pushes down filters by rewriting the SQL. It could be extended to meet this requirement of not having a duplicated filter.
If you are interested in working on this issue, please leave a comment below and we will be happy to assign the issue to you. If this is the first time you are contributing a Pull Request to Cube.js, please check our contribution guidelines. You can also post any questions while contributing in the #contributors channel in the Cube.js Slack.
Just wanted to follow up with something new that might make this a bit more important. If you are joining the cube with another, this will cause potentially undesirable behaviour. Every join in Cube.js is done with a LEFT JOIN to allow null values to be included in the results. However, the duplicated filter does not allow that to happen, because the second added clause will remove any null values. I'd argue this is undesirable because the point of using FILTER_PARAMS in the cube's sql definition seems to be to filter only the underlying cube before any joins, after which the normal query behaviour should apply. I can try working on this if I can find some time. But if anyone reading this happens to know how to discover what the FILTER_PARAMS are, I'd appreciate it, since in my initial look through the code I haven't seen an obvious way yet...
Hi. Is there any method in the Cube configuration to remove the duplicate WHERE statement, as described in this issue? I'm facing the same duplicate WHERE clause as well.
Is there currently a way to remove these duplicate clauses and optimize the query?
@ChrisLahaye Are you sure that the generated query is not optimal? Could you please provide more details?
As @paveltiunov mentioned in his comment above, most databases and data warehouses push the outer filter down and perform filtering only once, so there's no actual performance hit or "query deoptimization" happening.
Hey @igorlukanin, we are using Cube.js on top of ClickHouse, which has a PostgreSQL connection in it (in the example, it's the facts db). To optimize the cubes that have static dimensions and metrics, which our dashboard tables use, we had written something like this:
SELECT
  "apm".*,
  "stats".*,
  ${mockTime(FILTER_PARAMS, 'advertiserProductMetrics')} as time
FROM facts.advertiser_product_metrics "apm"
LEFT JOIN (
  ${advertiserProductMetricSubquery(
    FILTER_PARAMS,
    'advertiserProductMetrics',
    ['campaign_id', 'product_resource_id']
  )}
) AS "stats" ON toInt64("apm".campaign_id) = toInt64("stats".campaign_id)
  AND "apm".resource_id = "stats".product_resource_id
There is this line in the advertiserProductMetricSubquery
WHERE ${filter_params[viewName].time.filter('time')}
And this helps us filter the events in the ClickHouse table before they are joined with the PostgreSQL tables. When using subqueries like this, pushdown on the ClickHouse side is not really optimal. But to make that work, we had to add this mockTime function, which always needs to evaluate to true for the given time filter on ClickHouse. So we wrote this:
exports.mockTime = (filter_params, viewName) => {
  // No time filter in the query: FILTER_PARAMS renders the tautology '1 = 1'
  if (filter_params[viewName].time.filter('time') == '1 = 1') {
    return `1`;
  }
  // Otherwise, build a time expression that always satisfies the filter
  return `addSeconds(${filter_params[viewName].time.filter(
    (from, to) => 'parseDateTimeBestEffort(' + from + ')'
  )}, 1)`;
};
Which is broken with the following commit; https://github.com/cube-js/cube/commit/736070d5a33c25d212420a61da6c98a4b4d31188
So if there were an option to disable the duplicated WHERE clause, there would be no need to add such a hideous function.
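To illustrate why the `'1 = 1'` check in mockTime matters, here is a small self-contained sketch of its two branches. The `stubTimeParam` helper below is purely illustrative (in Cube.js these objects are provided by the framework, not constructed by hand):

```javascript
// Stubbed filter_params for illustration only. We model .filter()
// accepting either a column name (string) or a callback receiving
// the time-range bounds, mirroring how FILTER_PARAMS is used above.
function stubTimeParam(from, to) {
  return {
    time: {
      filter(arg) {
        if (from === undefined) return '1 = 1'; // no time filter in the query
        if (typeof arg === 'function') return arg(from, to);
        return `${arg} >= ${from} AND ${arg} <= ${to}`;
      },
    },
  };
}

// Same shape as the mockTime helper from the comment above.
function mockTime(filterParams, viewName) {
  if (filterParams[viewName].time.filter('time') === '1 = 1') {
    return '1'; // no filter: the mock column is a constant
  }
  return `addSeconds(${filterParams[viewName].time.filter(
    (from) => 'parseDateTimeBestEffort(' + from + ')'
  )}, 1)`;
}

const filtered = { view: stubTimeParam("'2023-01-01'", "'2023-01-31'") };
const unfiltered = { view: stubTimeParam() };

console.log(mockTime(unfiltered, 'view')); // 1
console.log(mockTime(filtered, 'view'));   // addSeconds(parseDateTimeBestEffort('2023-01-01'), 1)
```

The mock column produced this way always falls inside the requested time range, so the duplicated outer time filter becomes a no-op instead of dropping rows.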
Is there currently a way to remove these duplicate clauses? In my case:
select
  accountId,
  count(distinct uuid) interactions,
  toDate(${SQL_UTILS.convertTz('created_datetime')}) createdDate
from my_table
WHERE
  ${FILTER_PARAMS.Dataset.createdDatetime.filter('created_datetime')} AND
  ${FILTER_PARAMS.Dataset.eventType.filter('event_type')} AND
  ${FILTER_PARAMS.Dataset.channel.filter('channel')}
group by
  accountId, createdDate
In this case I really don't want the second filter to be applied, because no such columns exist after the GROUP BY in the cube that I extended from the other cube.
@sufiyangorgias The way I do it is to wrap this in a subquery and create mock columns that make the duplicated filters always evaluate to TRUE. For example, you can write something like:
SELECT
  sq.*,
  ${FILTER_PARAMS.Dataset.eventType.filter('event_type')} as event_type,
  ${FILTER_PARAMS.Dataset.createdDatetime.filter('created_datetime')} as created_datetime,
  ${FILTER_PARAMS.Dataset.channel.filter('channel')} as channel
FROM
  (your query as subquery) as sq
But Cube.js receives the timezone and converts the date and filters to the relevant timezone, so I don't think you need to use that convertTz.
Hey @casab Haha, that's a clever workaround! Thanks a bunch for the tip, I'll be sure to give it a try. You're a real SQL ninja!
Regarding this
But cubejs receives timezone, and converts the date and filters to relevant timezone. I don't think you need to use that convertTz.
I can't remove the timezone conversion because it affects the date's accuracy due to my setup. However, I'm interested in preserving the granularity feature provided by the time dimension filters. Do you know if there is a way to bypass the parent timezone conversion while still utilizing these time dimension filters? Your insights would be greatly appreciated.
Is there any news or plans on this? At least give us the ability to delete it from filters after use.
@tchell I've also had to do something similar to the above where we have more complex filters, so that I can implement a geospatial query filter with user input values (for the WHERE clause, for now).
The below is our crude solution for the REST and GraphQL (untested) APIs only.
Enabling a set of pass-through functions for query pushdown that align with each database's syntax would also be amazing for the Cube product.
In the long term, aliasing/normalising the Postgres SQL syntax to the SQL syntax of the connected database for common functions would be even better!
Here's my example.
cube(`Towns`, {
  sql: `SELECT Towns.*,
    ${FILTER_PARAMS.Towns.geometry_point_intersects_filter.filter((value) => {
      return `CAST(${value} AS TEXT)`;
    })} as "geometry_point_filter"
    FROM (
      SELECT * FROM "towns"
      WHERE
      ${FILTER_PARAMS.Towns.geometry_point_intersects_filter.filter(
        (geometry_point_intersects_filter) => {
          return `ST_Intersects("geometry_point", ST_GeomFromGeoJSON(${geometry_point_intersects_filter}))`;
        }
      )}
    ) as Towns
  `,
  ...
});
cube.geometry_point_filter is one of many special dimensions that have meta tags
geometry_point: {
sql: `${CUBE}.ST_AsGeoJSON(ST_Transform(${CUBE}."geom",4326))`,
type: `string`,
meta: {
type: `geometry`,
},
},
geometry_point_intersects_filter: {
sql: `${CUBE}."geometry_point_intersects_filter"`,
type: `string`,
meta: {
type: `geometry`,
filter: `Intersects`,
member: `geometry_point`,
},
},
In the frontend this links and exposes additional filter types to geometry_point that are not natively supported by Cube.
Note: this is from crude alpha code (and I've tried to clean up the examples in the editor, so there may be some small typos), but I hope there's a better solution that I've missed!
If there were an exclude/ignore filter option for WHERE/HAVING clauses, then the outer query, which requires the outer geometry_point_filter to be equal to the input string (i.e. it resolves to true: "input_string" = "input_string"), could be omitted. Note: this is a minimal example; I have many more input filter types.
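The "resolves to true" mechanism can be traced in miniature. This sketch (plain JavaScript, illustrative only, not Cube.js code) shows why the outer equality that the generated query appends always holds when the inner query echoes the filter value back as a mock column:

```javascript
// Miniature trace of the mock-column trick (illustrative only).
// The inner query exposes the user's filter value itself as a column,
// so the outer WHERE clause that Cube.js appends compares that value
// to itself and always succeeds.
const inputValue = '{"type":"Point","coordinates":[0,0]}'; // hypothetical GeoJSON input

// Inner query: the mock column simply casts the filter value to text.
const row = { geometry_point_filter: String(inputValue) };

// Outer query: Cube.js adds `WHERE geometry_point_filter = $1`
// with $1 bound to the same user input.
const outerFilterHolds = row.geometry_point_filter === inputValue;

console.log(outerFilterHolds); // true
```

The actual spatial filtering happens once, inside the subquery's ST_Intersects predicate; the tautological outer comparison just keeps the duplicated clause from discarding rows.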
What would be a far more scalable solution for Cube going forward, in my opinion, would be to allow for something like:
geometry_point: {
  sql: `ST_AsGeoJSON(ST_Transform(${CUBE}."geometry_point", 4326))`,
  type: `string`,
  filters: {
    disable: true,
    custom: [
      {
        name: `Intersects`,
        clause: `ST_Intersects(${CUBE}."geometry_point", ST_GeomFromGeoJSON(${ value }))`,
      },
    ],
  },
},
These are just my crude ideas, and ${ value } and the key names are definitely not explicit enough.
@igorlukanin and @paveltiunov, is this a feature you're already considering, or one that you may consider? I'd love to collaborate and refine a solution for this, as it seems like a key capability that many are seeking.
For various reasons, my team is currently scaled down, so we can't jump on it now, but we're about to scale back up and we're happy to do what we can to assist on this as we get closer to needing to roll it out on many models. I'll hopefully be able to throw a bit of dev support behind it when we're scaled back up and start building out the key features on top of this requirement early next year, if we can agree on a clear path forward.
It also may partially resolve: https://github.com/cube-js/cube/issues/3716
Describe the bug
When using FILTER_PARAMS to filter a cube, the attribute that is being filtered gets added to the SQL twice.
To Reproduce
Generates the following sql:
Expected behavior
OR (at least)
Version: "@cubejs-backend/postgres-driver": "^0.28.19", "@cubejs-backend/server": "^0.28.18", "@cubejs-backend/server-core": "^0.28.18"