cube-js / cube

πŸ“Š Cube β€” The Semantic Layer for Building Data Applications
https://cube.dev
Other
17.94k stars 1.78k forks source link

Using FILTER_PARAMS causes filters to be duplicated in WHERE clauses #3477

Open tchell opened 3 years ago

tchell commented 3 years ago

Describe the bug When using FILTER_PARAMS to filter a cube, the attribute that is being filtered gets added to the SQL twice.

To Reproduce

// cube.js
module.exports = {
  queryRewrite: (query, { ... }) => {
    query.filters.push({
      member: 'Order.closed',
      operator: 'equals',
      values: ['true'],
    });
    return query;
  }
};

// schema/Order.js
cube('Order, {
  sql: `SELECT * FROM order WHERE ${FILTER_PARAMS.Order.closed.filter('closed')}`,
  measures: {
    amount: {
      sql: 'amount',
      type: `sum`,
    },
  dimensions: {
    closed: {
      type: `boolean`,
      sql: `closed`,
   },
};

// query
{
  measures: 'Order.amount',
}

Generates the following sql:

SELECT sum(order.amount)
FROM
  (
    SELECT *
    FROM order
    WHERE order.closed = $1
  )
WHERE
  order.closed = $2

Expected behavior

SELECT sum(order.amount
    FROM order
    WHERE order.closed = $1

OR (at least)

SELECT sum(order.amount)
FROM
  (
    SELECT *
    FROM order
    WHERE order.closed = $1
  )

Version: "@cubejs-backend/postgres-driver": "^0.28.19", "@cubejs-backend/server": "^0.28.18", "@cubejs-backend/server-core": "^0.28.18"

paveltiunov commented 3 years ago

@tchell Hey Tanner! Could you please elaborate on your use case? It seems the example is oversimplified here.

tchell commented 3 years ago

OK. I'm not totally sure what more there is to expand on. I'm pretty sure my example is following almost exactly how the docs indicates. And it seems like the sql generation is incorrect, since the filter is being done twice. Although the example provided is not exactly my use case I think it is sufficient to explain the error is it not?

My use case is that I'm trying to filter a cube by first joining with another that identifies if access is allowed to a given item in the cube:

cube('Order', {
  sql: `SELECT order.*, user.email as user_email
          FROM order JOIN user ON order.user_id = user.id
          WHERE ${FILTER_PARAMS.Order.user_email.filter('user.email')}`,
  ...
  dimensions: {
    user_email: {
      type: `string`,
      sql: `user_email`,
    },
  ....
}

// query
{
  measures: 'Order.amount',
  filters: [
    {
      member: 'Order.user_email',
      operator: 'equals',
      values: ['example@email.com'],
    },
}

In this case the SQL generated looks like:

SELECT sum(order_filtered.amount)
FROM
  (
    SELECT order.*, user.email as user_email
    FROM order ON order.user_id = user.id
    WHERE user.email = $1
  ) as order_filtered
WHERE order_filtered.user_email = $2

So the filter is being applied to the query twice when it should ideally only be used once. Is this clear enough? Sorry if not, I'm honestly not really sure what more information you need here.

paveltiunov commented 3 years ago

@tchell Hey Tanner! This join detail makes much more sense. And why this outer filter bothers you? Is it because of performance concerns?

tchell commented 3 years ago

Yeah, realistically it shouldn't have any different results. Likely the only problem would be performance related (if that even). But to me it just seems like an odd thing that would ideally be removed since it's unnecessary and confusing when you see it.

paveltiunov commented 3 years ago

@tchell That makes sense. It's for sure redundant but most databases just push down it and execute only once. For those which doesn't you most probably should use pre-aggregations if performance is a concern. We also have an experimental rewriteQueries feature that parses SQL and pushes down filters by rewriting SQL. It can be extended to meet this requirement of not having duplicated filter.

github-actions[bot] commented 3 years ago

If you are interested in working on this issue, please leave a comment below and we will be happy to assign the issue to you. If this is the first time you are contributing a Pull Request to Cube.js, please check our contribution guidelines. You can also post any questions while contributing in the #contributors channel in the Cube.js Slack.

tchell commented 2 years ago

Just wanted to follow up with something new that might make this a bit more important. If are joining the cube with another this will cause potentially undesirable behaviour. Every join in cubejs is done with a left join to allow null values to be included in the results. However, this behaviour does not allow that to happen because the second added clause will remove any null values. I'd argue this is undesirable because it seems like the point of using FILTER_PARAMS in the cube sql definition is to only filter the underlying cube before any joins and then the normal query behaviour should I can try working on this if I can find some time. But if anyone reading this happens to know how to discover what the FILTER_PARAMS are I'd appreciate it, since in my initial look through I haven't seen the obvious way yet...

tomsseisums commented 1 year ago

Related: https://github.com/cube-js/cube.js/issues/640

absoluteharam commented 9 months ago

Hi. Is there any method in cube configuration to remove the duplicate where statement as stated in this issue? Facing the same duplicate where clause statement as well.

ChrisLahaye commented 8 months ago

Is there currently a way to remove these duplicate clauses and optimize the query?

igorlukanin commented 8 months ago

@ChrisLahaye Are you sure that the generated query is not optimal? Could you please provide more details?

As @paveltiunov mentioned in his comment above, most databases and data warehouses push the outer filter down and perform filtering only onceβ€”so, there's no actual performance hit or "query deoptimization" happening.

casab commented 8 months ago

Hey @igorlukanin, we are using Cubejs on top of Clickhouse which has a postgresql connection in it(in the ex, it's the facts db). To optimize the Cubes that has static dimensions and metrics which our dashboard tables uses, we had written something like this;

SELECT
  "apm".*,
  "stats".*,
  ${mockTime(FILTER_PARAMS, 'advertiserProductMetrics')} as time
  FROM facts.advertiser_product_metrics "apm"
    LEFT JOIN (
      ${advertiserProductMetricSubquery(
        FILTER_PARAMS,
        'advertiserProductMetrics',
        ['campaign_id', 'product_resource_id']
      )}
    ) AS "stats" ON toInt64 ("apm".campaign_id) = toInt64 ("stats".campaign_id) AND "apm".resource_id = "stats".product_resource_id

There is this line in the advertiserProductMetricSubquery

WHERE ${filter_params[viewName].time.filter('time')}

And this helps us filter the events in the Clickhouse table before they are joined with postgresql tables. When using subqueries like this, pushdown from Clickhouse side is not really optimal. But to make that work we had to add this mockTime function which always needs to yield to true for the given time filter on Clickhouse. So we wrote this;

exports.mockTime = (filter_params, viewName) => {
  if (filter_params[viewName].time.filter('time') == '1 = 1') {
    return `1`
  }
  return `addSeconds(${filter_params[viewName].time.filter(
    (from, to) => 'parseDateTimeBestEffort(' + from + ')'
  )}, 1)`
}

Which is broken with the following commit; https://github.com/cube-js/cube/commit/736070d5a33c25d212420a61da6c98a4b4d31188

So if there was an option to disable duplicated WHERE clause, there would be no need to add such a hideous function πŸ˜…

sufiyangorgias commented 8 months ago

Is there currently a way to remove these duplicate clauses, in my case

select 
        accountId,
        count(distinct uuid) interactions,
        toDate(${SQL_UTILS.convertTz('created_datetime')}) createdDate
        from my_table
        WHERE
            ${FILTER_PARAMS.Dataset.createdDatetime.filter(
                'created_datetime',
            )} AND
            ${FILTER_PARAMS.Dataset.eventType.filter(
                'event_type',
            )} AND
            ${FILTER_PARAMS.Dataset.channel.filter('channel')} 

        group by 
            accountId, createdDate

In this case I really dont want to apply the second filter, because there is no such columns exists after the group by that I extended from the other cube.

casab commented 8 months ago

@sufiyangorgias The way I do it is to wrap this in a subquery, and create mock columns that will always yield the duplicated filters to TRUE. For example you can write something like

SELECT
    sq.*,
    ${FILTER_PARAMS.Dataset.eventType.filter('event_type')} as event_type,
    ${FILTER_PARAMS.Dataset.createdDatetime.filter('created_datetime')} as created_datetime
    ${FILTER_PARAMS.Dataset.channel.filter('channel')}  as channel
FROM
    (your query as subquery) as sq

But cubejs receives timezone, and converts the date and filters to relevant timezone. I don't think you need to use that convertTz.

sufiyangorgias commented 8 months ago

Hey @casab Haha, that's a clever workaround! Thanks a bunch for the tip, I'll be sure to give it a try. You're a real SQL ninja! πŸ˜„πŸ™Œ.

Regarding this

But cubejs receives timezone, and converts the date and filters to relevant timezone. I don't think you need to use that convertTz.

I can't remove the timezone conversion because it affects the date's accuracy due to my setup. However, I'm interested in preserving the granularity feature provided by the time dimension filters. Do you know Is there a way to bypass the parent timezone conversion while still utilizing these time dimension filters? Your insights would be greatly appreciated.

casab commented 7 months ago

Is there any news or plans on this? At least give us the ability to delete it from filters after use.

simonedbarber commented 1 month ago

@sufiyangorgias The way I do it is to wrap this in a subquery, and create mock columns that will always yield the duplicated filters to TRUE. For example you can write something like

SELECT
    sq.*,
    ${FILTER_PARAMS.Dataset.eventType.filter('event_type')} as event_type,
    ${FILTER_PARAMS.Dataset.createdDatetime.filter('created_datetime')} as created_datetime
    ${FILTER_PARAMS.Dataset.channel.filter('channel')}  as channel
FROM
    (your query as subquery) as sq

But cubejs receives timezone, and converts the date and filters to relevant timezone. I don't think you need to use that convertTz.

@tchell I've also had to do similar to the above where we have more complex filters, so that I can implement Geospatial query filter with user input values (for the where clause for now).

The below is our crude solution for the REST and GraphQL (untested) APIs only.

Enabling a set of pass through functions for query pushdown, that aligns to each DB syntax would also be amazing for the cube product.

In the long term adding aliasing/normailsation the postgres SQL syntax to the SQL syntax of the connected DB for common functions would be even better!

Here's my example.

cube(`Towns`, {
  sql: `SELECT Towns.*,
  ${FILTER_PARAMS.Towns.geometry_point_intersects_filter.filter((value) => {
    return `CAST(${value} AS TEXT ) `;
  })} as "geometry_point_filter"
  FROM (
   SELECT * FROM "towns"
    WHERE 
    ${FILTER_PARAMS.Towns.geometry_point_intersects_filter.filter(
      (geometry_point_intersects_filter) => {
        return `ST_Intersects("geometry_point", ST_GeomFromGeoJSON( ${geometry_point_intersects_filter} ))`;
      }
    )}
  ) as Towns
  `,
  ...
  }
)

cube.geometry_point_filter is one of many special dimensions that have meta tags

geometry_point: {
  sql: `${CUBE}.ST_AsGeoJSON(ST_Transform(${CUBE}."geom",4326))`,
  type: `string`,
  meta: {
    type: `geometry`,
  },
},
geometry_point_intersects_filter: {
  sql: `${CUBE}."geometry_point_intersects_filter"`,
  type: `string`,
  meta: {
    type: `geometry`,
    filter: `Intersects`,
    member: `geometry_point`,
  },
},

in the frontend this links and exposes additional filter types to geometry_point not natively supported by cube.

Note: this is from crude alpha code (and i've tried to clean up examples in the editor so there may be some small typos), but I hope there's a better solution that I've missed!

If there was an exclude/ignore filter option for where/having clauses; then the outer query, requiring the outer geometry_point_filter to be equal to the input string (i.e. resolves to true "input_string"="input_string"), could be omitted. Note: this is a minimal example, I have many more input filter types.

What would be a far more scalable solution for cube going forward (in my opinion), would be to allow for:

  1. a rule to disable the default filters for a dimension/measure
  2. a new key for each dimension/measure to add custom filters

Something like:


    geometry_point: {
      sql: ` ST_AsGeoJSON(ST_Transform(${CUBE}."geometry_point",4326))`,
      type: `string`,
      filters: {
        disable: true
        custom:
          [
            {
              name: `Intersects`
              clause: `ST_Intersects(${CUBE}."geometry_point", ST_GeomFromGeoJSON( ${ value } ))`
            }
          ]
      }
    },

These are just my crude ideas, and ${ value } and the key names are definitely not explicit enough.

@igorlukanin and @paveltiunov is this a feature you're already considering, or that you may consider? I'd love to collaborate and refine a solution for this as It seems like a key capability that many are seeking.

For various reasons, my team is currently scaled down so can't jump on it now, but we're about to scale back up and we're happy to do what we can to assist on this as we get closer to needing to roll this out on many models. I'll hopefully be able to throw a bit of dev support behind it when we're scaled back up and start to build out the key features that will build on top of this requirement early next year if we can agreement on a clear path forward.

It also may partially resolve: https://github.com/cube-js/cube/issues/3716