dwreeves / dbt_linreg

Linear regression in SQL using dbt
MIT License
64 stars 2 forks source link

`covar_pop` not supported in all databases. #17

Open dwreeves opened 3 months ago

dwreeves commented 3 months ago

Redshift does not handle either covar or covar_pop, which means default__regress doesn't work. There may be other databases where this is true.

Current solution

The best solution is to use the chol method, which never relies on the default__regress macro. So the problem is fixed by simply using the suggested method for calculating regression coefficients.

I will make a note in the README about this.

If a user utilizes the fwl method, in the case of databases that do not support covar_pop, a compilation error should be raised, since it is not possible to calculate without two select statements. The compilation error should suggest using chol.

How to fix for real

This is actually fixable.

The problem is, frankly, I don't have the wherewithal these days to support this project in excruciating depth. (Sorry, I'm busy on other things.) Because this issue is not something that should impact Redshift users unless they deviate from the suggested method of using chol, I don't consider this critical enough to spend time on.

If anyone reading this wants to fix it themselves, there is a solution if you are willing to contribute a PR. For databases that do not support covar_pop, an additional preprocessing step needs to be done. It looks kinda like this:

with

step1 as (

  select
    grp,
    avg(x1) over (partition by grp) as avg_x1,
    avg(y) over (partition by grp) as avg_y,
    x1,
    y
  from data

),

step2 as (

  select
    grp,
    avg((x1 - avg_x1) * (y - avg_y)) / var_pop(x1) as x1_slope,
    avg(y) - avg(x1) * avg((x1 - avg_x1) * (y - avg_y)) / var_pop(x1) as intercept
  from step1
  group by grp

)

select *
from step2

Basically, it is possible to calculate covar_pop, but only by using two select statements, the first one calculating the average of the column.

That said, the code is a mess (in fairness to past me, how could it not be? This is jinja2 + SQL abuse dialed up to 11). And this preprocessing step should only occur for databases where covar_pop is not a supported method. I will not accept a PR that adds the window funcs to dbs that actually support covar_pop as it is a superfluous calculation in those cases. Adding this will be a lot of work, and I don't recommend diving down this rabbit hole. Please consider just using chol, it will make everyone's life easier. 😄 But, if you are insane enough to add this and you meet all the requirements, I salute you and I will get your change in.