dwreeves / dbt_linreg

Linear regression in SQL using dbt
MIT License
64 stars 2 forks source link

Coefficient returned as integer values #16

Closed steelcd closed 3 months ago

steelcd commented 3 months ago

Hello,

I'm testing the linear regression on a small data set and I'm finding the coefficients are being returned as integers. I'm running this on AWS Redshift.

These is a snapshot of the sql code:

`coefficients as ( select * from {{ dbt_linreg.ols( table = 'regression_prep', endog = 'cylinder_pressure', exog = ['days_since_first_calibration'], group_by = ['cylinder_id'], format = 'wide', format_options={'round': 5} ) }} ),

final as ( select a.cylinder_id, a.const as contstant, a.days_since_first_calibration as days_coeff, b.calibration_date as latest_test_date, b.earliest_test_date, b.cylinder_pressure as latest_cylinder_pressure, dateadd( days, (a.const/(-1 * a.days_since_first_calibration))::int ,b.earliest_test_date )::date as forecast_stockout from coefficients as a inner join latest_calibrations_by_cylinder_id as b on a.cylinder_id = b.cylinder_id where a.days_since_first_calibration != 0 )

select * from final`

image

Is there something I'm doing wrong or is this a limitation of redshift? Please let me know if there other info that would be useful.

dwreeves commented 3 months ago

Hi @steelcd, thanks for using the package and for reporting the issue!

That's odd. I don't have access to Redshift, so I can only take a guess. My guess is, perhaps the variable days_since_first_calibration is an integer?

In your regression_prep CTE, if the days_since_first_calibration variable is an integer, can you try casting it as a float or double? e.g. cast(days_since_first_calibration as double precision) as days_since_first_calibration.

(I'm weary to update the dbt_linreg code to cast values for users, at least by default / without their permission. I could add an option like cast_columns_as which handles it if there is any reason that manually casting is clunky.)

Let me know if that works. If that doesn't work, then I'll have to look into it a bit. 🤔

steelcd commented 3 months ago

Hi @dwreeves ,

Thanks for the quick reply!

I tried specifically casting the exog column, but it's still resulting in the same behavior. I'm using much smaller test data set now:

`with regression_prep as ( select 'cylID1'::text as cylinder_id, '1/1/2024'::date as calibration_date, 500::double precision as cylinder_pressure, '1/1/2024'::date as earliest_test_date, (datediff(days, '1/1/2024', '1/1/2024')+.001)::double precision as days_since_first_calibration

union all

select
    'cylID1'::text as cylinder_id,
    '3/1/2024'::date as calibration_date,
    400::double precision as cylinder_pressure,
    '1/1/2024'::date as earliest_test_date,
    (datediff(days, '1/1/2024', '3/1/2024')+.001)::double precision as days_since_first_calibration

),

coefficients as ( select * from {{ dbt_linreg.ols( table = 'regression_prep', endog = 'cylinder_pressure', exog = ['days_since_first_calibration'], group_by = ['cylinder_id'], method = 'chol', format = 'wide' ) }} )

select * from coefficients`

image

Is there anything in the macro you think I could update on my end to test if it might help?

Thanks!

dwreeves commented 3 months ago

@steelcd Thanks for the code example; that should be enough for me to work with.

This will require a little more investigation from me. Not having access to Redshift and not understanding it may be a barrier, just to be clear. I will try to see what I can do without that access. I will also not be able to get to this until the weekend.

I cannot think of anything in the options that could help, especially if round didn't work.

steelcd commented 3 months ago

@dwreeves ,

I did some fiddling around and found if I change this from numeric to float it provides the decimals. Not sure if that's helpful or would work in the broader scope of compatibility with other platforms.

utils.sql ''' {% macro postgres___mayberound(x, round) %} {% if round is not none %} {{ return('round((' ~ x ~ ')::numeric, ' ~ round ~ ')') }} {% else %} {{ return('(' ~ x ~ ')::float') }} {% endif %} {% endmacro %} '''

It looks like Redshift doesn't like numeric without specifying the scale/precision. image

Appreciate the help; really looking forward to implementing this to provide tailored forecasting.

dwreeves commented 3 months ago

Ahhhhhh! I hadn't even considered that this was realted to Redshift being based on Postgres and thus using the Postgres adapter. 🤦 I thought those would be different adapters entirely.

I can definitely fix this. If you don't mind me fixing it, I can do it; just give me a few days.

I think the fix will be:

It doesn't seem like you have much of a public profile but regardless, if you'd like, I can add you as a co-author to the git commit since you discovered the fix.

Thanks for looking into this and thanks for your help in making the library better.

steelcd commented 3 months ago

Hi @dwreeves ,

I think an adapter specific to redshift will work. From my testing, it appears the method for the default adapter works with redshift.

image

image

image

image

image

I can put through a PR for this.

dwreeves commented 3 months ago

thanks so much for your contribution!

Have fun running linear regression in SQL! 🎉