AutoViML / Auto_ViML

Automatically Build Multiple ML Models with a Single Line of Code. Created by Ram Seshadri. Collaborators Welcome. Permission Granted upon Request.
Apache License 2.0

Sample Weight Support for Regression Problems #23

Closed (kmedved closed this issue 1 year ago)

kmedved commented 3 years ago

First, thank you for this very interesting library. I've tried it briefly and gotten very strong performance.

I wanted to ask whether it would be possible to add sample weight support for regression problems. In scikit-learn estimators this is typically done by passing a sample_weight parameter to fit, after X and y. LinearRegression, XGBoost, and CatBoost all support the same API, so I'm hopeful this is a fairly straightforward addition.
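
For readers unfamiliar with the convention, here is a minimal sketch of what a `fit(X, y, sample_weight=...)` signature can look like. `ToyWeightedLinearRegression` is a hypothetical, one-feature estimator written only for illustration; it is not scikit-learn's implementation and not anything in Auto_ViML. It solves weighted least squares in closed form.

```python
class ToyWeightedLinearRegression:
    """Hypothetical one-feature regressor, for illustration only."""

    def fit(self, x, y, sample_weight=None):
        n = len(x)
        # No weights supplied means every row counts equally.
        w = [1.0] * n if sample_weight is None else list(sample_weight)
        if not (len(y) == n == len(w)):
            raise ValueError("x, y and sample_weight must have the same length")

        total = sum(w)
        # Weighted means of the feature and the target.
        x_bar = sum(wi * xi for wi, xi in zip(w, x)) / total
        y_bar = sum(wi * yi for wi, yi in zip(w, y)) / total

        # Closed-form weighted least squares for a single feature.
        num = sum(wi * (xi - x_bar) * (yi - y_bar) for wi, xi, yi in zip(w, x, y))
        den = sum(wi * (xi - x_bar) ** 2 for wi, xi in zip(w, x))
        self.coef_ = num / den
        self.intercept_ = y_bar - self.coef_ * x_bar
        return self

    def predict(self, x):
        return [self.intercept_ + self.coef_ * xi for xi in x]


# The last row is an outlier; a weight of 0 removes its influence.
x = [0.0, 1.0, 2.0, 3.0]
y = [1.0, 3.0, 5.0, 100.0]
unweighted = ToyWeightedLinearRegression().fit(x, y)
weighted = ToyWeightedLinearRegression().fit(x, y, sample_weight=[1, 1, 1, 0])
```

With the outlier weighted to zero, the fit recovers the line through the first three points, while the unweighted fit is pulled strongly toward the outlier.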

Under the hood, it's typically just multiplying the loss for each row by that row's sample weight, so that certain observations count more than others. This can be very helpful for problems where you have data from sensors of differing quality, or for simply downweighting older observations.
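
As a rough sketch of that idea, the snippet below computes a weighted mean squared error and builds exponentially decaying weights for older rows. The function names `weighted_mse` and `recency_weights` and the `decay` parameter are assumptions made for this example. Normalizing by the sum of the weights is one common convention, not the only one.

```python
def weighted_mse(y_true, y_pred, sample_weight=None):
    """Squared error per row, scaled by that row's weight, then averaged."""
    n = len(y_true)
    w = [1.0] * n if sample_weight is None else list(sample_weight)
    if not (len(y_pred) == n == len(w)):
        raise ValueError("all inputs must have the same length")
    weighted_errors = [wi * (t - p) ** 2 for wi, t, p in zip(w, y_true, y_pred)]
    return sum(weighted_errors) / sum(w)


def recency_weights(n_rows, decay=0.5):
    """Weights for rows ordered oldest first; the newest row gets weight 1."""
    return [decay ** (n_rows - 1 - i) for i in range(n_rows)]


y_true = [1.0, 2.0]
y_pred = [1.0, 4.0]
equal = weighted_mse(y_true, y_pred)                 # both rows count equally
favor_first = weighted_mse(y_true, y_pred, [3, 1])   # well-predicted row counts 3x
weights = recency_weights(3, decay=0.5)
```

Upweighting the row that is predicted well lowers the reported loss, which is exactly the lever a model optimizer responds to during training.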

rsesha commented 3 years ago

Hi kmedved: You have brought up an interesting idea. Would you like to pass the sample_weight parameter to Auto_ViML, or would you like Auto_ViML to calculate sample_weight itself? Which would you prefer? Also, would you be able to pass it to Auto_ViML as an array? What if the target happens to be multi-label? Thanks for answering these questions. Ram

kmedved commented 3 years ago

That's an interesting question, @rsesha. I was envisioning sample_weight as something the user would pass themselves, in cases where they know the sample weight in advance (e.g., if your individual rows are already rolled-up observations, such that some rows represent 5 days of sensor data, other rows represent 3, etc.).
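
To make the rolled-up case concrete, here is a small sketch with made-up numbers in which each row is a multi-day average and the number of days serves as the weight. The name `weighted_mean` and the data are hypothetical. It shows that weighting by day count reproduces the statistic you would get from the original daily rows.

```python
def weighted_mean(values, weights):
    """Average of values where each value counts in proportion to its weight."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


# Each row: (average sensor reading, number of days it summarizes).
rolled_up = [(10.0, 5), (20.0, 3)]
readings = [reading for reading, _ in rolled_up]
days = [n_days for _, n_days in rolled_up]

# Weighting by day count recovers the mean of the original daily data.
from_rollup = weighted_mean(readings, days)

# The same data, un-aggregated: 5 days at 10.0 and 3 days at 20.0.
daily = [10.0] * 5 + [20.0] * 3
from_daily = sum(daily) / len(daily)

# Ignoring the weights treats both rows as equally informative.
naive = sum(readings) / len(readings)
```

Here the weighted and daily means agree, while the naive mean overstates the influence of the 3-day row.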

You're right, however, that in the time-series context, letting Auto_ViML calculate the appropriate weighting scheme could be valuable. That said, it's a pretty huge increase in complexity, both in computational terms and coding terms. So I think just letting the user pass an array, similar to how other scikit-learn estimators work, would make sense as an initial stage. I would think this could be done with relatively little effort, given that all the underlying estimators already support this functionality.

I hadn't realized Auto_ViML supported multi-target regression, but the simple answer is that in those cases the user can pass multiple arrays, similar to how they're passing the multiple labels. So if your data is 1000 rows and you're predicting 3 targets, the user would pass 3 columns of 1000 rows apiece for the sample weight. That's how CatBoost's MultiRMSE works: https://catboost.ai/docs/concepts/loss-functions-multiregression.html
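
A minimal sketch of that layout, assuming one weight column per target: the weights have the same rows-by-targets shape as the targets, and each target's loss uses its own column. The function `per_target_weighted_mse` is hypothetical and written only for this example; it does not reproduce CatBoost's or Auto_ViML's API.

```python
def per_target_weighted_mse(y_true, y_pred, sample_weight):
    """Weighted MSE per target; all inputs are row-major lists of equal shape."""
    n_rows = len(y_true)
    n_targets = len(y_true[0])
    for name, arr in (("y_pred", y_pred), ("sample_weight", sample_weight)):
        if len(arr) != n_rows or any(len(row) != n_targets for row in arr):
            raise ValueError(f"{name} must have shape ({n_rows}, {n_targets})")

    losses = []
    for t in range(n_targets):
        # Use column t of the weights for column t of the targets.
        num = sum(
            sample_weight[i][t] * (y_true[i][t] - y_pred[i][t]) ** 2
            for i in range(n_rows)
        )
        den = sum(sample_weight[i][t] for i in range(n_rows))
        losses.append(num / den)
    return losses


# Two rows, two targets; the second target trusts the second row 3x more.
y_true = [[1.0, 10.0], [2.0, 20.0]]
y_pred = [[1.0, 12.0], [4.0, 20.0]]
weights = [[1.0, 1.0], [1.0, 3.0]]
losses = per_target_weighted_mse(y_true, y_pred, weights)
```

Validating the shape up front catches the easy mistake of passing a single weight column when several targets are being predicted.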

rsesha commented 3 years ago

OK, then let me provide an input argument called sample_weights for Auto_ViML and let users like you pass it to the program. Let me work on it. Ram

AutoViML commented 1 year ago

Hi @kmedved 👍 Can you create a pull request and take a crack at writing some code? I don't have much time and would appreciate the help. Thanks, AutoVimal