NicolasHug / Surprise

A Python scikit for building and analyzing recommender systems
http://surpriselib.com
BSD 3-Clause "New" or "Revised" License

Add sample_weight option #170

Open martincousi opened 6 years ago

martincousi commented 6 years ago

How hard would it be to modify the current algorithms to add a sample_weight option to the fit method, as in sklearn (e.g., LinearRegression.fit)? Is it just a matter of changing the update rule, and can all algorithms handle such a parameter?

By default, the sample weight would be 1 for all trainset observations, and different ways of computing these weights would then be available (e.g., propensity scores). For an example of why these weights are useful, see [1]. These weights can be computed using only the implicit rating matrix, or using available user/item features.
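For reference, here is what that interface looks like in scikit-learn (the weights default to 1 when omitted):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([1.0, 2.0, 2.5, 4.5])

# Per-observation weights: the last two points count twice as much
# as the first two in the least-squares fit.
weights = np.array([1.0, 1.0, 2.0, 2.0])

reg = LinearRegression()
reg.fit(X, y, sample_weight=weights)  # weights default to 1 when omitted
```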

[1] T. Schnabel, A. Swaminathan, A. Singh, N. Chandak, and T. Joachims, “Recommendations as Treatments: Debiasing Learning and Evaluation,” in Proceedings of the 33rd International Conference on Machine Learning, 2016.

NicolasHug commented 6 years ago

Given the differences between Surprise and scikit-learn, I think the analogous way of passing the weights into the fit method would be for us to include them in the trainset object.
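Something like this, just to sketch the idea (purely hypothetical, none of these names exist in surprise):

```python
# Purely hypothetical sketch -- nothing of this exists in surprise.
# The weights would be stored alongside the ratings, and algorithms
# could look them up in fit().

class WeightedTrainset:
    def __init__(self, ratings, sample_weight=None):
        # ratings: list of (user, item, rating) tuples
        self.ratings = ratings
        # default: every observation has weight 1
        if sample_weight is None:
            sample_weight = [1.0] * len(ratings)
        self.sample_weight = sample_weight

    def all_ratings(self):
        # yield (user, item, rating, weight) tuples
        for (u, i, r), w in zip(self.ratings, self.sample_weight):
            yield u, i, r, w
```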

But as far as I understand, the weights are mostly used for computing evaluation metrics (which in turn can be used as an optimization criterion for training a model)?

martincousi commented 6 years ago

So, sample weights could either be passed by the user to the trainset object, or computed in the trainset object according to one of several options. This might also need to be an option of cross_validate?

Then, the algorithm would use these sample weights (if it can use them) in its fit method. Should a boolean option use_sample_weight be defined in the algorithm's __init__ method? For an SGD-based algorithm like SVD, I imagine the weight would simply scale the error term in each update; see the sketch below.
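A rough sketch of what I mean (this is not actual surprise code, and all the variable names are made up; with w == 1 everywhere this reduces to the usual unweighted update):

```python
import numpy as np

# Hypothetical sketch of one epoch of weighted SGD for an SVD-style model.
n_users, n_items, n_factors = 100, 50, 20
lr, reg = 0.005, 0.02
global_mean = 3.5
bu, bi = np.zeros(n_users), np.zeros(n_items)
pu = np.random.normal(0, 0.1, (n_users, n_factors))
qi = np.random.normal(0, 0.1, (n_items, n_factors))

def sgd_epoch(weighted_ratings):
    # weighted_ratings: iterable of (u, i, r, w) tuples
    for u, i, r, w in weighted_ratings:
        puu, qii = pu[u].copy(), qi[i].copy()
        err = r - (global_mean + bu[u] + bi[i] + qii @ puu)
        # the weight scales the gradient of the squared-error term
        bu[u] += lr * (w * err - reg * bu[u])
        bi[i] += lr * (w * err - reg * bi[i])
        pu[u] += lr * (w * err * qii - reg * puu)
        qi[i] += lr * (w * err * puu - reg * qii)
```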

NicolasHug commented 6 years ago

I'm not completely clear on what the weights are for TBH.

Is it just for computing error / accuracy metrics? In that case the changes could be minimal and restricted to the accuracy module.
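For instance, a weighted RMSE would be something like this (just a sketch; the weights argument is hypothetical, and predictions are the usual (uid, iid, true rating, estimate, details) tuples):

```python
import numpy as np

def weighted_rmse(predictions, weights):
    # predictions: list of (uid, iid, true_r, est, details) tuples
    # weights: dict mapping (uid, iid) to a sample weight (hypothetical)
    num = sum(weights[(uid, iid)] * (true_r - est) ** 2
              for uid, iid, true_r, est, _ in predictions)
    denom = sum(weights[(uid, iid)] for uid, iid, *_ in predictions)
    return np.sqrt(num / denom)
```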

The algorithms that are currently implemented do not support a sample weight (to my knowledge). So unless there are some new algorithms you want to implement that do require such a parameter, I don't really see the value of adding it.

> This might also need to be an option of cross_validate?

Yes, and that might be tricky. As far as I understand it, scikit-learn does not even have a way to deal with sample weights in cross-validation.
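The tricky part is that every split would have to partition the weights in sync with the ratings, roughly like this (a sketch, not an actual API):

```python
import numpy as np

# Sketch only: whenever the ratings are partitioned into folds, the
# weights must be partitioned with the exact same indices.
def kfold_with_weights(ratings, weights, n_splits=5, rng=None):
    rng = rng or np.random.default_rng()
    indices = rng.permutation(len(ratings))
    for fold in np.array_split(indices, n_splits):
        test_idx = set(fold)
        train = [(ratings[i], weights[i]) for i in indices if i not in test_idx]
        test = [(ratings[i], weights[i]) for i in fold]
        yield train, test
```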