Closed: joseduc10 closed this issue 3 years ago
That is correct: if you pass a validation set, it will treat those entries as if they were zeros in the training data, and will only consider non-zero entries in the validation set. If there were to be a correction, it would have to calculate separate aggregate item statistics for each user who is in the validation set (and likewise separate user statistics for each item), which would be much slower.
I see. Thank you for the explanation. Looking back at the paper by Gopalan, Hofman, and Blei, it looks like we would have to keep track of which (u, i) entries are in the validation set and NOT include the corresponding (\theta_{u}^T \beta_i) terms in the likelihood computation. I may try adding some bookkeeping to my local copy of your package.
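For what it's worth, here is a minimal sketch of what that bookkeeping could look like on dense toy matrices. The function name `train_loglik` and the argument names are hypothetical (not part of the package); it assumes strictly positive user factors `Theta` and item factors `Beta`, and simply masks out the held-out (u, i) entries so they contribute neither a count term nor a rate term:

```python
import numpy as np

def train_loglik(Theta, Beta, Y, val_pairs):
    # Poisson log-likelihood (dropping the constant log(y!) term),
    # evaluated ONLY on entries not held out for validation, so that
    # held-out (u, i) pairs are treated as missing rather than zero.
    # Theta: (n_users, k), Beta: (n_items, k), both strictly positive.
    # Y: dense (n_users, n_items) count matrix (toy sizes only).
    Mu = Theta @ Beta.T                 # Poisson rate for every (u, i)
    ll = Y * np.log(Mu) - Mu            # per-entry log-likelihood
    mask = np.ones(Y.shape, dtype=bool)
    for u, i in val_pairs:
        mask[u, i] = False              # drop held-out entries entirely
    return float(ll[mask].sum())
```

A real implementation would of course work on the sparse representation and fold this into the variational updates, but the masking idea is the same.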
Thank you for the quick response!
Hello David,
Thank you very much for sharing this module. It's been really useful.
I am trying to fit a model on a dataframe D with the option to optimize the likelihood on a validation set. The way I am preparing the validation set is by setting apart a random subset of rows of D; call this subset D'. Then my training set is D - D', and my validation set is D'.
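Concretely, the split I am doing looks like this (toy data; the column names `UserId`, `ItemId`, `Count` are just what I happen to use):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Toy interactions dataframe D: one row per observed (user, item, count) triple
D = pd.DataFrame({
    "UserId": [0, 0, 1, 1, 2, 2],
    "ItemId": [0, 1, 0, 2, 1, 2],
    "Count":  [3, 1, 2, 5, 1, 4],
})

# D' = random ~30% subset of rows held out for validation
val_mask = rng.random(len(D)) < 0.3
D_val = D[val_mask]        # validation set D'
D_train = D[~val_mask]     # training set D - D'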
Is this the intended way to use the package? To me it seems that, by removing rows from the training set, we are implicitly stating that these entries are 0 in the user-item matrix. I would expect the likelihood calculation to be modified to account for the fact that D' is missing data, not zeros, but when I read the code I couldn't find any such accounting.
As a follow-up question: wouldn't it be desirable to have both zero AND non-zero entries in the validation set? By not allowing zero entries, aren't we biasing the inference?
Thank you very much for reading!
Best,
Jose