ankane / disco

Recommendations for Ruby and Rails using collaborative filtering
MIT License
565 stars · 11 forks

How to validate results? #15

Closed: felixding closed this 3 years ago

felixding commented 3 years ago

Background

Our website is a typical e-commerce site with around 700 orders each day. To boost our sales, we are trying to implement "people who bought this also bought XYZ".

Using Disco

Our current implementation is something like:

data = []

orders.each do |o|
  o.items.each do |i|
    # we actually also take item quantity into account, e.g. if a user
    # purchases 3 of item A, we add 3 entries to the data array
    data << {user_id: o.user_id, item_id: i.id}
  end
end

recommender.fit(data)

recommendations = recommender.item_recs(current_item, count: 10)

However, the results seem way off. Changing factors to 500 makes it better, but the results still don't make much sense.

Using Predictor

We also tried Predictor. The results look much better and seem logical to us. But to be honest, we don't know exactly how to validate them.

Validation

Please excuse my ignorance, but the only way I can think of is to count the item occurrences.

For example, if we want to find recommendations for item A:

  1. Get orders that have item A
  2. Get items from each order
  3. Count their occurrences
  4. Sort by occurrences

We then have a list of the most popular items sold along with item A.

This list does make sense to us in terms of people's purchase preferences.
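The four counting steps above can be sketched in plain Ruby. The orders here are illustrative (each order reduced to an array of item names), not our real data:

```ruby
# Illustrative orders: each order is an array of item names.
orders = [
  ["A", "B", "C"],
  ["A", "B"],
  ["A", "B"],
  ["A", "C"],
  ["B", "D"]
]

# Steps 1-4: take orders containing the item, tally the other items
# in those orders, then sort by occurrence count.
def co_purchased(orders, item, count: 10)
  tallies = Hash.new(0)
  orders.each do |items|
    next unless items.include?(item)
    (items - [item]).each { |other| tallies[other] += 1 }
  end
  tallies.sort_by { |_, n| -n }.first(count)
end

co_purchased(orders, "A") # => [["B", 3], ["C", 2]]
```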

Questions

A couple of questions regarding the process:

  1. Is the validation correct?
  2. If yes, then why are the results from Disco so off?
  3. If we can get recommendations by simply counting occurrences, why do we need a recommendation engine in the first place?

Thank you. And apologies for the super long issue.

ankane commented 3 years ago

Hey @felixding, I'm by no means an expert on recommendations, but will try my best to answer.

Evaluation

The approach described above will tend to recommend globally popular items, which probably isn't what you want.

For online evaluation, use an A/B test to compare different approaches.

For offline evaluation, there are a number of metrics for implicit feedback, but I believe they're typically used with user-based recommendations (user_recs method) rather than similar items (item_recs method).
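For instance, one common offline metric for implicit feedback is precision@k. A minimal pure-Ruby sketch (the recommendation and held-out lists below are hypothetical, not from a real model):

```ruby
# precision@k: the fraction of the top-k recommended items that
# appear in the user's held-out (test) purchases.
def precision_at_k(recommended, held_out, k: 10)
  top = recommended.first(k)
  return 0.0 if top.empty?
  (top & held_out).size.to_f / top.size
end

# Hypothetical top-4 recommendations vs. held-out purchases for one user.
precision_at_k(["A", "B", "C", "D"], ["B", "D", "E"], k: 4) # => 0.5
```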

Disco Results

I'd double check the training data is correct.

Also, only pass each user_id/item_id combination once (recommender.fit(data.uniq)). There's not a way to indicate the number of times a user ordered an item with Disco, but cmfrec has this functionality and a similar API (still only pass each user_id/item_id combination once, but include a value).

Depending on the complexity of your models, you may be able to get the data in a single query:

# Disco
data =
  Order.joins(:items).distinct.pluck("orders.user_id", "items.id").map do |user_id, item_id|
    {user_id: user_id, item_id: item_id}
  end

# cmfrec
data =
  Order.joins(:items).group("orders.user_id", "items.id").count.map do |(user_id, item_id), value|
    {user_id: user_id, item_id: item_id, value: value}
  end

500 seems a bit high for factors - I'd try 20-100.

Predictor uses the Jaccard index (memory-based CF) while Disco and cmfrec use matrix factorization (model-based CF), which are different approaches to collaborative filtering, but I'd expect the results for each to make sense.
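For reference, the Jaccard index between two items is the overlap between the sets of users who bought each item, divided by their union. A minimal sketch with illustrative user IDs:

```ruby
# Jaccard index between two items: |A ∩ B| / |A ∪ B| over the
# sets of users who bought each item.
def jaccard(users_a, users_b)
  union = (users_a | users_b).size
  return 0.0 if union.zero?
  (users_a & users_b).size.to_f / union
end

# Illustrative user IDs per item.
buyers = {
  "A" => [1, 2, 3],
  "B" => [2, 3, 4]
}

jaccard(buyers["A"], buyers["B"]) # => 0.5
```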

felixding commented 3 years ago

Thanks for your prompt reply.

We want an offline, item-based recommendation system. We do not intend to generate per-user recommendations. Therefore I guess item_recs is what we need?

How to double check the data? Anything we should particularly pay attention to? In our case, it's a simple array like this:

[
  {user_id: uuid1, item_id: 'item name 1'},
  {user_id: uuid2, item_id: 'item name 2'}
]

The item_id here is actually not an integer ID or a UUID, but a plain string (the item name). Can we use a string instead of an integer?

Well noted on passing each user_id/item_id combination only once.

Last but not least, does our way of validation make sense? Should we use this to get recommendations?

ankane commented 3 years ago

For the data:

  1. Strings are fine (just make sure multiple items don't have the same name)
  2. Try filtering out users with fewer than 5 items (this may fix it)
  3. Look at the data and statistics to see if anything looks off
puts "Users: #{recommender.user_ids.size}"
puts "Items: #{recommender.item_ids.size}"
puts "Observations: #{data.uniq.size}"

puts "Sample users: #{recommender.user_ids.sample(50)}"
puts "Sample items: #{recommender.item_ids.sample(50)}"
puts "Sample observations: #{data.uniq.sample(50)}"
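The filtering suggested in item 2 can be sketched like this, counting distinct items per user (the observations below are illustrative):

```ruby
# Illustrative observations: one entry per user_id/item_id combination.
data = [
  {user_id: 1, item_id: "A"}, {user_id: 1, item_id: "B"},
  {user_id: 1, item_id: "C"}, {user_id: 1, item_id: "D"},
  {user_id: 1, item_id: "E"},
  {user_id: 2, item_id: "A"}
]

# Count distinct items per user, then keep users with at least 5.
item_counts = data.group_by { |d| d[:user_id] }
                  .transform_values { |rows| rows.map { |r| r[:item_id] }.uniq.size }
filtered = data.select { |d| item_counts[d[:user_id]] >= 5 }

filtered.map { |d| d[:user_id] }.uniq # => [1]
```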

For validation, the approach you've described is a way to do recommendations, not a way to evaluate them. As I mentioned earlier, it'll tend to recommend globally popular items, so many items will have similar recommendations (which may not be the best user experience, but you can decide for yourself). I'm not aware of a great way to evaluate item-based recommendations offline, so I would run an A/B/C/D test to see how different approaches perform with your users.

felixding commented 3 years ago

We ended up using our own way for recommendations. Thank you for your help!