benfred / implicit

Fast Python Collaborative Filtering for Implicit Feedback Datasets
https://benfred.github.io/implicit/
MIT License
3.57k stars, 612 forks

curious what you would recommend for real-time training + prediction models? #491

Open victusfate opened 3 years ago

victusfate commented 3 years ago

I admire the api, efficiency, and results of implicit.

I'm finding a need for real time training + prediction in some of my company's systems, and started searching around for ideas/implementations. Has anyone had experience working with this?

Realize this is off topic from implicit (totally understand if it's closed). Starting to look for ideas here:

victusfate commented 3 years ago

After spending some time looking at HRNN and its implementations, I switched gears to something simpler that supports continuous learning: https://github.com/online-ml/river

victusfate commented 2 years ago

If anyone's curious, I'm building an open source version here: https://github.com/victusfate/concierge. I just hooked up Redis pubsub events to update the model today.

Todo: on server startup get all events since last model training and update each model

benfred commented 2 years ago

There are two different things you can do here with implicit to get near-realtime updates with the ALS model:

1) You can set the recalculate_user flag on the model.recommend calls to automatically regenerate the user representation. This lets your recommendations react at inference time to changes in what the user has interacted with.

2) I've added support for incremental retraining for ALS models just now with PR #527 - which will let you update the model with new items or users, as well as let you recalculate existing items with new interactions.
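
To make option 1 more concrete: regenerating a user representation amounts to solving a small regularized least-squares problem for one user while the item factors stay fixed. The sketch below is a pure-Python illustration of that idea under stated assumptions, not implicit's actual code. The names fold_in_user and solve, and the factor values, are made up for the example, and it assumes every interaction carries a confidence weight greater than zero.

```python
def solve(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fold_in_user(item_factors, interactions, regularization=0.01):
    """Least-squares user vector for fixed item factors (hypothetical helper).

    interactions maps item index -> confidence (> 0) for items the user touched.
    """
    k = len(item_factors[0])
    # A starts as Y^T Y + regularization * I, computed over all items
    a = [[sum(y[i] * y[j] for y in item_factors) + (regularization if i == j else 0.0)
          for j in range(k)] for i in range(k)]
    b = [0.0] * k
    for item, conf in interactions.items():
        y = item_factors[item]
        for i in range(k):
            b[i] += conf * y[i]  # right-hand side: confidence-weighted item factors
            for j in range(k):
                a[i][j] += (conf - 1.0) * y[i] * y[j]  # extra weight on observed items
    return solve(a, b)


item_factors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]  # made-up 2-d item factors
user = fold_in_user(item_factors, {0: 5.0})          # user interacted with item 0 only
scores = [sum(u * y for u, y in zip(user, f)) for f in item_factors]
```

In this toy run the item the user interacted with scores highest, the similar item 1 comes next, and the unrelated item 2 scores lowest. Since the sketch needs only the item factors and the user's interactions, it does not depend on the user having been part of training.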

victusfate commented 2 years ago

This is great news. I'd love to compare the results to river-ml, since I have more experience with implicit. When it's ready for review, it'd be great to see a small sample program/example with live updates to the model for recommendations. Oh, it's already ready to try out; I'll get this on my schedule.

Also worth noting I got the deployed system to work great.

I gather all user-item ratings hourly for a full training (snapshot model). When new servers come up, they load this model and then delta train from a Redis ordered set of all user-item ratings since the last model snapshot. In addition, live models receive real-time updates via Redis pubsub.

This way, at scale, I can run multiple predictor HTTP servers that all yield similar results. I can't guarantee they all receive all updates in the same order, but they are generally convergent. https://github.com/online-ml/river/discussions/803
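
A minimal sketch of that snapshot, catch-up, and live-update flow, using only in-memory stand-ins. CountingModel, learn_one, replay_since, and the event data are hypothetical names invented for illustration; the sorted list plays the role of the Redis ordered set scored by timestamp, and the toy model just sums ratings. With an additive update like this one, the order of updates does not matter at all; a real online model would typically only converge approximately.

```python
from collections import defaultdict


class CountingModel:
    """Toy stand-in for a recommender: summed ratings per (user, item)."""

    def __init__(self, snapshot=None):
        self.totals = defaultdict(float, snapshot or {})

    def learn_one(self, user, item, rating):
        self.totals[(user, item)] += rating


def replay_since(model, event_log, snapshot_ts):
    """Delta training: apply every logged event newer than the snapshot."""
    applied = 0
    for ts, user, item, rating in event_log:
        if ts > snapshot_ts:
            model.learn_one(user, item, rating)
            applied += 1
    return applied


snapshot = {("u1", "a"): 2.0}   # result of the last hourly full training
snapshot_ts = 100               # when that snapshot was taken
event_log = [
    (90, "u1", "a", 1.0),       # older than the snapshot, already included in it
    (110, "u1", "b", 1.0),
    (120, "u2", "a", 3.0),
]

# Server 1: load the snapshot, catch up from the log, then take a live event
server = CountingModel(snapshot)
replay_since(server, event_log, snapshot_ts)
server.learn_one("u2", "b", 1.0)

# Server 2: sees the same updates, but in a different order
other = CountingModel(snapshot)
other.learn_one("u2", "b", 1.0)
replay_since(other, list(reversed(event_log)), snapshot_ts)
```

Both servers end up with identical totals here, and the event from before the snapshot is not applied twice.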

sorenrife commented 2 years ago

When a user is new and the server hasn't been able to fit them into the model yet (as @victusfate explained, a pub/sub flow that adds new users/items should preferably run with some delay for performance reasons), how can I recommend to this new user?

Should I call the recommend method with a random userid and pass the new user's few interactions as user_items? If so, would it make sense to make the userid parameter optional?

(I'm assuming this without knowing how much the userid actually matters in the recommend method when the recalculate_user flag is true.)

victusfate commented 2 years ago

@sorenrife I ended up using popular results for new users in my current deployment using implicit (just hourly trained atm), and I think you can take the same approach with live model updates (keep an active popularity rank going as ratings come in).

Something like this (grabbing code snippets from my hourly training), where df is a pandas DataFrame:

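    # total rating per item, min-max normalized to [0, 1]
    pr = df.groupby(constants.ITEM_COLUMN)[constants.RATING_COLUMN].sum()
    spread = pr.max() - pr.min()
    # guard: if every item has the same total, spread is 0 and the division yields NaN
    pr = (pr - pr.min()) / spread if spread > 0 else pr * 0.0 + 1.0
    # most popular first; dicts keep insertion order
    self.item_popularity_map = pr.sort_values(ascending=False).to_dict()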

and in the rankings method:

  def rankings(self, user_id: str, selected_items):
    """Score selected_items for user_id; unknown users get popularity scores."""
    ranks = {}
    # map external item ids to model indices (raises KeyError for unknown items)
    selected_idx = [self.inv_item_map[item] for item in selected_items]

    # handle novel / unknown users with popularity rank
    if user_id not in self.inv_user_map:
      try:
        for k in selected_idx:
          item_name = self.item_map[k]
          # assumes item_popularity_map is keyed by the same index as item_map
          score     = self.item_popularity_map[k]
          ranks[item_name] = float(score)
      except Exception as e:
        print('ImplicitPredictor.rankings popularity exception', e)
    else:
      user_idx = self.inv_user_map[user_id]
      try:
        # rank_items yields (item index, score) pairs for the selected items
        rankings = self.model.rank_items(user_idx, self.user_items, selected_idx)
        for item_idx, score in rankings:
          item_name = self.item_map[item_idx]
          ranks[item_name] = float(score)
      except Exception as e:
        print('rankings exception', e)
    return ranks
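
For the live-update case, the batch popularity map could be replaced by running totals that are normalized on read. This is only a sketch of one way to do it: LivePopularity and its methods are hypothetical names, not part of implicit or of the deployment described above, and giving unseen items a score of 0 is an assumption.

```python
from collections import defaultdict


class LivePopularity:
    """Running per-item rating totals, min-max normalized when read."""

    def __init__(self):
        self.totals = defaultdict(float)

    def update(self, item, rating):
        # call once for every incoming rating event
        self.totals[item] += rating

    def scores(self, items):
        """Popularity in [0, 1] for the requested items; unseen items score 0."""
        if not self.totals:
            return {item: 0.0 for item in items}
        lo = min(self.totals.values())
        spread = max(self.totals.values()) - lo
        result = {}
        for item in items:
            if item not in self.totals:
                result[item] = 0.0
            elif spread > 0:
                result[item] = (self.totals[item] - lo) / spread
            else:
                result[item] = 1.0  # every known item is equally popular
        return result


pop = LivePopularity()
for item, rating in [("a", 1.0), ("b", 3.0), ("a", 1.0), ("c", 5.0)]:
    pop.update(item, rating)
ranks = pop.scores(["a", "b", "c", "new"])
```

Normalizing on read keeps each update to a single addition, at the cost of a min/max scan per query; caching the min and max would be a natural next step if queries are frequent.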