PredictionIO / template-scala-parallel-universal-recommendation

PredictionIO Template for Universal Recommender

A bug about MaxQueryEvents? #27

Closed patng323 closed 8 years ago

patng323 commented 8 years ago

According to the doc, MaxQueryEvents is an integer specifying the number of most recent primary actions used to make recommendations for an individual. A larger value means some less recent actions will be included.

To see what would happen, I set MaxQueryEvents to 2 and debugged the code using the handmade example. Inside getBiasedRecentUserActions(), maxQueryEvents was used correctly to limit the number of events per action type. However, when the program came to URAlgorithm.scala, line 336:

      val recentUserHistory = if ( ap.userBias.getOrElse(1f) >= 0f )
        alluserEvents._1.slice(0, ap.maxQueryEvents.getOrElse(defaultURAlgorithmParams.DefaultMaxQueryEvents) - 1)
      else List.empty[BoostableCorrelators]

Before that line was run, alluserEvents._1 contained:

alluserEvents:
 _1 = {$colon$colon@10508} "::" size = 2
  0 = {URAlgorithm$BoostableCorrelators@13237} "BoostableCorrelators(purchase,List(Ipad-retina, Iphone 5),None)"
  1 = {URAlgorithm$BoostableCorrelators@13238} "BoostableCorrelators(view,List(Phones, Mobile-acc),None)"

And after the line was executed, recentUserHistory contained:

recentUserHistory = {$colon$colon@13277} "::" size = 1
 0 = {URAlgorithm$BoostableCorrelators@13237} "BoostableCorrelators(purchase,List(Ipad-retina, Iphone 5),None)"

So it seems the code used maxQueryEvents again, but this time to limit the number of BoostableCorrelators. As a result, all the view events were dropped.
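For reference, Scala's slice(from, until) is end-exclusive, so the extra "- 1" keeps one fewer element than maxQueryEvents; a minimal standalone sketch with placeholder values (not the template's actual types):

```scala
// List.slice(from, until) excludes `until`, so "slice(0, maxQueryEvents - 1)"
// keeps maxQueryEvents - 1 correlators instead of maxQueryEvents.
val correlators = List("purchase-correlator", "view-correlator")
val maxQueryEvents = 2

val buggy = correlators.slice(0, maxQueryEvents - 1) // drops the view correlator
val fixed = correlators.slice(0, maxQueryEvents)     // keeps both
```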

Is this a bug?

pferrel commented 8 years ago

Thanks Patrick, this is indeed a bug. As you say, there should be only one limit applied and the -1 is wrong in any case. I think you now know the code better than I do :-)

Fixed in 0.3.0 coming in Jan.

This illustrates that the per-query limit applies to all event types as one collection ordered by recency. For very small numbers you may be removing the highest quality events, like "purchase". Small numbers will also trade precision for recency. Only A/B testing can say whether that is good or not.

BTW, we need to have a downsampling limit per event type, since some event types will have a greatly different dimensionality than the primary, and some may be dense (which is bad for computation speed). The maxEventsPerEventType downsampling is applied to the input, but this issue reminded me of the need.
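For context, a sketch of where these two limits would sit in the algorithm params of engine.json (parameter names taken from this thread; values and surrounding structure are illustrative, not definitive):

```
{
  "algorithms": [{
    "name": "ur",
    "params": {
      "maxQueryEvents": 100,
      "maxEventsPerEventType": 500
    }
  }]
}
```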

patng323 commented 8 years ago

> This illustrates that the per-query limit applies to all event types as one collection ordered by recency. For very small numbers you may be removing the highest quality events, like "purchase".

I set the number to 2 just for testing. In a real production environment, I don't think this bug would cause a problem, because I don't expect anyone to set it to a value smaller than the number of event types, so recentUserHistory will contain all items in alluserEvents._1.

On the other hand, contrary to what the doc describes, the code in getBiasedRecentUserActions() retains MaxQueryEvents events from each event type. So the behavior is actually more like a "maxEventsPerEventType for the query".

Speaking of downsampling by 'max events': we are instead considering modifying the code to add a feature that lets us specify a date range (e.g. the last 2 years) for downsampling during a query, since we want recommendations to match the user's recent (time-wise) behavior. But before we start, we'd like to seek your opinion and advice.
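To make the proposal concrete, a hedged Scala sketch of the two filtering strategies side by side (UserEvent and both helpers are hypothetical names, not the template's actual types):

```scala
import java.time.{Duration, Instant}

// Hypothetical event shape -- a stand-in for events read from the event store.
case class UserEvent(eventType: String, item: String, time: Instant)

// Proposed: keep only events inside a trailing date range (e.g. last 2 years).
def byDateRange(events: List[UserEvent], window: Duration, now: Instant): List[UserEvent] =
  events.filter(e => !e.time.isBefore(now.minus(window)))

// Current behavior: keep the N most recent events regardless of age.
def byCount(events: List[UserEvent], n: Int): List[UserEvent] =
  events.sortBy(-_.time.toEpochMilli).take(n)
```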

pferrel commented 8 years ago

Yes, this is usually not a problem, but a bug nonetheless.

I think you may mistake what some of the downsampling params do. The docs say:

This is the number of events retained from the input training data A, not in the query or in the A'A model. This is the first order of downsampling, and it is done in Mahout's SimilarityAnalysis.

It is the second order of downsampling: the number of items in a row of A'A, also done by Mahout.

These downsampling params have been studied in a paper here: https://ssc.io/pdf/rec11-schelter.pdf where the conclusion was that more events lead to very little gain in quality but can make the computation approach O(n^2).

The final control is the number of events used in the query, taken as some number of the most recent. One problem with a date range is that users will have widely varying numbers of events; in fact, even an active user may have no events in your time period. Setting a number per user is more likely to get an optimal number of events and will use the most recent ones available. Remember that one or two events are not enough to show differentiation of user taste, so you will get recommendations based on recent history, but for many users that history is so small it will do worse than something like trending items. So be careful: these changes are best A/B tested.

I did a test using all user events with a large ecom dataset and found precision increasing with the time range up to the limit of our data, which was 1 year. This isn't A/B testing, but it hints that we should be careful in adjusting the events per query.

patng323 commented 8 years ago

Pat, thanks for the explanation. Will take time to study that paper.

And I have a related question. (If you think I should ask it in the forum please let me know.)

First of all, after reviewing a few different recommendation systems out there, in the end we really love the PIO Universal Recommender because of its multi-modal nature and its flexibility in using ES for boosting and filtering! :-) It really matches our needs.

Our website has ~200,000 items, with million+ users, and currently it generates 5+ million events (one primary + 5-6 other secondary event types) each month. As a proof of concept we imported only one month's worth of events into it, and so far the training time is acceptable (less than an hour on a single machine with only 12G of RAM and 4 cores).

But going forward, we worry that:

  1. The training time will grow over time as we accumulate more and more events each month (we plan to train the model nightly)
  2. During a query, when the PIO engine requests a user's history from HBase, the size of the response will grow over time as the history of each user keeps growing
  3. The size of the query sent from PIO Engine to ES (and its response time) will grow over time for the same reason

I assume: For 1, the solution is to grow the size of the Spark cluster. For 2, if it ever becomes a problem, the solution is to grow the HBase cluster. For 3, the solution is to grow the ES cluster.

Are there options other than "grow the cluster"?

I believe you've dealt with scenarios which are many, many times larger. Are there any reference sizing numbers from one of the showcases?

We love the UR template, and I really hope our production rollout can be a success! :-)

pferrel commented 8 years ago

Could you ask this on the list? I'd like to answer where others can benefit. We are adding a feature that maintains a moving window of events, dedups them, and compacts property-change events. This will keep training time constant once you reach your desired full time window. More details if you post to the list.

patng323 commented 8 years ago

Thanks. Just did! https://groups.google.com/forum/#!topic/predictionio-user/MgWwdAsOAYI