PredictionIO / template-scala-parallel-universal-recommendation

PredictionIO Template for Universal Recommender

Universal Recommendation Template

The Universal Recommender (UR) is a cooccurrence-type recommender that creates correlators from several user actions, events, or pieces of profile information, and serves recommendation queries with a search engine. It also supports item properties for filtering and boosting recommendations. This allows you to make use of any part of a user's clickstream, or even profile and context information, when making recommendations. TBD: several forms of popularity-type backfill and content-based correlators for content-based recommendations, plus filters on property date ranges. With these additions it will more closely live up to the name "Universal".

Quick Start

Check the prerequisites below before setup; they will inform the choices you make.

  1. Install the PredictionIO framework and be sure to choose HBase and Elasticsearch for storage. This template requires Elasticsearch.
  2. Make sure the PIO console and services are running; check with pio status
  3. Install this template with pio template get PredictionIO/template-scala-parallel-universal-recommendation

Import Sample Data

  1. Create a new app name, change appName in engine.json
  2. Run pio app new **your-new-app-name**
  3. Import sample events by running python examples/import_handmade.py --access_key **your-access-key** where the key can be retrieved with pio app list
  4. The engine.json file in the root directory of your new UR template is set up for the data you just imported (make sure to create a new one for your own data). Edit this file and change the appName parameter to match what you called the app in step #2
  5. Perform pio build, pio train, and pio deploy
  6. To execute some sample queries run ./examples/single-query-handmade.sh
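Once deployed, the engine answers queries POSTed as JSON to /queries.json (port 8000 by default). A minimal Python sketch of what the sample shell script does, using only the standard library; the user id is hypothetical and the helper names are ours, not part of the template:

```python
import json
import urllib.request

def build_query(user=None, item=None, num=None):
    """Build a UR query body; every query needs a 'user' or an 'item'."""
    if user is None and item is None:
        raise ValueError("query needs a 'user' or an 'item'")
    query = {}
    if user is not None:
        query["user"] = user
    if item is not None:
        query["item"] = item
    if num is not None:
        query["num"] = num
    return query

def send_query(query, host="http://localhost:8000"):
    """POST the query to the deployed engine and return the decoded response."""
    req = urllib.request.Request(
        host + "/queries.json",
        data=json.dumps(query).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))

# Against a live deployment:
# print(send_query(build_query(user="u1", num=4)))
```
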

Important Notes for the Impatient

What is a Universal Recommender

The Universal Recommender (UR) will accept a range of data, auto correlate it, and allow for very flexible queries. The UR is different from most recommenders in these ways:

Typical Uses:

Configuration, Events, and Queries

Primary and Secondary Data

There must be a "primary" event/action recorded for some number of users. This action defines the type of item returned in recommendations and is the measure against which all secondary data is judged. More technically speaking, all secondary data is tested for correlation to the primary event. Secondary data can be anything that you think may give some insight into the user. If something in the secondary data has no correlation to the primary event it will have no effect on recommendations. For instance, in an ecom setting you may want "buy" as the primary event. There may be many secondary events (though none is also fine), like (user-id, device-preference, device-id). This can be thought of as a user's device preference, recorded at every login. If it doesn't correlate with items bought it will not affect recommendations.

Biases

These take the form of boosts and filters where a neutral bias is 1.0. The importance of some part of the query may be boosted by a positive non-zero float. If the bias is < 0 it is considered a filter—meaning no recommendation is made that lacks the filter value(s). One example of a filter is where it may make sense to show only "electronics" recommendations when the user is viewing an electronics product. Biases are often applied to a list of data, for instance the user is looking at a video page with a cast of actors. The "cast" list is metadata attached to items and a query can show "people who liked this, also liked these" type recommendations but also include the current cast boosted by 0.5. This can be seen as showing similar item recommendations but using the cast members in a way that will not overpower the similar items (since by default they have a neutral 1.0 boost). The result would be similar items favoring ones with similar cast members.
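The cast-boost example above can be sketched in Python; the helper name and item ids are made up for illustration and are not part of the template:

```python
def field_bias(name, values, bias=1.0):
    """Build one entry for the "fields" array of a UR query.

    bias > 1.0 -> boost: favor items carrying one of the values
    bias = 1.0 -> neutral (the default weight)
    bias < 0   -> filter: only items carrying one of the values are returned
    """
    return {"name": name, "values": list(values), "bias": float(bias)}

# "People who liked this also liked these", favoring the current cast
# without letting it overpower similarity (cast boosted below neutral 1.0):
similar_with_cast = {
    "item": "video-123",  # hypothetical item id
    "fields": [field_bias("cast", ["actor-a", "actor-b"], 0.5)],
}
```
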

Dates

Dates can be used to filter recommendations in one of two ways, where the date range is attached to items or is specified in the query:

  1. The date range can be attached to every item and checked against the current date. The current date can be in the query or defaults to the current prediction server date. This mode requires that all items have an upper and lower date attached to them as properties. It is designed to express something like "available after" and "expired after". The default check against the server date is triggered when expireDateName and availableDateName are both specified but no date is passed in with the query. Note: Both dates must be attached to items or they will not be recommended. To get a one-sided filter, set the available date far in the past and/or the expire date far in the future.
  2. A "dateRange" can be specified in the query and the recommended items will have a date that lies between the range dates.
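A small helper for the second mode, using the beforeDate/afterDate field names that appear in the full query parameter list in this README; the helper itself is illustrative, not part of the template:

```python
def date_range(field, after=None, before=None):
    """Build a "dateRange" query clause; omit one bound for a one-sided range."""
    if after is None and before is None:
        raise ValueError("a dateRange needs at least one bound")
    clause = {"name": field}
    if before is not None:
        clause["beforeDate"] = before  # upper bound (recommendations dated before this)
    if after is not None:
        clause["afterDate"] = after    # lower bound (recommendations dated after this)
    return clause

query = {"user": "xyz", "dateRange": date_range(
    "dateFieldName",
    after="2015-08-15T11:28:45.114-07:00",
    before="2015-09-15T11:28:45.114-07:00",
)}
```
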

Engine.json

This file allows the user to describe and set parameters that control the engine's operation. Many values have defaults, so the following can be seen as the minimum for an ecom app with only one "buy" event. Reasonable defaults are used, so try this first and add tunings or new event types and item property fields as you become more familiar.

Simple Default Values

{
  "comment":" This config file uses default settings for all but the required values see README.md for docs",
  "id": "default",
  "description": "Default settings",
  "engineFactory": "org.template.RecommendationEngine",
  "datasource": {
    "params" : {
      "name": "sample-handmade-data.txt",
      "appName": "handmade",
      "eventNames": ["purchase", "view"]
    }
  },
  "sparkConf": {
    "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
    "spark.kryo.registrator": "org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator",
    "spark.kryo.referenceTracking": "false",
    "spark.kryoserializer.buffer.mb": "300",
    "spark.kryoserializer.buffer": "300m",
    "spark.executor.memory": "4g",
    "es.index.auto.create": "true"
  },
  "algorithms": [
    {
      "comment": "simplest setup where all values are default, popularity based backfill, must add eventNames",
      "name": "ur",
      "params": {
        "appName": "handmade",
        "indexName": "urindex",
        "typeName": "items",
        "comment": "must have data for the first event or the model will not build, other events are optional",
        "eventNames": ["purchase", "view"]
      }
    }
  ]
}

Complete Parameter Set

A full list of tuning and config parameters is below. See each field's description for its specific meaning. Some of the parameters act as default values for every query and can be overridden or added to in the query.

Note: It is strongly advised that you try the default/simple settings first before changing them. The possible exception is adding secondary events in the eventNames array.

{
  "id": "default",
  "description": "Default settings",
  "comment": "replace this with your JVM package prefix, like org.apache",
  "engineFactory": "org.template.RecommendationEngine",
  "datasource": {
    "params" : {
      "name": "some-data",
      "appName": "URApp1",
      "eventNames": ["buy", "view"]
    }
  },
  "comment": "This is for Mahout and Elasticsearch, the values are minimums and should not be removed",
  "sparkConf": {
    "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
    "spark.kryo.registrator": "org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator",
    "spark.kryo.referenceTracking": "false",
    "spark.kryoserializer.buffer.mb": "200",
    "spark.executor.memory": "4g",
    "es.index.auto.create": "true"
  },
  "algorithms": [
    {
      "name": "ur",
      "params": {
        "appName": "URApp1",
        "indexName": "urindex",
        "typeName": "items",
        "eventNames": ["buy", "view"],
        "blacklistEvents": ["buy", "view"],
        "maxEventsPerEventType": 100,
        "maxCorrelatorsPerEventType": 50,
        "maxQueryEvents": 500,
        "num": 20,
        "seed": 3,
        "recsModel": "all",
        "backfillField": {
            "backfillType": "popular",
            "eventnames": ["buy", "view"],
            "duration": 259200
        },
        "expireDateName": "expireDateFieldName",
        "availableDateName": "availableDateFieldName",
        "dateName": "dateFieldName",
        "userbias": -maxFloat..maxFloat,
        "itembias": -maxFloat..maxFloat,
        "returnSelf": true | false,
        "fields": [
          {
            "name": "fieldname",
            "values": ["fieldValue1", ...],
            "bias": -maxFloat..maxFloat
          },...
        ]
      }
    }
  ]
}

The "params" section controls most of the features of the UR. Possible values are:

Queries

Simple Personalized Query

{
  "user": "xyz"
}

This gets all default values from the engine.json and uses only action correlators for the types specified there.

Simple Similar Items Query

{
  "item": "53454543513"
}

This returns items that are similar to the query item; blacklist and backfill default to what is in engine.json.

Full Query Parameters

Query fields determine what data is used to match when returning recommendations. Some fields have default values in engine.json and so may never be needed in individual queries. On the other hand all values from engine.json may be overridden or added to in an individual query. The only requirement is that there must be a user or item in every query.

{
  "user": "xyz",
  "userBias": -maxFloat..maxFloat,
  "item": "53454543513",
  "itemBias": -maxFloat..maxFloat,
  "num": 4,
  "fields": [
    {
      "name": "fieldname",
      "values": ["fieldValue1", ...],
      "bias": -maxFloat..maxFloat
    }, ...
  ],
  "dateRange": {
    "name": "dateFieldName",
    "beforeDate": "2015-09-15T11:28:45.114-07:00",
    "afterDate": "2015-08-15T11:28:45.114-07:00"
  },
  "currentDate": "2015-08-15T11:28:45.114-07:00",
  "blacklistItems": ["itemId1", "itemId2", ...],
  "returnSelf": true | false
}

All query params are optional; the only rule is that there must be an item or user specified. Defaults are either noted or taken from algorithm values, which themselves may have defaults. This allows very simple queries for the simplest, most common cases.
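The override behavior can be pictured as a simple merge: engine.json supplies defaults and the query wins on conflicts. A Python sketch of the semantics (the sample defaults mirror the complete parameter set above; the function is ours, not the engine's code):

```python
# Stand-ins for values read from engine.json's algorithm params:
ENGINE_DEFAULTS = {"num": 20, "blacklistEvents": ["buy", "view"]}

def resolve_query(query, defaults=ENGINE_DEFAULTS):
    """Every query needs a user or an item; everything else falls back
    to the algorithm params in engine.json."""
    if "user" not in query and "item" not in query:
        raise ValueError("query must contain a 'user' or an 'item'")
    resolved = dict(defaults)
    resolved.update(query)  # per-query values override engine.json defaults
    return resolved
```
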

The query returns personalized recommendations, similar items, or a mix including backfill. The query itself determines this by supplying item, user or both. Some examples are:

Contextual Personalized

{
  "user": "xyz",
  "fields": [
    {
      "name": "categories",
      "values": ["series", "mini-series"],
      "bias": -1 // filter out all except 'series' or 'mini-series'
    },{
      "name": "genre",
      "values": ["sci-fi", "detective"],
      "bias": 1.02 // boost/favor recommendations with 'genre' = 'sci-fi' or 'detective'
    }
  ]
}

This returns items based on user "xyz" history, filtered by categories and boosted to favor more genre-specific items. The values for fields have been attached to items with $set events, where the "name" corresponds to a doc field and the "values" correspond to the contents of the field. The "bias" indicates a filter or a boost. For Solr or Elasticsearch the boost is sent as-is to the engine and its meaning is determined by the engine (Lucene in either case). As always, the blacklist and backfill use the defaults in engine.json.

Date ranges as query filters

When a date is stored in the item properties it can be used in a date range query. This is most often used by the app server, since it may know what the range is, while a client query may only know the current date and so would use the "Current Date" filter below.

{
  "user": "xyz",
  "fields": [
    {
      "name": "categories",
      "values": ["series", "mini-series"],
      "bias": -1 // filter out all except 'series' or 'mini-series'
    },{
      "name": "genre",
      "values": ["sci-fi", "detective"],
      "bias": 1.02 // boost/favor recommendations with 'genre' = 'sci-fi' or 'detective'
    }
  ],
  "dateRange": {
    "name": "availabledate",
    "before": "2015-08-20T11:28:45.114-07:00",
    "after": "2015-08-15T11:28:45.114-07:00"
  }
}

Items are assumed to have a field of the same name with a date attached to it via a $set event. The query will return only those recommendations whose date field is in range. Either date bound can be omitted for a one-sided range. The range applies to all returned recommendations, even those for popular items.
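The range check the engine applies can be sketched in Python. This is an illustration of the semantics only, not the engine's code; timestamps are in the ISO-8601 form the examples use:

```python
from datetime import datetime

def in_date_range(item_date, after=None, before=None):
    """True if item_date lies within [after, before]; a missing bound is open.

    fromisoformat (Python 3.7+) handles timestamps like
    "2015-08-15T11:28:45.114-07:00" directly.
    """
    d = datetime.fromisoformat(item_date)
    if after is not None and d < datetime.fromisoformat(after):
        return False
    if before is not None and d > datetime.fromisoformat(before):
        return False
    return True
```
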

Current Date as a query filter

When setting an available date and expire date on items, the current date can be used as a filter: the UR will check that the current date is before the expire date and on or after the available date. You can use either the expire date or the available date or both. The names of these item fields are specified in engine.json.

{
  "user": "xyz",
  "fields": [
    {
      "name": "categories",
      "values": ["series", "mini-series"],
      "bias": -1 // filter out all except 'series' or 'mini-series'
    },{
      "name": "genre",
      "values": ["sci-fi", "detective"],
      "bias": 1.02
    }
  ],
  "currentDate": "2015-08-15T11:28:45.114-07:00"
}

Contextual Personalized with Similar Items

{
  "user": "xyz",
  "userBias": 2, // favor personal recommendations
  "item": "53454543513", // fallback to contextual recommendations
  "fields": [
    {
      "name": "categories",
      "values": ["series", "mini-series"],
      "bias": -1 // filter out all except 'series' or 'mini-series'
    },{
      "name": "genre",
      "values": ["sci-fi", "detective"],
      "bias": 1.02 // boost/favor recommendations with 'genre' = 'sci-fi' or 'detective'
    }
  ]
}

This returns items based on user xyz's history or items similar to item 53454543513, favoring the user-history recommendations. These are filtered by categories and boosted to favor more genre-specific items.

Note: This query should be considered experimental. Mixing user history with item similarity is possible but may have unexpected results. If you use this, realize that user and item recommendations may be quite divergent, so mixing them in one query may produce nonsense. Use this only with the engine.json settings for "userbias" and "itembias" to favor one over the other.

Popular Items

{
}

This is a simple way to get popular items. All returned scores will be 0 but the order will be based on relative popularity. Field-based biases for boosts and filters can also be applied.

Events

The Universal Recommender takes in potentially many events. These should be seen as a primary event, which is a very clear indication of a user preference, and secondary events that we think may tell us something about user "taste" in some way. The Universal Recommender is built on a distributed correlation engine, so it will test whether these secondary events actually relate to the primary one; those that do not correlate will have little or no effect on recommendations (though they will lengthen training and query times). It is recommended that you start with one or two events and increase the number as you see how those events affect results and timing.

Usage Events

Events in PredictionIO are sent to the EventServer in the following form:

{
    "event" : "purchase",
    "entityType" : "user",
    "entityId" : "1243617",
    "targetEntityType" : "item",
    "targetEntityId" : "iPad",
    "properties" : {},
    "eventTime" : "2015-10-05T21:02:49.228Z"
}

This is what a "purchase" event looks like. Note that a usage event is always from a user and has a user id. Also the "targetEntityType" is always "item". The actual target entity type is implied by the event name. So to create a "category-preference" event you would send something like this:

{
    "event" : "category-preference",
    "entityType" : "user",
    "entityId" : "1243617",
    "targetEntityType" : "item",
    "targetEntityId" : "electronics",
    "properties" : {},
    "eventTime" : "2015-10-05T21:02:49.228Z"
}

This event would be sent when the user clicked on the "electronics" category or perhaps purchased an item that was in the "electronics" category. Note that the "targetEntityType" is always "item".
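Both events share the same shape, so a small builder keeps client code consistent. This is a sketch, not part of any PredictionIO SDK; the payloads it produces match the examples above:

```python
def usage_event(event, user_id, target_id, event_time):
    """Build a usage event: always user -> item, with the real target type
    implied by the event name (e.g. a category id for category-preference)."""
    return {
        "event": event,
        "entityType": "user",
        "entityId": str(user_id),
        "targetEntityType": "item",  # always "item" for usage events
        "targetEntityId": str(target_id),
        "properties": {},
        "eventTime": event_time,
    }

purchase = usage_event("purchase", "1243617", "iPad",
                       "2015-10-05T21:02:49.228Z")
pref = usage_event("category-preference", "1243617", "electronics",
                   "2015-10-05T21:02:49.228Z")
```
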

Property Change Events

To attach properties to items use a $set event like this:

{
    "event" : "$set",
    "entityType" : "item",
    "entityId" : "ipad",
    "properties" : {
        "category": ["electronics", "mobile-phones"],
        "expireDate": "2016-10-05T21:02:49.228Z",
        "availableDate": "2015-10-05T21:02:49.228Z"
    },
    "eventTime" : "2015-10-05T21:02:49.228Z"
}

Unless a property has a special meaning specified in the engine.json, like date values, the property is assumed to be an array of strings, which act as categorical tags. You can add things like "premium" to the "tier" property then later if the user is a subscriber you can set a filter that allows recommendations from "tier": ["free", "premium"] where a non subscriber might only get recommendations for "tier": ["free"]. These are passed in to the query using the "fields" parameter (see Contextual queries above).
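The "tier" example can be sketched like this; the set_properties helper is illustrative, not an SDK call, and the payload matches the $set shape above:

```python
def set_properties(item_id, properties, event_time):
    """Build a $set event attaching properties to an item; plain properties
    are arrays of strings acting as categorical tags."""
    return {
        "event": "$set",
        "entityType": "item",
        "entityId": str(item_id),
        "properties": properties,
        "eventTime": event_time,
    }

tag_tier = set_properties("ipad", {"tier": ["premium"]},
                          "2015-10-05T21:02:49.228Z")

# Later, a subscriber's query can allow both tiers via "fields",
# while a non-subscriber's query would pass only ["free"]:
subscriber_fields = [{"name": "tier", "values": ["free", "premium"], "bias": -1}]
```
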

Using properties is how boosts and filters are applied to recommended items. It may seem odd to treat a category both as a filter and as a secondary event (category-preference), but the two pieces of data are used in quite different ways. As properties they bias the recommendations; as events they add to the user data from which recommendations are made. In other words, as properties they work with boost and filter business rules, while as secondary usage events they reveal something about user taste that makes recommendations better.

Creating a New Model or Adding Event Types

To begin using new data with an engine that has been used with sample data or using different events follow these steps:

  1. Create a new app name, backup your old engine.json and change appName in the new engine.json
  2. Run pio app new **your-new-app-name**
  3. Make any changes to engine.json to specify new event names and config values. Make sure "eventNames": ["**your-primary-event**", "**a-secondary-event**", "**another-secondary-event**", ...] contains the exact string used for your events and that the primary one is first in the list.
  4. Import new events or allow enough to accumulate into the EventStore. If you are using sample events from a file run python examples/**your-python-import-script**.py --access_key **your-access-key** where the key can be retrieved with pio app list
  5. Perform pio build, pio train, and pio deploy
  6. Copy and edit the sample query script to match your new data. For new user ids pick a user that exists in the events, same for metadata fields, and items.
  7. Run your edited query script and check the recommendations.

Tests

Integration test: Once PIO and all services are running but before any model is deployed, run ./examples/integration-test. This will print a list of differences between the actual and expected results; none means the test passed. Note that the model will remain deployed afterward and will have to be deployed over or killed by pid.

Event name restricted query test: this is for the feature that allows event names to be specified in the query. It restricts the user history that is used to create recommendations and is primarily for use with the MAP@k cross-validation test. The engine config removes the blacklisting of items so it must be used when doing MAP@k calculations. This test uses the simple sample data. Steps to try the test are:

  1. start pio and all services
  2. pio app new handmade
  3. python examples/import_handmade.py --access_key <key-from-app-new>
  4. cp engine.json engine.json.orig
  5. cp event-names-test=engine.json engine.json
  6. pio train
  7. pio deploy
  8. ./examples/single-eventNames-query.sh
  9. restore engine.json
  10. kill the deployed prediction server

MAP@k: This tests the predictive power of each usage event/indicator. All eventNames used in queries must be removed from the blacklisted events in the engine.json used for a particular dataset. So if "eventNames": ["purchase","view"] is in the engine.json for the dataset, these events must be removed from the blacklist with "blacklistEvents": [], which tells the engine not to blacklist items with these eventNames for a user. Allowing blacklisting would artificially lower MAP@k and so not give the desired result.
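MAP@k itself is standard mean average precision at k: for each user, average the precision at each rank where a held-out relevant item appears, then average over users. A reference sketch for checking your own calculation (not the template's test code):

```python
def apk(recommended, relevant, k):
    """Average precision at k for one user.

    recommended: ranked list of item ids returned by a query
    relevant: set of held-out item ids the user actually acted on
    """
    if not relevant:
        return 0.0
    hits, score = 0, 0.0
    for i, item in enumerate(recommended[:k]):
        if item in relevant:
            hits += 1
            score += hits / (i + 1)  # precision at this rank
    return score / min(len(relevant), k)

def mapk(all_recommended, all_relevant, k):
    """Mean of apk over users keyed in all_relevant."""
    if not all_relevant:
        return 0.0
    return sum(apk(all_recommended.get(u, []), all_relevant[u], k)
               for u in all_relevant) / len(all_relevant)
```
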

Versions

v0.2.3

v0.2.2

v0.2.1

v0.2.0

v0.1.1

v0.1.0

Known issues

References

License

This software is licensed under the Apache License, Version 2.0, found here: http://www.apache.org/licenses/LICENSE-2.0