The Universal Recommender (UR) is a cooccurrence-type recommender that creates correlators from several user actions, events, or pieces of profile information, and performs the recommendations query with a search engine. It also supports item properties for filtering and boosting recommendations. This allows you to make use of any part of your users' clickstream, or even profile and context information, when making recommendations. TBD: several forms of popularity-type backfill and content-based correlators for content-based recommendations, as well as filters on property date ranges. With these additions it will more closely live up to the name "Universal".
Check the prerequisites below before setup; they will inform the choices you make.
1. Make sure PredictionIO is installed and running: pio status
2. Get the template: pio template get PredictionIO/template-scala-parallel-universal-recommendation
3. Pick a name for your app and set appName in engine.json.
4. Create the app: pio app new **your-new-app-name**
5. Import sample events: python examples/import_handmade.py --access_key **your-access-key**, where the key can be retrieved with pio app list.
6. Make sure the appName parameter in engine.json matches what you called the app when you created it, then run pio build, pio train, and pio deploy.
7. Run the sample query: ./examples/single-query-handmade.sh

Note: the pio train step takes most of the time.

The Universal Recommender (UR) will accept a range of data, auto-correlate it, and allow for very flexible queries. The UR is different from most recommenders in these ways:
There must be a "primary" event/action recorded for some number of users. This action defines the type of item returned in recommendations and is the measure against which all secondary data is tested. More technically speaking, all secondary data is tested for correlation to the primary event. Secondary data can be anything that you think may give some insight into the user. If something in the secondary data has no correlation to the primary event, it will have no effect on recommendations. For instance, in an ecom setting you may want "buy" as the primary event. There may be many secondary events (though none is also fine), such as (user-id, device-preference, device-id) recorded at every login, which can be thought of as a user's device preference. If this doesn't correlate to items bought, it will not affect recommendations.
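The correlation test can be illustrated with the log-likelihood ratio (LLR) score used by Mahout-style cooccurrence analysis; this is a sketch of the idea, not the UR's actual code:

```python
import math

def xlogx(x):
    # x * log(x), defined as 0 at x = 0
    return 0.0 if x == 0 else x * math.log(x)

def entropy(*counts):
    return xlogx(sum(counts)) - sum(xlogx(c) for c in counts)

def llr(k11, k12, k21, k22):
    """Log-likelihood ratio for a 2x2 contingency table:
    k11 = users with both the secondary and primary event,
    k12 = secondary only, k21 = primary only, k22 = neither."""
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    return max(0.0, 2.0 * (row + col - mat))

# A secondary event seen together with "buy" far more than chance scores
# high; an event independent of "buy" scores ~0 and so has no effect.
print(llr(10, 0, 0, 10))  # strongly correlated
print(llr(5, 5, 5, 5))    # independent -> ~0
```

A secondary event whose LLR with the primary event is near zero contributes essentially nothing to the model, which is why uncorrelated data is harmless (apart from train time).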
These take the form of boosts and filters, where a neutral bias is 1.0. The importance of some part of the query may be boosted by a positive non-zero float. If the bias is < 0 it is treated as a filter, meaning no recommendation is made that lacks the filter value(s). One example of a filter is showing only "electronics" recommendations when the user is viewing an electronics product. Biases are often applied to a list of data; for instance, the user is looking at a video page with a cast of actors. The "cast" list is metadata attached to items, and a query can show "people who liked this, also liked these" type recommendations while also including the current cast boosted by 0.5. This can be seen as showing similar-item recommendations but using the cast members in a way that will not overpower the similar items (since by default they have a neutral 1.0 boost). The result would be similar items, favoring ones with similar cast members.
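The bias semantics can be sketched over a scored candidate list; field matching is simplified here to set membership, and the multiplicative boost is this sketch's assumption (the real scoring is done by the search engine):

```python
def apply_bias(items, field, values, bias):
    """items: list of dicts like {"id": ..., "score": ..., field: [tags]}.
    bias < 0 filters to items matching any value; bias > 0 scales the
    score of matching items; 1.0 is neutral."""
    vals = set(values)
    if bias < 0:
        return [it for it in items if vals & set(it.get(field, []))]
    out = []
    for it in items:
        score = it["score"] * (bias if vals & set(it.get(field, [])) else 1.0)
        out.append({**it, "score": score})
    return out

items = [
    {"id": "tv1", "score": 1.0, "categories": ["electronics"]},
    {"id": "sofa", "score": 1.2, "categories": ["furniture"]},
]
# Filter: only electronics survive.
only_elec = apply_bias(items, "categories", ["electronics"], -1)
# Boost: electronics scores are scaled, others untouched.
boosted = apply_bias(items, "categories", ["electronics"], 1.5)
```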
Dates can be used to filter recommendations in one of two ways: the date range is either attached to items or specified in the query:
This file allows you to describe and set the parameters that control the engine's operation. Many values have defaults, so the following can be seen as the minimum for an ecom app with only one "buy" event. Reasonable defaults are used, so try this first and add tunings, new event types, or item property fields as you become more familiar.
{
"comment":" This config file uses default settings for all but the required values see README.md for docs",
"id": "default",
"description": "Default settings",
"engineFactory": "org.template.RecommendationEngine",
"datasource": {
"params" : {
"name": "sample-handmade-data.txt",
"appName": "handmade",
"eventNames": ["purchase", "view"]
}
},
"sparkConf": {
"spark.serializer": "org.apache.spark.serializer.KryoSerializer",
"spark.kryo.registrator": "org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator",
"spark.kryo.referenceTracking": "false",
"spark.kryoserializer.buffer.mb": "300",
"spark.kryoserializer.buffer": "300m",
"spark.executor.memory": "4g",
"es.index.auto.create": "true"
},
"algorithms": [
{
"comment": "simplest setup where all values are default, popularity based backfill, must add eventsNames",
"name": "ur",
"params": {
"appName": "handmade",
"indexName": "urindex",
"typeName": "items",
"comment": "must have data for the first event or the model will not build, other events are optional",
"eventNames": ["purchase", "view"]
}
}
]
}
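A config like the one above can be sanity-checked by parsing it and verifying the pieces this README says the engine requires; the checks below are assumptions drawn from this document, not the engine's own validation:

```python
import json

config = json.loads("""
{
  "id": "default",
  "engineFactory": "org.template.RecommendationEngine",
  "datasource": {"params": {"appName": "handmade", "eventNames": ["purchase", "view"]}},
  "algorithms": [
    {"name": "ur",
     "params": {"appName": "handmade", "indexName": "urindex",
                "typeName": "items", "eventNames": ["purchase", "view"]}}
  ]
}
""")

algo = config["algorithms"][0]["params"]
# The primary event must come first, and the datasource and algorithm
# sections must agree on app name and event names.
assert algo["eventNames"][0] == "purchase"
assert algo["eventNames"] == config["datasource"]["params"]["eventNames"]
assert algo["appName"] == config["datasource"]["params"]["appName"]
```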
A full list of tuning and config parameters is below. See each field's description for its specific meaning. Some of the parameters act as default values for every query and can be overridden or added to in the query.

Note: it is strongly advised that you try the default/simple settings first before changing them. The possible exception is adding secondary events in the eventNames array.
{
"id": "default",
"description": "Default settings",
"comment": "replace this with your JVM package prefix, like org.apache",
"engineFactory": "org.template.RecommendationEngine",
"datasource": {
"params" : {
"name": "some-data",
"appName": "URApp1",
"eventNames": ["buy", "view"]
}
},
"comment": "This is for Mahout and Elasticsearch; the values are minimums and should not be removed",
"sparkConf": {
"spark.serializer": "org.apache.spark.serializer.KryoSerializer",
"spark.kryo.registrator": "org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator",
"spark.kryo.referenceTracking": "false",
"spark.kryoserializer.buffer.mb": "200",
"spark.executor.memory": "4g",
"es.index.auto.create": "true"
},
"algorithms": [
{
"name": "ur",
"params": {
"appName": "URApp1",
"indexName": "urindex",
"typeName": "items",
"eventNames": ["buy", "view"],
"blacklistEvents": ["buy", "view"],
"maxEventsPerEventType": 100,
"maxCorrelatorsPerEventType": 50,
"maxQueryEvents": 500,
"num": 20,
"seed": 3,
"recsModel": "all",
"backfillField": {
"backfillType": "popular",
"eventNames": ["buy", "view"],
"duration": 259200
},
"expireDateName": "expireDateFieldName",
"availableDateName": "availableDateFieldName",
"dateName": "dateFieldName",
"userbias": -maxFloat..maxFloat,
"itembias": -maxFloat..maxFloat,
"returnSelf": true | false,
"fields": [
{
"name": "fieldname",
"values": ["fieldValue1", ...],
"bias": -maxFloat..maxFloat
},...
]
}
}
]
}
The "params" section controls most of the features of the UR. Possible values are:

- appName: must match the name of your app; it can be retrieved with pio app list.
- indexName and typeName: the Elasticsearch index and type where the model is stored. You can access ES through its REST interface at http:/**elasticsearch-machine**/indexName/typeName/...
- eventNames: the event names used as correlators, primary event first.
- dateRange: an optional recommendations filter based on a date property.
- backfillField: controls popularity-type backfill. Its eventNames defaults to the first of the algorithm's eventNames, corresponding to the primary action, and duration = 259200 is the number of seconds in 3 days. The primary/first event used for recommendations is always attached to items you wish to recommend; the other events are not necessarily attached to the same items. If events like "category-preference" are used, then popular categories will be calculated and this will have no effect on backfill. Possible backfillTypes are "popular", "trending", and "hot", which correspond to the number of events in the duration, the average event velocity, and the average event acceleration over the time indicated. This is calculated for every event and is used to rank items, so it can be combined with biasing metadata to get, for instance, hot items in some category. Note: when using "hot" the algorithm divides the events into three periods, and since events tend to be cyclical by day, 3 days will produce results mostly free of daily effects for all types. Making this time period smaller may cause odd effects from the time of day the algorithm is executed. "popular" is not split and "trending" splits the events in two, so choose the duration accordingly.

{
"user": "xyz"
}
This gets all default values from the engine.json and uses only action correlators for the types specified there.
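The three backfillTypes described above can be illustrated as follows: "popular" counts events in the duration, "trending" compares the two halves of the window (velocity), and "hot" compares the change across three slices (acceleration). A sketch, assuming per-item event timestamps in seconds:

```python
def split_counts(timestamps, start, duration, parts):
    """Count events per equal sub-period of [start, start + duration)."""
    width = duration / parts
    counts = [0] * parts
    for t in timestamps:
        if start <= t < start + duration:
            counts[int((t - start) // width)] += 1
    return counts

def popular(ts, start, duration):
    return split_counts(ts, start, duration, 1)[0]   # raw event count

def trending(ts, start, duration):
    a, b = split_counts(ts, start, duration, 2)
    return b - a                                     # event velocity

def hot(ts, start, duration):
    a, b, c = split_counts(ts, start, duration, 3)
    return (c - b) - (b - a)                         # event acceleration

ts = [0, 1, 2, 10, 11, 20, 21, 22, 23]  # events within a 30-second window
print(popular(ts, 0, 30))   # 9
print(trending(ts, 0, 30))
print(hot(ts, 0, 30))
```

Ranking items by one of these scores is what fills in recommendations when personalized results run out.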
{
"item": "53454543513"
}
This returns items that are similar to the query item; blacklist and backfill default to what is in engine.json.
Query fields determine what data is used to match when returning recommendations. Some fields have default values in engine.json and so may never be needed in individual queries. On the other hand, all values from engine.json may be overridden or added to in an individual query. The only requirement is that every query must contain a user or an item.
{
"user": "xyz",
"userBias": -maxFloat..maxFloat,
"item": "53454543513",
"itemBias": -maxFloat..maxFloat,
"num": 4,
"fields": [
{
"name": "fieldname",
"values": ["fieldValue1", ...],
"bias": -maxFloat..maxFloat
}, ...
],
"dateRange": {
"name": "dateFieldName",
"beforeDate": "2015-09-15T11:28:45.114-07:00",
"afterDate": "2015-08-15T11:28:45.114-07:00"
},
"currentDate": "2015-08-15T11:28:45.114-07:00",
"blacklistItems": ["itemId1", "itemId2", ...],
"returnSelf": true | false
}
beforeDate and afterDate are strings in ISO 8601 format. A date range is ignored if currentDate is also specified in the query. All query params are optional; the only rule is that there must be an item or a user specified. Defaults are either noted or taken from algorithm values, which themselves may have defaults. This allows very simple queries for the simplest, most common cases.
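The rules above (user or item required; dateRange ignored when currentDate is present) can be captured in a small client-side query builder; this is a sketch, not part of the UR itself:

```python
import json

def build_query(user=None, item=None, num=None, fields=None,
                date_range=None, current_date=None,
                blacklist_items=None, return_self=None):
    """Build a UR query dict; everything is optional except that at
    least one of user or item must be present."""
    if user is None and item is None:
        raise ValueError("query must specify a user or an item")
    if date_range is not None and current_date is not None:
        # The engine ignores dateRange when currentDate is present.
        date_range = None
    query = {"user": user, "item": item, "num": num, "fields": fields,
             "dateRange": date_range, "currentDate": current_date,
             "blacklistItems": blacklist_items, "returnSelf": return_self}
    return {k: v for k, v in query.items() if v is not None}

q = build_query(user="xyz", num=4,
                fields=[{"name": "categories",
                         "values": ["series"], "bias": -1}])
print(json.dumps(q))
```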
The query returns personalized recommendations, similar items, or a mix that includes backfill. The query itself determines this by supplying item, user, or both. Some examples are:
{
"user": "xyz",
"fields": [
{
"name": "categories",
"values": ["series", "mini-series"],
"bias": -1 // filter out all except "series" or "mini-series"
},{
"name": "genre",
"values": ["sci-fi", "detective"],
"bias": 1.02 // boost/favor recommendations with "genre" = "sci-fi" or "detective"
}
]
}
This returns items based on user "xyz" history, filtered by categories and boosted to favor more genre-specific items. The values for fields have been attached to items with $set events, where the "name" corresponds to a doc field and the "values" correspond to the contents of the field. The "bias" is used to indicate a filter or a boost. For Solr or Elasticsearch the boost is sent as-is to the engine, and its meaning is determined by the engine (Lucene in either case). As always, the blacklist and backfill use the defaults in engine.json.
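To illustrate how such biases could map onto a search engine request, here is a hypothetical translation into an Elasticsearch-style bool query; this mapping is an assumption for illustration, not the UR's actual generated query:

```python
def fields_to_es_bool(fields):
    """Translate UR "fields" biases into an Elasticsearch-style bool
    query: bias < 0 becomes a filter clause, bias > 0 a boosted
    should clause (hypothetical mapping)."""
    bool_q = {"filter": [], "should": []}
    for f in fields:
        terms = {"terms": {f["name"]: f["values"]}}
        if f["bias"] < 0:
            bool_q["filter"].append(terms)
        else:
            terms["terms"]["boost"] = f["bias"]
            bool_q["should"].append(terms)
    return {"query": {"bool": bool_q}}

body = fields_to_es_bool([
    {"name": "categories", "values": ["series", "mini-series"], "bias": -1},
    {"name": "genre", "values": ["sci-fi", "detective"], "bias": 1.02},
])
```

The point is only the split: negative biases become hard constraints, positive ones reorder results without excluding anything.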
When a date is stored in an item's properties it can be used in a date range query. This is most often used by the app server, since it may know what the range is, while a client query may only know the current date and so use the "Current Date" filter below.
{
"user": "xyz",
"fields": [
{
"name": "categories",
"values": ["series", "mini-series"],
"bias": -1 // filter out all except "series" or "mini-series"
},{
"name": "genre",
"values": ["sci-fi", "detective"],
"bias": 1.02 // boost/favor recommendations with "genre" = "sci-fi" or "detective"
}
],
"dateRange": {
"name": "availabledate",
"beforeDate": "2015-08-20T11:28:45.114-07:00",
"afterDate": "2015-08-15T11:28:45.114-07:00"
}
}
Items are assumed to have a field of the same name that has a date attached to it with a $set event. The query will return only those recommendations where the date field is in range. Either date bound can be omitted for a one-sided range. The range applies to all returned recommendations, even those from popularity backfill.
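The range check amounts to a comparison on each item's date field, with either bound optional. A sketch (whether the bounds are inclusive is this sketch's assumption):

```python
from datetime import datetime

def in_date_range(item, name, before=None, after=None):
    """True if item[name] lies between after and before; either bound
    may be None for a one-sided range."""
    d = datetime.fromisoformat(item[name])
    if before is not None and d >= datetime.fromisoformat(before):
        return False
    if after is not None and d <= datetime.fromisoformat(after):
        return False
    return True

item = {"availabledate": "2015-08-18T00:00:00-07:00"}
print(in_date_range(item, "availabledate",
                    before="2015-08-20T11:28:45.114-07:00",
                    after="2015-08-15T11:28:45.114-07:00"))  # True
```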
When an available date and expire date are set on items, the current date can be used as a filter: the UR checks that the current date is before the expire date and on or after the available date. You can use the expire date, the available date, or both. The names of these item fields are specified in engine.json.
{
"user": "xyz",
"fields": [
{
"name": "categories",
"values": ["series", "mini-series"],
"bias": -1 // filter out all except "series" or "mini-series"
},{
"name": "genre",
"values": ["sci-fi", "detective"],
"bias": 1.02
}
],
"currentDate": "2015-08-15T11:28:45.114-07:00"
}
{
"user": "xyz",
"userBias": 2, // favor personal recommendations
"item": "53454543513", // fall back to contextual recommendations
"fields": [
{
"name": "categories",
"values": ["series", "mini-series"],
"bias": -1 // filter out all except "series" or "mini-series"
},{
"name": "genre",
"values": ["sci-fi", "detective"],
"bias": 1.02 // boost/favor recommendations with "genre" = "sci-fi" or "detective"
}
]
}
This returns items based on user xyz's history or similar to item 53454543513, favoring the user-history recommendations. These are filtered by categories and boosted to favor more genre-specific items.

Note: this query should be considered experimental. Mixing user history with item similarity is possible but may have unexpected results. If you use this, realize that user and item recommendations may be quite divergent, so mixing them in one query may produce nonsense. Use this only with the engine.json settings for "userbias" and "itembias" to favor one over the other.
{
}
This is a simple way to get popular items. All returned scores will be 0 but the order will be based on relative popularity. Field-based biases for boosts and filters can also be applied.
The Universal Recommender takes in potentially many events. These should be seen as a primary event, which is a very clear indication of user preference, and secondary events that we think may tell us something about user "taste" in some way. The UR is built on a distributed correlation engine, so it will test whether these secondary events actually relate to the primary one; those that do not correlate will have little or no effect on recommendations (though they will lengthen training and query times). It is recommended that you start with one or two events and increase the number as you see how these events affect results and timing.
Events in PredictionIO are sent to the EventServer in the following form:
{
"event" : "purchase",
"entityType" : "user",
"entityId" : "1243617",
"targetEntityType" : "item",
"targetEntityId" : "iPad",
"properties" : {},
"eventTime" : "2015-10-05T21:02:49.228Z"
}
This is what a "purchase" event looks like. Note that a usage event is always from a user and has a user id. Also, the "targetEntityType" is always "item"; the actual target entity type is implied by the event name. So to create a "category-preference" event you would send something like this:
{
"event" : "category-preference",
"entityType" : "user",
"entityId" : "1243617",
"targetEntityType" : "item",
"targetEntityId" : "electronics",
"properties" : {},
"eventTime" : "2015-10-05T21:02:49.228Z"
}
This event would be sent when the user clicked on the "electronics" category or perhaps purchased an item that was in the "electronics" category. Note that the "targetEntityType" is always "item".
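A small helper for composing such usage events client-side might look like this (it builds the payload only; actually sending it would go through a PredictionIO SDK or the EventServer REST API). The helper name is this sketch's invention, based on the JSON shape above:

```python
import json
from datetime import datetime, timezone

def usage_event(event, user_id, target_id, event_time=None):
    """Build a UR usage event: always from a "user", with
    targetEntityType fixed to "item" since the real target type
    is implied by the event name."""
    return {
        "event": event,
        "entityType": "user",
        "entityId": user_id,
        "targetEntityType": "item",  # always "item" for usage events
        "targetEntityId": target_id,
        "properties": {},
        "eventTime": event_time or datetime.now(timezone.utc).isoformat(),
    }

e = usage_event("category-preference", "1243617", "electronics",
                "2015-10-05T21:02:49.228Z")
print(json.dumps(e, indent=2))
```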
To attach properties to items use a $set event like this:
{
"event" : "$set",
"entityType" : "item",
"entityId" : "ipad",
"properties" : {
"category": ["electronics", "mobile-phones"],
"expireDate": "2016-10-05T21:02:49.228Z",
"availableDate": "2015-10-05T21:02:49.228Z"
},
"eventTime" : "2015-10-05T21:02:49.228Z"
}
Unless a property has a special meaning specified in engine.json, like the date values, a property is assumed to be an array of strings, which act as categorical tags. You can add something like "premium" to a "tier" property; then, if the user is a subscriber, you can set a filter that allows recommendations from "tier": ["free", "premium"], where a non-subscriber might only get recommendations for "tier": ["free"]. These are passed in to the query using the "fields" parameter (see Contextual queries above).
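For example, that tier-based business rule could be assembled per user like this (the "tier" property and its values are just the README's example):

```python
def tier_filter(is_subscriber):
    """Build the "fields" entry that restricts recommendations by tier.
    A negative bias makes it a filter rather than a boost."""
    tiers = ["free", "premium"] if is_subscriber else ["free"]
    return [{"name": "tier", "values": tiers, "bias": -1}]

# Subscribers see both tiers; everyone else only free items.
print(tier_filter(True))
print(tier_filter(False))
```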
Using properties is how boosts and filters are applied to recommended items. It may seem odd to treat a category both as a filter and as a secondary event (category-preference), but the two pieces of data are used in quite different ways. As properties they bias the recommendations; as events they add to the user data that produces recommendations. In other words, as properties they work with boost and filter business rules, while as secondary usage events they reveal something about user taste that makes recommendations better.
To begin using new data with an engine that has been used with sample data, or to use different events, follow these steps:

1. Pick a new app name and change appName in the new engine.json.
2. Run pio app new **your-new-app-name**
3. Make any changes to engine.json to specify new event names and config values. Make sure "eventNames": ["**your-primary-event**", "**a-secondary-event**", "**another-secondary-event**", ...] contains the exact strings used for your events and that the primary one is first in the list.
4. Import the new events: python examples/**your-python-import-script**.py --access_key **your-access-key**, where the key can be retrieved with pio app list.
5. Run pio build, pio train, and pio deploy.
6. Modify one of the example query scripts to match your new events, fields, and items.

Integration test: once PIO and all services are running, but before any model is deployed, run ./examples/integration-test
This will print a list of differences between the actual results and the expected results; none means the test passed. Note that the model will remain deployed and will have to be deployed over or killed by pid.
Event name restricted query test: this is for the feature that allows event names to be specified in the query. It restricts the user history used to create recommendations and is primarily for use with the MAP@k cross-validation test. The engine config removes the blacklisting of items, so it must be used when doing MAP@k calculations. This test uses the simple sample data. Steps to run the test:
1. pio app new handmade
2. python examples/import_handmade.py --access_key <key-from-app-new>
3. cp engine.json engine.json.orig
4. cp event-names-test=engine.json engine.json
5. pio train
6. pio deploy
7. ./examples/single-eventNames-query.sh
MAP@k: this tests the predictive power of each usage event/indicator. All eventNames used in queries must be removed from the blacklisted events in the engine.json used for a particular dataset. So if "eventNames": ["purchase","view"] is in the engine.json for the dataset, these events must be removed from the blacklist with "blacklist": [], which tells the engine not to blacklist items with these eventNames for a user. Allowing blacklisting would artificially lower MAP@k and so not give the desired result.
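For reference, MAP@k itself can be computed as below; this is the standard definition, assuming per-user lists of recommended ids and held-out relevant ids, not the template's own evaluation code:

```python
def average_precision_at_k(recommended, relevant, k):
    """Precision averaged over the ranks of hits in the top k."""
    relevant = set(relevant)
    hits, score = 0, 0.0
    for i, r in enumerate(recommended[:k]):
        if r in relevant:
            hits += 1
            score += hits / (i + 1.0)
    return score / min(len(relevant), k) if relevant else 0.0

def map_at_k(all_recs, all_relevant, k):
    """Mean of per-user average precision over all users with recs."""
    users = list(all_recs)
    return sum(average_precision_at_k(all_recs[u], all_relevant.get(u, []), k)
               for u in users) / len(users)

recs = {"u1": ["a", "b", "c"], "u2": ["x", "y", "z"]}
held_out = {"u1": ["a", "c"], "u2": ["q"]}
print(map_at_k(recs, held_out, 3))
```

Blacklisting the very events being predicted would remove the held-out items from the recommendation lists, driving these scores down, which is why the blacklist must be emptied for this test.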
Most of the pio train time is taken up by writing to Elasticsearch. This can be optimized by creating an ES cluster or giving ES lots of memory. Run pio deploy to make the new model active.

This software is licensed under the Apache Software Foundation version 2 license, found here: http://www.apache.org/licenses/LICENSE-2.0