When a bias of -1 is applied to a query, the only items returned are those that have received events during the last few days, rather than all the items available in the model. See dataset below.
A few things to know:
This bug is not due to popularity backfill vs. recommendations
This bug is not due to the 3-day default set in backfillField: extending this period to 30 days doesn't change the query results
This bug is different from the "bias" bug that was already fixed in v0.3.0
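For reference, here is the arithmetic behind the 30-day window as a small Python sketch. It assumes that the backfillField duration is expressed in seconds, which is my reading of the 2592000 value used in engine.json; the helper name is purely illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def days_to_seconds(days):
    """Convert a number of days to seconds (assumption: duration is in seconds)."""
    return days * SECONDS_PER_DAY


print(days_to_seconds(3))   # 259200, the 3-day default
print(days_to_seconds(30))  # 2592000, the value used in engine.json
```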
Here is a full example to reproduce the issue:
DATASET, saved in data/ks_events_plays.json.
Please note that all the items have a subdomain=staff property
IMPORTANT: the events' eventTime values play an important role in this issue. Make sure that the events are spread across time, starting on the day you reload the dataset. This will ensure that the bug rears its ugly head!
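If your copy of the dataset has stale timestamps, something like the following Python sketch can respread them. It assumes the file uses the usual pio import layout of one JSON event per line with an ISO 8601 eventTime; the function name and the 30-day window are my own choices, not part of the original setup.

```python
import json
from datetime import datetime, timedelta, timezone


def spread_event_times(lines, days=30, now=None):
    """Return event JSON lines with eventTime spread evenly over the last `days` days."""
    now = now or datetime.now(timezone.utc)
    events = [json.loads(line) for line in lines if line.strip()]
    if not events:
        return []
    step = timedelta(days=days) / len(events)
    out = []
    for i, event in enumerate(events):
        # Oldest event first; the last event lands exactly on `now` (the reload day).
        event_time = now - step * (len(events) - 1 - i)
        event["eventTime"] = event_time.isoformat(timespec="milliseconds")
        out.append(json.dumps(event))
    return out
```

To use it, read the lines of data/ks_events_plays.json, pass them through spread_event_times, and write the result back before running the import step.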
engine.json
{
  "comment": " This config file uses default settings for all but the required values see README.md for docs",
  "id": "default",
  "description": "Default settings",
  "engineFactory": "com.kanopy.RecommendationEngine",
  "datasource": {
    "params": {
      "appName": "ksrec",
      "eventNames": ["play"]
    }
  },
  "sparkConf": {
    "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
    "spark.kryo.registrator": "org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator",
    "spark.kryo.referenceTracking": "false",
    "spark.kryoserializer.buffer": "300m",
    "spark.executor.memory": "8g",
    "es.index.auto.create": "true"
  },
  "algorithms": [
    {
      "comment": "simplest setup where all values are default, popularity based backfill, must add eventsNames",
      "name": "ur",
      "params": {
        "appName": "ksrec",
        "indexName": "urindex",
        "typeName": "items",
        "comment": "must have data for the first event or the model will not build, other events are optional",
        "eventNames": ["play"],
        "backfillField": {
          "backfillType": "popular",
          "eventnames": ["play"],
          "duration": 2592000
        }
      }
    }
  ]
}
Importing / Training / Deploying
1) Deleting any existing data in ksrec
root@rec-2:~/ksrec# pio app data-delete ksrec
2) Importing the dataset
root@rec-2:~/ksrec# pio import --appid 1 --input data/ks_events_plays.json
3) Training
root@rec-2:~/ksrec# pio train -- --driver-memory 8G
4) Deploying
root@rec-2:~/ksrec# pio deploy --port 8000
Expected Results
Query 1 and Query 2 should return the same result set, but Query 2 returns a much smaller one: 3 items vs. 8 items for Query 1.
It appears that the 3 items returned are the ones that have received a "play" event in the previous couple of days.
Depending on the eventTime values you have set up in the dataset, you might see different numbers, but there should still be a discrepancy between the two query results.
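To make the comparison concrete, here is a rough Python sketch of how the two result sets can be diffed. The query shapes are assumptions on my part: an unfiltered user query versus the same query with a subdomain=staff field at bias -1, sent to the server deployed on port 8000. The user ID, the num value, and the helper names are placeholders, not the exact queries used above.

```python
import json
from urllib import request


def build_query(user, num=10, fields=None):
    """Build a UR-style query body; `fields` carries the optional bias filter."""
    query = {"user": user, "num": num}
    if fields:
        query["fields"] = fields
    return query


def send_query(query, url="http://localhost:8000/queries.json"):
    """POST a query to the deployed engine (assumes `pio deploy --port 8000`)."""
    req = request.Request(
        url,
        data=json.dumps(query).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))


def missing_items(unfiltered, filtered):
    """Items present in the unfiltered response but absent from the filtered one."""
    a = {s["item"] for s in unfiltered.get("itemScores", [])}
    b = {s["item"] for s in filtered.get("itemScores", [])}
    return sorted(a - b)


# Hypothetical pair of queries: since every item has subdomain=staff,
# both should return the same items if the filter behaves as expected.
query_1 = build_query("some-user-id")
query_2 = build_query(
    "some-user-id",
    fields=[{"name": "subdomain", "values": ["staff"], "bias": -1}],
)
```

Calling missing_items(send_query(query_1), send_query(query_2)) against a running server should list the items that the filtered query drops; an empty list would mean the two queries agree.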
After discussing with @pferrel on the Google Group, it seems that there is an issue with the way the UR is filtering results. See https://groups.google.com/forum/#!topic/predictionio-user/sXz7DoWqK3o
Thanks for looking into this.