actionml / template-scala-parallel-universal-recommendation


Bias -1 filters are filtering too much #9

Closed SimonDeconde closed 8 years ago

SimonDeconde commented 8 years ago

After discussing with @pferrel on the Google Group, it seems that there is an issue with the way the UR is filtering results. See https://groups.google.com/forum/#!topic/predictionio-user/sXz7DoWqK3o

When a bias -1 is applied to a query, the items returned are only the ones that have received events during the last few days, instead of extending the results to all the items available in the model. See dataset below.

Few things to know

Here is a full example to reproduce the issue:

DATASET, saved in data/ks_events_plays.json.

Please note that every item includes staff in its subdomains property.

IMPORTANT: the events' eventTime values play an important role in this issue. Make sure that the events are spread across time, starting on the day you reload the dataset. This will ensure that the bug rears its ugly head!

{"event":"$set","entityType":"item","entityId":"video0","properties":{"subdomains":["staff", "member", "public"]}}
{"event":"$set","entityType":"item","entityId":"video1","properties":{"subdomains":["staff", "member", "public"]}}
{"event":"$set","entityType":"item","entityId":"video2","properties":{"subdomains":["staff", "member", "public"]}}
{"event":"$set","entityType":"item","entityId":"video3","properties":{"subdomains":["staff", "member", "public"]}}
{"event":"$set","entityType":"item","entityId":"video4","properties":{"subdomains":["staff", "member", "public"]}}
{"event":"$set","entityType":"item","entityId":"video5","properties":{"subdomains":["staff", "member", "public"]}}
{"event":"$set","entityType":"item","entityId":"video6","properties":{"subdomains":["staff", "member", "public"]}}
{"event":"$set","entityType":"item","entityId":"video7","properties":{"subdomains":["staff", "member", "public"]}}
{"event":"play","entityType":"user","entityId":"1","targetEntityType":"item","targetEntityId":"video0","eventTime":"2016-03-13T01:58:16-0800"}
{"event":"play","entityType":"user","entityId":"1","targetEntityType":"item","targetEntityId":"video1","eventTime":"2016-03-12T01:58:16-0800"}
{"event":"play","entityType":"user","entityId":"1","targetEntityType":"item","targetEntityId":"video2","eventTime":"2016-03-11T01:58:16-0800"}
{"event":"play","entityType":"user","entityId":"1","targetEntityType":"item","targetEntityId":"video3","eventTime":"2016-03-10T01:58:16-0800"}
{"event":"play","entityType":"user","entityId":"1","targetEntityType":"item","targetEntityId":"video4","eventTime":"2016-03-09T01:58:16-0800"}
{"event":"play","entityType":"user","entityId":"1","targetEntityType":"item","targetEntityId":"video5","eventTime":"2016-03-08T01:58:16-0800"}
{"event":"play","entityType":"user","entityId":"1","targetEntityType":"item","targetEntityId":"video6","eventTime":"2016-03-07T01:58:16-0800"}
{"event":"play","entityType":"user","entityId":"1","targetEntityType":"item","targetEntityId":"video7","eventTime":"2016-03-06T01:58:16-0800"}
{"event":"play","entityType":"user","entityId":"2","targetEntityType":"item","targetEntityId":"video0","eventTime":"2016-03-13T01:58:16-0800"}
{"event":"play","entityType":"user","entityId":"2","targetEntityType":"item","targetEntityId":"video1","eventTime":"2016-03-12T01:58:16-0800"}
{"event":"play","entityType":"user","entityId":"2","targetEntityType":"item","targetEntityId":"video2","eventTime":"2016-03-11T01:58:16-0800"}
{"event":"play","entityType":"user","entityId":"2","targetEntityType":"item","targetEntityId":"video3","eventTime":"2016-03-10T01:58:16-0800"}
{"event":"play","entityType":"user","entityId":"2","targetEntityType":"item","targetEntityId":"video4","eventTime":"2016-03-09T01:58:16-0800"}
{"event":"play","entityType":"user","entityId":"2","targetEntityType":"item","targetEntityId":"video5","eventTime":"2016-03-08T01:58:16-0800"}
{"event":"play","entityType":"user","entityId":"2","targetEntityType":"item","targetEntityId":"video6","eventTime":"2016-03-07T01:58:16-0800"}
{"event":"play","entityType":"user","entityId":"2","targetEntityType":"item","targetEntityId":"video7","eventTime":"2016-03-06T01:58:16-0800"}
{"event":"play","entityType":"user","entityId":"3","targetEntityType":"item","targetEntityId":"video0","eventTime":"2016-03-13T01:58:16-0800"}
{"event":"play","entityType":"user","entityId":"3","targetEntityType":"item","targetEntityId":"video1","eventTime":"2016-03-12T01:58:16-0800"}
{"event":"play","entityType":"user","entityId":"3","targetEntityType":"item","targetEntityId":"video2","eventTime":"2016-03-11T01:58:16-0800"}
{"event":"play","entityType":"user","entityId":"3","targetEntityType":"item","targetEntityId":"video3","eventTime":"2016-03-10T01:58:16-0800"}
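Because the bug depends on the eventTime values being close to the day of the import, the dataset above may need to be regenerated before each test. The following is a minimal sketch of how that could be done, not part of the original report: the function names `build_events` and `to_json_lines` are hypothetical, and the per-user play counts (8, 8, 4) simply mirror the dataset above.

```python
import json
from datetime import datetime, timedelta, timezone


def build_events(now, num_items=8, plays_per_user=(8, 8, 4)):
    """Return $set and play events shaped like the dataset above.

    Play timestamps step back one day per item, starting at `now`.
    """
    events = []
    # One $set event per item, all sharing the same subdomains property.
    for i in range(num_items):
        events.append({
            "event": "$set",
            "entityType": "item",
            "entityId": f"video{i}",
            "properties": {"subdomains": ["staff", "member", "public"]},
        })
    # User N plays video0, video1, ... with each play one day older than the last.
    for user_index, plays in enumerate(plays_per_user, start=1):
        for i in range(plays):
            event_time = now - timedelta(days=i)
            events.append({
                "event": "play",
                "entityType": "user",
                "entityId": str(user_index),
                "targetEntityType": "item",
                "targetEntityId": f"video{i}",
                "eventTime": event_time.strftime("%Y-%m-%dT%H:%M:%S%z"),
            })
    return events


def to_json_lines(events):
    """Serialize events one JSON object per line, as pio import expects."""
    return "\n".join(json.dumps(e, separators=(",", ":")) for e in events)


if __name__ == "__main__":
    # Anchor the newest play event at the current time (UTC here for simplicity).
    print(to_json_lines(build_events(datetime.now(timezone.utc))))
```

Redirecting the output to data/ks_events_plays.json would give a file with the same shape as the one above, with the newest play event dated today.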

engine.json

{
  "comment":" This config file uses default settings for all but the required values see README.md for docs",
  "id": "default",
  "description": "Default settings",
  "engineFactory": "com.kanopy.RecommendationEngine",
  "datasource": {
    "params" : {
      "appName": "ksrec",
      "eventNames": ["play"]
    }
  },
  "sparkConf": {
    "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
    "spark.kryo.registrator": "org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator",
    "spark.kryo.referenceTracking": "false",
    "spark.kryoserializer.buffer": "300m",
    "spark.executor.memory": "8g",
    "es.index.auto.create": "true"
  },
  "algorithms": [
    {
      "comment": "simplest setup where all values are default, popularity based backfill, must add eventsNames",
      "name": "ur",
      "params": {
        "appName": "ksrec",
        "indexName": "urindex",
        "typeName": "items",
        "comment": "must have data for the first event or the model will not build, other events are optional",
        "eventNames": ["play"],
        "backfillField": {
            "backfillType": "popular",
            "eventnames": ["play"],
            "duration": 2592000
        }
      }
    }
  ]
}
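As a quick sanity check on the backfill "duration" value, the sketch below compares it with the time span of the play events in the dataset. It assumes the duration is expressed in seconds, which this thread does not state; the helper name `parse_event_time` is hypothetical.

```python
from datetime import datetime

# "duration" from engine.json; the unit is ASSUMED to be seconds.
DURATION_SECONDS = 2592000


def parse_event_time(value):
    """Parse timestamps in the format used by the dataset."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%S%z")


# Newest and oldest play events in the dataset.
newest = parse_event_time("2016-03-13T01:58:16-0800")
oldest = parse_event_time("2016-03-06T01:58:16-0800")

window_days = DURATION_SECONDS / 86400
span_days = (newest - oldest).total_seconds() / 86400

print(window_days, span_days)  # 30.0 7.0
```

Under that assumption the configured window is 30 days, while the play events span only 7 days, so the window on its own would not be expected to exclude any of them.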

Importing / Training / Deploying

1) Deleting any existing data in ksrec

root@rec-2:~/ksrec# pio app data-delete ksrec

2) Importing the dataset

root@rec-2:~/ksrec# pio import --appid 1 --input data/ks_events_plays.json

3) Training

root@rec-2:~/ksrec# pio train -- --driver-memory 8G

4) Deploying

root@rec-2:~/ksrec# pio deploy --port 8000

Raw CURL Queries

Query 1

curl -H "Content-Type: application/json" \
-d '{ "user": "100", "num": 10 }' http://localhost:8000/queries.json

Query 2

curl -H "Content-Type: application/json" \
-d '{ "user": "100", "num": 10, "fields": [{"name": "subdomains", "values": ["staff"], "bias": -1}] }' http://localhost:8000/queries.json

Executing these queries:

Query 1

root@rec-2:~/ksrec# curl -H "Content-Type: application/json" \
> -d '{ "user": "100", "num": 10 }' http://localhost:8000/queries.json
{"itemScores":[{"item":"video1","score":0.0},{"item":"video2","score":0.0},{"item":"video0","score":0.0},{"item":"video6","score":0.0},{"item":"video7","score":0.0},{"item":"video3","score":0.0},{"item":"video4","score":0.0},{"item":"video5","score":0.0}]}

Query 2

root@rec-2:~/ksrec# curl -H "Content-Type: application/json" \
> -d '{ "user": "100", "num": 10, "fields": [{"name": "subdomains", "values": ["staff"], "bias": -1}] }' http://localhost:8000/queries.json
{"itemScores":[{"item":"video1","score":0.0},{"item":"video2","score":0.0},{"item":"video0","score":0.0}]}

Expected Results

Query 1 and Query 2 should return the same result set, but Query 2 returns far fewer items: 3 versus 8 for Query 1. The 3 items returned appear to be the ones that received a "play" event in the previous couple of days.

Depending on the eventTime values in your dataset, you might see different numbers, but the two query results should still differ.
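To make the discrepancy explicit, the two responses can be compared as sets of item IDs. This is a small sketch using the responses captured above; the helper names `item_ids` and `missing_items` are hypothetical.

```python
import json

# Responses copied from the Query 1 and Query 2 runs above.
query1_response = (
    '{"itemScores":[{"item":"video1","score":0.0},{"item":"video2","score":0.0},'
    '{"item":"video0","score":0.0},{"item":"video6","score":0.0},'
    '{"item":"video7","score":0.0},{"item":"video3","score":0.0},'
    '{"item":"video4","score":0.0},{"item":"video5","score":0.0}]}'
)
query2_response = (
    '{"itemScores":[{"item":"video1","score":0.0},{"item":"video2","score":0.0},'
    '{"item":"video0","score":0.0}]}'
)


def item_ids(response_text):
    """Extract the set of item IDs from a queries.json response body."""
    return {entry["item"] for entry in json.loads(response_text)["itemScores"]}


def missing_items(unfiltered, filtered):
    """Items present in the unfiltered response but absent from the filtered one."""
    return sorted(item_ids(unfiltered) - item_ids(filtered))


print(missing_items(query1_response, query2_response))
# ['video3', 'video4', 'video5', 'video6', 'video7']
```

Since every item has staff in its subdomains, the expected output of this comparison is an empty list.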

Thanks for looking into this.

pferrel commented 8 years ago

The default duration for the "pop-model" is now 10 years, so I took the duration out of the engine.json.

Tests pass, identical results.

pferrel commented 8 years ago

Nothing to fix. This may have been a bug in 0.2.3, but it is not present in 0.3.0.