apache / pinot

Apache Pinot - A realtime distributed OLAP datastore
https://pinot.apache.org/
Apache License 2.0

Pinot Recommendation Engine does not support BOOLEAN type #7983

Closed lksvenoy-r7 closed 2 years ago

lksvenoy-r7 commented 2 years ago

Current Behavior

The Pinot Recommendation Engine does not support the BOOLEAN type (see the isEmployed dimension in the input below).

Recommender Input (Using BOOLEAN type):

{
  "schema":{
    "dimensionFieldSpecs": [
      {
        "cardinality": 10000,
        "dataType": "LONG",
        "name": "studentID"
      },
      {
        "averageLength": 8,
        "cardinality": 2000,
        "dataType": "STRING",
        "name": "firstName"
      },
      {
        "averageLength": 12,
        "cardinality": 2000,
        "dataType": "STRING",
        "name": "lastName"
      },
      {
        "averageLength": 6,
        "cardinality": 2,
        "dataType": "STRING",
        "name": "gender"
      },
      {
        "averageLength": 25,
        "cardinality": 100,
        "dataType": "STRING",
        "name": "subject"
      },
      {
        "cardinality": 2,
        "dataType": "BOOLEAN",
        "name": "isEmployed"
      }
    ],
    "metricFieldSpecs": [
      {
        "cardinality": 5000,
        "dataType": "FLOAT",
        "name": "score"
      }
    ],
    "schemaName": "transcript"
  },
  "queriesWithWeights":{
    "select subject, count(*) from transcript where score > 3 and gender = 'MALE' group by subject": 0.5,
    "select subject, score from transcript where firstName = 'Tsubasa' and lastName = 'Oozora'": 0.5
  },
  "tableType": "OFFLINE",
  "numRecordsPerPush":100000000,
  "qps": 5,
  "latencySLA": 1000,
  "rulesToExecute": {
    "recommendRealtimeProvisioning": false
  }
}

Output

{
  "_code": 400,
  "_error": "java.lang.RuntimeException: number generator can only accept a column of type number and this : BOOLEAN is not a supported number type"
}

Expected Behavior

The recommendation engine should be able to deal with the BOOLEAN type by simply converting it to its internal representation (integer).
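To illustrate the expected behavior, here is a minimal Java sketch of mapping a declared column type to its internal representation before checking whether the number generator can handle it. The `DataType` enum, `toStoredType`, and `acceptedByNumberGenerator` are hypothetical names made up for this example; they are not taken from the Pinot code base, and the actual fix may look different.

```java
public class StoredTypeSketch {

    // Hypothetical, simplified stand-in for a column data type enum.
    public enum DataType { INT, LONG, FLOAT, DOUBLE, BOOLEAN, STRING }

    // Assumption taken from this issue: BOOLEAN is represented internally as an integer.
    public static DataType toStoredType(DataType declared) {
        return declared == DataType.BOOLEAN ? DataType.INT : declared;
    }

    // True for the types a number generator would plausibly accept.
    public static boolean isNumeric(DataType type) {
        switch (type) {
            case INT:
            case LONG:
            case FLOAT:
            case DOUBLE:
                return true;
            default:
                return false;
        }
    }

    // Resolve the stored type first, then apply the numeric check.
    public static boolean acceptedByNumberGenerator(DataType declared) {
        return isNumeric(toStoredType(declared));
    }

    public static void main(String[] args) {
        for (DataType type : DataType.values()) {
            System.out.println(type + " stored as " + toStoredType(type)
                + ", accepted by number generator: " + acceptedByNumberGenerator(type));
        }
    }
}
```

With this kind of mapping, a BOOLEAN column such as isEmployed would be handled like an INT column instead of triggering the error shown above.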

Workaround

Internally, BOOLEAN is treated as an integer, and the recommendation engine should respect this. Here is an example that declares isEmployed as INT instead of BOOLEAN, which works:

Recommender Input (Using INT type):

{
  "schema":{
    "dimensionFieldSpecs": [
      {
        "cardinality": 10000,
        "dataType": "LONG",
        "name": "studentID"
      },
      {
        "averageLength": 8,
        "cardinality": 2000,
        "dataType": "STRING",
        "name": "firstName"
      },
      {
        "averageLength": 12,
        "cardinality": 2000,
        "dataType": "STRING",
        "name": "lastName"
      },
      {
        "averageLength": 6,
        "cardinality": 2,
        "dataType": "STRING",
        "name": "gender"
      },
      {
        "averageLength": 25,
        "cardinality": 100,
        "dataType": "STRING",
        "name": "subject"
      },
      {
        "cardinality": 2,
        "dataType": "INT",
        "name": "isEmployed"
      }
    ],
    "metricFieldSpecs": [
      {
        "cardinality": 5000,
        "dataType": "FLOAT",
        "name": "score"
      }
    ],
    "schemaName": "transcript"
  },
  "queriesWithWeights":{
    "select subject, count(*) from transcript where score > 3 and gender = 'MALE' group by subject": 0.5,
    "select subject, score from transcript where firstName = 'Tsubasa' and lastName = 'Oozora'": 0.5
  },
  "tableType": "OFFLINE",
  "numRecordsPerPush":100000000,
  "qps": 5,
  "latencySLA": 1000,
  "rulesToExecute": {
    "recommendRealtimeProvisioning": false
  }
}

Output

{
  "realtimeProvisioningRecommendations": {},
  "segmentSizeRecommendations": {
    "message": null,
    "numRowsPerSegment": 33333333,
    "numSegments": 3,
    "segmentSize": 488491328
  },
  "partitionConfig": {
    "numKafkaPartitions": 0,
    "numPartitionsRealtime": 1,
    "partitionDimension": "",
    "numPartitionsOffline": 1,
    "numPartitionsOfflineOverwritten": false,
    "numPartitionsRealtimeOverwritten": false,
    "partitionDimensionOverwritten": false
  },
  "flaggedQueries": {
    "flaggedQueries": {}
  },
  "indexConfig": {
    "sortedColumnOverwritten": true,
    "invertedIndexColumns": [
      "gender"
    ],
    "noDictionaryColumns": [
      "studentID",
      "score",
      "isEmployed"
    ],
    "onHeapDictionaryColumns": [],
    "varLengthDictionaryColumns": [
      "firstName",
      "lastName",
      "gender",
      "subject"
    ],
    "sortedColumn": "firstName",
    "bloomFilterColumns": [],
    "rangeIndexColumns": [
      "score"
    ]
  },
  "aggregateMetrics": false
}

Jackie-Jiang commented 2 years ago

We should probably add a BooleanGenerator. Do you want to help contribute a fix?
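As a rough idea of what such a generator could look like, here is a minimal, self-contained Java sketch. The `Generator` interface below is a hypothetical stand-in defined only for this example, and the choice to emit 0/1 integers is an assumption based on BOOLEAN being stored as an integer; the class that actually landed in Pinot may differ.

```java
import java.util.Random;

public class BooleanGeneratorSketch {

    // Hypothetical generator contract, defined here only for illustration.
    public interface Generator {
        void init();
        Object next();
    }

    // Produces random boolean values encoded as integers (1 = true, 0 = false).
    public static class BooleanGenerator implements Generator {
        private final long seed;
        private Random random;

        public BooleanGenerator(long seed) {
            this.seed = seed;
            this.random = new Random(seed);
        }

        // Resets the generator so the same seed yields the same sequence.
        @Override
        public void init() {
            random = new Random(seed);
        }

        @Override
        public Integer next() {
            return random.nextBoolean() ? 1 : 0;
        }
    }

    public static void main(String[] args) {
        BooleanGenerator generator = new BooleanGenerator(7L);
        generator.init();
        for (int i = 0; i < 5; i++) {
            System.out.println(generator.next());
        }
    }
}
```

A column with cardinality 2, like isEmployed in the input above, would then be served by this generator rather than by the number generator.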

lksvenoy-r7 commented 2 years ago

Sure, I'd love to, but it's going to take me a little while to get around to it.

lksvenoy-r7 commented 2 years ago

Here is the PR: https://github.com/apache/pinot/pull/8055/files

lksvenoy-r7 commented 2 years ago

Closing this issue, as it was fixed in https://github.com/apache/pinot/pull/8055/files