apache/pinot

Apache Pinot - A realtime distributed OLAP datastore
https://pinot.apache.org/
Apache License 2.0

JSON ingestion failing for some array types #8635

Open ryanruaneyougov opened 2 years ago

ryanruaneyougov commented 2 years ago

I have found that, using JSON ingestion, I can ingest all types as multi-valued dimension columns with the exception of BOOLEAN, TIMESTAMP, and BYTES. I believe that JSON_ARRAY isn't a valid type, but I wasn't sure about BYTES_ARRAY. If the multi-valued versions of those types are removed from the schema and data files below, ingestion succeeds and I can inspect the table in the Pinot browser. If anyone is around and can shed some light, I would be very appreciative.

JSON Ingestion

For BYTES_ARRAY I get:

java.lang.UnsupportedOperationException: Unsupported data type : BYTES

For BOOLEAN_ARRAY I get:

java.lang.ClassCastException: class [Z cannot be cast to class java.lang.Integer ([Z and java.lang.Integer are in module java.base of loader 'bootstrap')

For TIMESTAMP_ARRAY I get:

java.lang.ClassCastException: class java.sql.Timestamp cannot be cast to class java.lang.Long (java.sql.Timestamp is in module java.sql of loader 'platform'; java.lang.Long is in module java.base of loader 'bootstrap')
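
For context, both ClassCastExceptions can be reproduced in plain Java outside of Pinot. Here is a minimal sketch of the failing casts (the class and variable names are mine, not Pinot's):

public class CastRepro {
    public static void main(String[] args) {
        // A boolean[] has the JVM descriptor [Z; casting it to Integer fails at runtime.
        Object gamesWon = new boolean[] {true, false, true};
        try {
            Integer unused = (Integer) gamesWon;
        } catch (ClassCastException e) {
            System.out.println(e.getMessage()); // class [Z cannot be cast to class java.lang.Integer ...
        }

        // java.sql.Timestamp is not a java.lang.Long, so an unboxing cast to long fails too.
        Object datePlayed = java.sql.Timestamp.valueOf("2020-01-01 10:45:28");
        try {
            long unused = (long) datePlayed;
        } catch (ClassCastException e) {
            System.out.println(e.getMessage()); // class java.sql.Timestamp cannot be cast to class java.lang.Long ...
        }
    }
}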

Here is my cluster:

version: "3"

services:
  pinot-zookeeper:
    image: apachepinot/pinot:release-0.10.0
    hostname: pinot-zookeeper
    container_name: "pinot-client-rust-pinot-zookeeper"
    ports:
      - "2181:2181"
    command: StartZookeeper

  pinot-controller:
    image: apachepinot/pinot:release-0.10.0
    hostname: pinot-controller
    container_name: "pinot-client-rust-pinot-controller"
    volumes:
      - ./db:/db
    ports:
      - "9000:9000"
    command: StartController -zkAddress pinot-zookeeper:2181
    depends_on:
      - pinot-zookeeper

  pinot-broker:
    image: apachepinot/pinot:release-0.10.0
    hostname: pinot-broker
    container_name: "pinot-client-rust-pinot-broker"
    volumes:
      - ./db:/db
    ports:
      - "8099:8099"
    command: StartBroker -zkAddress pinot-zookeeper:2181
    restart: unless-stopped
    depends_on:
      - pinot-zookeeper
      - pinot-controller

  pinot-server:
    image: apachepinot/pinot:release-0.10.0
    hostname: pinot-server
    container_name: "pinot-client-rust-pinot-server"
    volumes:
      - ./db:/db
    ports:
      - "8098:8098"
    command: StartServer -zkAddress pinot-zookeeper:2181
    depends_on:
      - pinot-zookeeper
      - pinot-controller

# cargo will try to re-download packages on docker-compose up, so store them here.
volumes:
  pgdata: {}

Here is my schema:

{
  "schemaName": "scoreSheet",
  "dimensionFieldSpecs": [
    {
      "name": "handle",
      "dataType": "STRING"
    },
    {
      "name": "names",
      "dataType": "STRING",
      "singleValueField": false
    },
    {
      "name": "age",
      "dataType": "INT"
    },
    {
      "name": "gameIds",
      "dataType": "INT",
      "singleValueField": false
    },
    {
      "name": "hasPlayed",
      "dataType": "BOOLEAN"
    },
    {
      "name": "gamesWon",
      "dataType": "BOOLEAN",
      "singleValueField": false
    },
    {
      "name": "dateOfBirth",
      "dataType": "TIMESTAMP"
    },
    {
      "name": "datesPlayed",
      "dataType": "TIMESTAMP",
      "singleValueField": false
    },
    {
      "name": "scores",
      "dataType": "LONG",
      "singleValueField": false
    },
    {
      "name": "handicapAdjustedScores",
      "dataType": "FLOAT",
      "singleValueField": false
    },
    {
      "name": "handicapAdjustedScores_highPrecision",
      "dataType": "FLOAT",
      "singleValueField": false
    },
    {
      "name": "extra",
      "dataType": "JSON"
    },
    {
      "name": "raw",
      "dataType": "BYTES"
    },
    {
      "name": "rawArray",
      "dataType": "BYTES",
      "singleValueField": false
    }
  ],
  "metricFieldSpecs": [
    {
      "name": "totalScore",
      "dataType": "LONG"
    },
    {
      "name": "avgScore",
      "dataType": "FLOAT"
    },
    {
      "name": "avgScore_highPrecision",
      "dataType": "DOUBLE"
    }
  ],
  "dateTimeFieldSpecs": [
    {
      "name": "dateOfFirstGame",
      "dataType": "LONG",
      "format": "1:MILLISECONDS:EPOCH",
      "granularity": "1:MILLISECONDS"
    }
  ]
}

Here is my table:

{
    "tableName": "scoreSheet",
    "tableType": "OFFLINE",
    "segmentsConfig": {
      "replication": 1
    },
    "tenants": {
      "broker":"DefaultTenant",
      "server":"DefaultTenant"
    },
    "tableIndexConfig": {
      "loadMode": "MMAP"
    },
    "ingestionConfig": {
      "batchIngestionConfig": {
        "segmentIngestionType": "APPEND",
        "segmentIngestionFrequency": "DAILY"
      }
    },
    "metadata": {}
}

Here is my ingestion job:

executionFrameworkSpec:
  name: 'standalone'
  segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner'
  segmentTarPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentTarPushJobRunner'
jobType: SegmentCreationAndTarPush
inputDirURI: '/db/score_sheet'
# includeFileNamePattern: 'glob:**/data.csv'
includeFileNamePattern: 'glob:**/data.json'
outputDirURI: '/opt/pinot/data/score_sheet'
overwriteOutput: true
pinotFSSpecs:
  - scheme: file
    className: org.apache.pinot.spi.filesystem.LocalPinotFS
recordReaderSpec:
  # dataFormat: 'csv'
  dataFormat: 'json'
  # className: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReader'
  className: 'org.apache.pinot.plugin.inputformat.json.JSONRecordReader'
  # configClassName: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReaderConfig'
tableSpec:
  tableName: 'scoreSheet'
pinotClusterSpecs:
  - controllerURI: 'http://localhost:9000'

Here is my data:

[
  {
    "names": ["James", "Smith"],
    "gameIds": [1, 2, 3],
    "datesPlayed": ["2020-01-01 10:45:28", "2020-02-01 10:45:28", "2020-03-01 10:45:28"],
    "gamesWon": [true, false, true],
    "scores": [3, 6, 2],
    "handicapAdjustedScores": [2.1, 4.9, 3.2],
    "handicapAdjustedScores_highPrecision": [2.15, 4.99, 3.21],
    "rawArray": ["cd", "ef"],
    "handle": "Gladiator",
    "age": 10,
    "totalScore": 11,
    "avgScore": 3.6,
    "avgScore_highPrecision": 3.66,
    "hasPlayed": true,
    "dateOfBirth": "2011-01-01 00:00:00",
    "dateOfFirstGame": 1577875528000,
    "extra": "{\"a\": \"b\"}",
    "raw": "ab"
  },
  {
    "names": ["Giles", "Richie"],
    "gameIds": [],
    "datesPlayed":[] ,
    "gamesWon": [],
    "scores": [],
    "handicapAdjustedScores": [],
    "handicapAdjustedScores_highPrecision": [],
    "rawArray": [],
    "handle": "Thrumbar",
    "age": 30,
    "totalScore": 0,
    "avgScore": 0,
    "avgScore_highPrecision": 0,
    "hasPlayed": false,
    "dateOfBirth": 662688000000,
    "dateOfFirstGame": 1420070400001,
    "extra": {},
    "raw": ""
  }
]

CSV Ingestion

Using the following ingestion job, schema, and CSV file, an INT was successfully ingested into an INT multi-value column:

Ingestion job:

executionFrameworkSpec:
  name: 'standalone'
  segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner'
  segmentTarPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentTarPushJobRunner'
jobType: SegmentCreationAndTarPush
inputDirURI: '/db/score_sheet'
includeFileNamePattern: 'glob:**/data.csv'
# includeFileNamePattern: 'glob:**/data.json'
outputDirURI: '/opt/pinot/data/score_sheet'
overwriteOutput: true
pinotFSSpecs:
  - scheme: file
    className: org.apache.pinot.spi.filesystem.LocalPinotFS
recordReaderSpec:
  dataFormat: 'csv'
  # dataFormat: 'json'
  className: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReader'
  # className: 'org.apache.pinot.plugin.inputformat.json.JSONRecordReader'
  configClassName: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReaderConfig'
tableSpec:
  tableName: 'scoreSheet'
pinotClusterSpecs:
  - controllerURI: 'http://localhost:9000'

Schema:

{
  "schemaName": "scoreSheet",
  "dimensionFieldSpecs": [
    {
      "name": "gamesWon",
      "dataType": "INT",
      "singleValueField": false
    }
  ],
  "metricFieldSpecs": [
  ],
  "dateTimeFieldSpecs": [
  ]
}

CSV:

gamesWon
1

However, the aforementioned error from JSON ingestion also appears when this is tried with booleans:

Error:

2022/05/05 08:44:40.594 ERROR [SegmentGenerationJobRunner] [pool-2-thread-1] Failed to generate Pinot segment for file - file:/db/score_sheet/data.json
java.lang.ClassCastException: class [Z cannot be cast to class java.lang.Integer ([Z and java.lang.Integer are in module java.base of loader 'bootstrap')

Schema:

{
  "schemaName": "scoreSheet",
  "dimensionFieldSpecs": [
    {
      "name": "gamesWon",
      "dataType": "BOOLEAN",
      "singleValueField": false
    }
  ],
  "metricFieldSpecs": [
  ],
  "dateTimeFieldSpecs": [
  ]
}

CSV:

gamesWon
true

and

gamesWon
1

KKcorps commented 2 years ago

The java.lang.UnsupportedOperationException: Unsupported data type : BYTES seems to be due to a missing case for BYTES in the dictionary creator. @Jackie-Jiang SegmentDictionaryCreator.indexOfMV doesn't have the BYTES case handled. Is that by design?
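
To illustrate the failure mode, here is a hypothetical sketch (this is not Pinot's actual SegmentDictionaryCreator code; the enum and method body are stand-ins):

public class DictionaryLookupSketch {
    // Stand-in for Pinot's FieldSpec.DataType, reduced to what the sketch needs.
    enum DataType { INT, LONG, FLOAT, DOUBLE, STRING, BYTES }

    // A multi-value lookup whose switch omits BYTES falls through to the
    // default branch and produces exactly the exception reported above.
    static int[] indexOfMV(DataType type, Object values) {
        switch (type) {
            case INT:
            case LONG:
            case FLOAT:
            case DOUBLE:
            case STRING:
                return new int[0]; // real code would map each value to its dictionary id
            // case BYTES:        // <-- the missing branch would go here
            default:
                throw new UnsupportedOperationException("Unsupported data type : " + type);
        }
    }

    public static void main(String[] args) {
        // Throws: java.lang.UnsupportedOperationException: Unsupported data type : BYTES
        indexOfMV(DataType.BYTES, new byte[][] {{(byte) 0xcd}, {(byte) 0xef}});
    }
}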

KKcorps commented 2 years ago

The other errors are due to hard casts in the PreIndexStatsCollector classes, e.g. long value = (long) entry; which throws when entry is a Timestamp. This definitely seems like a bug to me. Will work on the fix.
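
A minimal sketch of the kind of fix, assuming the collector converts each entry before casting (illustrative only, not the actual patch; the helper names are mine):

import java.sql.Timestamp;

public class StatsCollectorCastSketch {
    // Convert an ingested entry to the long the stats collector expects,
    // instead of hard-casting: a Timestamp becomes its epoch-millis value.
    static long toLong(Object entry) {
        if (entry instanceof Timestamp) {
            return ((Timestamp) entry).getTime();
        }
        return ((Number) entry).longValue();
    }

    // Likewise for BOOLEAN, which Pinot represents internally as an int (0/1).
    static int toInt(Object entry) {
        if (entry instanceof Boolean) {
            return ((Boolean) entry) ? 1 : 0;
        }
        return ((Number) entry).intValue();
    }

    public static void main(String[] args) {
        // Epoch millis for 2020-01-01 10:45:28 in the local timezone.
        System.out.println(toLong(Timestamp.valueOf("2020-01-01 10:45:28")));
        System.out.println(toInt(Boolean.TRUE)); // 1
    }
}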

Jackie-Jiang commented 2 years ago

Initially we didn't support BOOLEAN, TIMESTAMP, or BYTES as MV columns, and the support was added recently. Some code paths might have been missed, and we should fix them. cc @richardstartin