apache / pinot

Apache Pinot - A realtime distributed OLAP datastore
https://pinot.apache.org/
Apache License 2.0

Null in all columns on Batch ingesting ORC data from S3 #8460

Closed: stym06 closed this issue 2 years ago

stym06 commented 2 years ago

Hey guys, I've been trying to ingest data stored on S3 in ORC format using the Pinot batch ingestion job, launched with the command below:

./pinot-admin.sh LaunchDataIngestionJob -jobSpecFile batch-job-standalone-spec.yaml

Ingestion job spec

executionFrameworkSpec:
  name: 'standalone'
  segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner'
  segmentTarPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentTarPushJobRunner'
  segmentUriPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentUriPushJobRunner'
  segmentMetadataPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentMetadataPushJobRunner'
jobType: SegmentCreationAndMetadataPush
inputDirURI: 's3://test-bucket/dev/pinot-input-new/'
outputDirURI: 's3://test-bucket/dev/pinot/axon_entity.db/segments-v2'
overwriteOutput: true
pinotFSSpecs:
  - scheme: s3
    className: org.apache.pinot.plugin.filesystem.S3PinotFS
    configs:
      region: ap-southeast-1
recordReaderSpec:
  dataFormat: 'orc'
  className: 'org.apache.pinot.plugin.inputformat.orc.ORCRecordReader'
tableSpec:
  tableName: 'user_base_fact'
  schemaURI: 'http://localhost:9000/tables/user_base_fact/schema'
  tableConfigURI: 'http://localhost:9000/tables/user_base_fact'
pinotClusterSpecs:
  - controllerURI: 'http://localhost:9000'
pushJobSpec:
  pushParallelism: 2
  pushAttempts: 2
  pushRetryIntervalMillis: 1000

The job completes, but every column in the Pinot table ends up null:

(Screenshot, 2022-04-04 3:13:38 PM: the Pinot table showing null in all columns)

However, when I read the ORC file directly with orc-tools, the values are there:

java -jar orc-tools-1.5.5-uber.jar data 000000_0 | jq . | head -100

{
  "_col0": "750",
  "_col1": "customer",
  "_col2": "mumbai",
  "_col3": "mumbai",
  "_col4": "micro",
  "_col5": "active",
  "_col6": "sensitive",
  "_col7": "FALSE",
  "_col8": "IN",
  "_col9": "5.6.6",
  "_col10": "FALSE",
  "_col11": "FALSE",
  "_col12": "FALSE",
  "_col13": 1,
  "_col14": 0,
  "_col15": 0,
  "_col16": "mumbai",
  "_col17": "2015-11-12 14:28:16",
  "_col18": "2017-05-08",
  "_col19": null,
  "_col20": "2015-11-12",
  "_col21": null,
  "_col22": null
}

Schema spec

{
  "schemaName": "user_base_fact",
  "dimensionFieldSpecs": [
    {
      "name": "entity_id",
      "dataType": "STRING"
    },
    {
      "name": "entity_type",
      "dataType": "STRING"
    },
    {
      "name": "primary_city",
      "dataType": "STRING"
    },
    {
      "name": "last_ride_city",
      "dataType": "STRING"
    },
    {
      "name": "prefcat",
      "dataType": "STRING"
    },
    {
      "name": "activity",
      "dataType": "STRING"
    },
    {
      "name": "price_sensitivity",
      "dataType": "STRING"
    },
    {
      "name": "is_corporate",
      "dataType": "STRING"
    },
    {
      "name": "country_filter",
      "dataType": "STRING"
    },
    {
      "name": "app_version",
      "dataType": "STRING"
    },
    {
      "name": "select_active",
      "dataType": "STRING"
    },
    {
      "name": "sharepass_active",
      "dataType": "STRING"
    },
    {
      "name": "cabpass_active",
      "dataType": "STRING"
    },
    {
      "name": "fallback_city",
      "dataType": "STRING"
    },
    {
      "name": "first_ride_timestamp",
      "dataType": "STRING"
    },
    {
      "name": "auto_first_ride_date",
      "dataType": "STRING"
    },
    {
      "name": "bike_first_ride_date",
      "dataType": "STRING"
    },
    {
      "name": "mmpp_first_ride_date",
      "dataType": "STRING"
    },
    {
      "name": "offer_tag",
      "dataType": "STRING"
    },
    {
      "name": "offer",
      "dataType": "STRING"
    }
  ],
  "metricFieldSpecs": [
    {
      "name": "select_bought",
      "dataType": "INT"
    },
    {
      "name": "sharepass_bought",
      "dataType": "INT"
    },
    {
      "name": "cabpass_bought",
      "dataType": "INT"
    }
  ]
}

Table spec

{
  "OFFLINE": {
    "tableName": "user_base_fact_OFFLINE",
    "tableType": "OFFLINE",
    "segmentsConfig": {
      "segmentPushFrequency": "DAILY",
      "replication": "1",
      "replicasPerPartition": "1",
      "segmentPushType": "APPEND",
      "schemaName": "user_base_fact"
    },
    "tenants": {
      "broker": "DefaultTenant",
      "server": "DefaultTenant"
    },
    "tableIndexConfig": {
      "invertedIndexColumns": [],
      "noDictionaryColumns": [],
      "rangeIndexColumns": [],
      "rangeIndexVersion": 2,
      "autoGeneratedInvertedIndex": false,
      "createInvertedIndexDuringSegmentGeneration": false,
      "sortedColumn": [],
      "bloomFilterColumns": [],
      "loadMode": "MMAP",
      "onHeapDictionaryColumns": [],
      "varLengthDictionaryColumns": [],
      "enableDefaultStarTree": false,
      "enableDynamicStarTreeCreation": false,
      "aggregateMetrics": false,
      "nullHandlingEnabled": false
    },
    "metadata": {},
    "quota": {},
    "routing": {},
    "query": {},
    "fieldConfigList": [],
    "ingestionConfig": {
      "transformConfigs": []
    },
    "isDimTable": false
  }
}

KKcorps commented 2 years ago

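The schema field names do not seem to match the ORC field names: the ORC file uses _col0, _col1, and so on, while the schema declares entity_id, entity_type, and so on.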
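
A quick way to confirm this kind of mismatch is to compare the names in the schema against the keys of one record dumped from the file. The following is a minimal Python sketch, not part of Pinot: the helper names schema_field_names and missing_from_source are hypothetical, and the inline sample data is a shortened copy of the schema and ORC dump posted above.

```python
import json

# One record as printed by orc-tools above, shortened to three columns.
orc_record = json.loads('{"_col0": "750", "_col1": "customer", "_col2": "mumbai"}')

# A shortened copy of the Pinot schema above.
schema = {
    "schemaName": "user_base_fact",
    "dimensionFieldSpecs": [
        {"name": "entity_id", "dataType": "STRING"},
        {"name": "entity_type", "dataType": "STRING"},
    ],
    "metricFieldSpecs": [{"name": "select_bought", "dataType": "INT"}],
}


def schema_field_names(schema):
    """Collect field names from every '...FieldSpecs' list in a schema dict."""
    names = []
    for key, value in schema.items():
        if key.endswith("FieldSpecs") and isinstance(value, list):
            names.extend(spec["name"] for spec in value)
    return names


def missing_from_source(schema, record):
    """Return schema fields that have no same-named key in the source record."""
    return [name for name in schema_field_names(schema) if name not in record]


missing = missing_from_source(schema, orc_record)
print(missing)
```

If the printed list contains every schema field, as it does for this sample, none of the source columns line up with the schema by name, which is consistent with the all-null table described above.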

stym06 commented 2 years ago

Thanks @KKcorps. Changing the column names worked!
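
The thread does not say which side was renamed or what the final names were. As an illustration only, here is a small Python sketch of renaming record keys with an explicit mapping before the data is written out. The function rename_fields and the two pairs in COLUMN_MAPPING are assumptions made for this example; the real correspondence between _colN and the schema fields is known only to whoever produced the ORC file.

```python
# Hypothetical mapping from ORC column names to schema field names.
# These two pairs are guesses for illustration, not taken from the thread.
COLUMN_MAPPING = {
    "_col0": "entity_id",
    "_col1": "entity_type",
}


def rename_fields(record, mapping):
    """Return a copy of record with keys renamed per mapping; other keys are kept."""
    return {mapping.get(key, key): value for key, value in record.items()}


record = {"_col0": "750", "_col1": "customer", "_col2": "mumbai"}
renamed = rename_fields(record, COLUMN_MAPPING)
print(renamed)
```

Keys without an entry in the mapping pass through unchanged, so a partial mapping is easy to spot: any remaining _colN key still needs a name.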