stym06 closed 2 years ago
Hey guys, I've been trying to ingest data stored on S3 in ORC format using the Pinot ingestor with the below command:
./pinot-admin.sh LaunchDataIngestionJob -jobSpecFile batch-job-standalone-spec.yaml
Ingestion job spec
executionFrameworkSpec:
  name: 'standalone'
  segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner'
  segmentTarPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentTarPushJobRunner'
  segmentUriPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentUriPushJobRunner'
  segmentMetadataPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentMetadataPushJobRunner'
jobType: SegmentCreationAndMetadataPush
inputDirURI: 's3://test-bucket/dev/pinot-input-new/'
outputDirURI: 's3://test-bucket/dev/pinot/axon_entity.db/segments-v2'
overwriteOutput: true
pinotFSSpecs:
  - scheme: s3
    className: org.apache.pinot.plugin.filesystem.S3PinotFS
    configs:
      region: ap-southeast-1
recordReaderSpec:
  dataFormat: 'orc'
  className: 'org.apache.pinot.plugin.inputformat.orc.ORCRecordReader'
tableSpec:
  tableName: 'user_base_fact'
  schemaURI: 'http://localhost:9000/tables/user_base_fact/schema'
  tableConfigURI: 'http://localhost:9000/tables/user_base_fact'
pinotClusterSpecs:
  - controllerURI: 'http://localhost:9000'
pushJobSpec:
  pushParallelism: 2
  pushAttempts: 2
  pushRetryIntervalMillis: 1000
The job completes, but every value in the Pinot table comes out null:
However, when I read the ORC file directly, the values are there:
java -jar orc-tools-1.5.5-uber.jar data 000000_0 | jq . | head -100
{
  "_col0": "750",
  "_col1": "customer",
  "_col2": "mumbai",
  "_col3": "mumbai",
  "_col4": "micro",
  "_col5": "active",
  "_col6": "sensitive",
  "_col7": "FALSE",
  "_col8": "IN",
  "_col9": "5.6.6",
  "_col10": "FALSE",
  "_col11": "FALSE",
  "_col12": "FALSE",
  "_col13": 1,
  "_col14": 0,
  "_col15": 0,
  "_col16": "mumbai",
  "_col17": "2015-11-12 14:28:16",
  "_col18": "2017-05-08",
  "_col19": null,
  "_col20": "2015-11-12",
  "_col21": null,
  "_col22": null
}
Schema spec
{
  "schemaName": "user_base_fact",
  "dimensionFieldSpecs": [
    { "name": "entity_id", "dataType": "STRING" },
    { "name": "entity_type", "dataType": "STRING" },
    { "name": "primary_city", "dataType": "STRING" },
    { "name": "last_ride_city", "dataType": "STRING" },
    { "name": "prefcat", "dataType": "STRING" },
    { "name": "activity", "dataType": "STRING" },
    { "name": "price_sensitivity", "dataType": "STRING" },
    { "name": "is_corporate", "dataType": "STRING" },
    { "name": "country_filter", "dataType": "STRING" },
    { "name": "app_version", "dataType": "STRING" },
    { "name": "select_active", "dataType": "STRING" },
    { "name": "sharepass_active", "dataType": "STRING" },
    { "name": "cabpass_active", "dataType": "STRING" },
    { "name": "fallback_city", "dataType": "STRING" },
    { "name": "first_ride_timestamp", "dataType": "STRING" },
    { "name": "auto_first_ride_date", "dataType": "STRING" },
    { "name": "bike_first_ride_date", "dataType": "STRING" },
    { "name": "mmpp_first_ride_date", "dataType": "STRING" },
    { "name": "offer_tag", "dataType": "STRING" },
    { "name": "offer", "dataType": "STRING" }
  ],
  "metricFieldSpecs": [
    { "name": "select_bought", "dataType": "INT" },
    { "name": "sharepass_bought", "dataType": "INT" },
    { "name": "cabpass_bought", "dataType": "INT" }
  ]
}
Table spec
{
  "OFFLINE": {
    "tableName": "user_base_fact_OFFLINE",
    "tableType": "OFFLINE",
    "segmentsConfig": {
      "segmentPushFrequency": "DAILY",
      "replication": "1",
      "replicasPerPartition": "1",
      "segmentPushType": "APPEND",
      "schemaName": "user_base_fact"
    },
    "tenants": {
      "broker": "DefaultTenant",
      "server": "DefaultTenant"
    },
    "tableIndexConfig": {
      "invertedIndexColumns": [],
      "noDictionaryColumns": [],
      "rangeIndexColumns": [],
      "rangeIndexVersion": 2,
      "autoGeneratedInvertedIndex": false,
      "createInvertedIndexDuringSegmentGeneration": false,
      "sortedColumn": [],
      "bloomFilterColumns": [],
      "loadMode": "MMAP",
      "onHeapDictionaryColumns": [],
      "varLengthDictionaryColumns": [],
      "enableDefaultStarTree": false,
      "enableDynamicStarTreeCreation": false,
      "aggregateMetrics": false,
      "nullHandlingEnabled": false
    },
    "metadata": {},
    "quota": {},
    "routing": {},
    "query": {},
    "fieldConfigList": [],
    "ingestionConfig": {
      "transformConfigs": []
    },
    "isDimTable": false
  }
}
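Schema fields do not seem to match ORC field names. The ORC file's columns are named _col0 through _col22, while the schema expects names such as entity_id and entity_type, so nothing lines up by name.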
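For anyone hitting the same symptom, here is a rough sketch of how the mismatch could be confirmed before running the job. It compares the field names in a Pinot schema JSON with the keys of one record dumped by orc-tools. The helper name `unmatched_fields` is made up for illustration and is not part of Pinot, and the schema and record below are trimmed copies of the ones posted above.

```python
import json


def unmatched_fields(schema, orc_record):
    """Return schema field names with no same-named column in the ORC record."""
    names = []
    # Field-spec lists that a Pinot schema JSON may contain.
    for key in ("dimensionFieldSpecs", "metricFieldSpecs", "dateTimeFieldSpecs"):
        names.extend(spec["name"] for spec in schema.get(key, []))
    return [name for name in names if name not in orc_record]


# Trimmed copy of the schema from this issue.
schema = json.loads("""
{
  "schemaName": "user_base_fact",
  "dimensionFieldSpecs": [
    { "name": "entity_id", "dataType": "STRING" },
    { "name": "entity_type", "dataType": "STRING" }
  ],
  "metricFieldSpecs": [
    { "name": "select_bought", "dataType": "INT" }
  ]
}
""")

# Trimmed copy of one record as printed by orc-tools.
orc_record = json.loads('{"_col0": "750", "_col1": "customer", "_col13": 1}')

print(unmatched_fields(schema, orc_record))
# prints ['entity_id', 'entity_type', 'select_bought']
```

If the list is non-empty, those schema columns have no same-named source column in the file, which is consistent with the all-null table described above.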
Thanks @KKcorps. Changing the column names worked!