@bvaradar :) Hive expert..
I ran into this as well
scala> CompactApInvoicesAll.saveToHudiTable(
| deltas,
| tableName = "ap_invoices_all_hudi",
| tablePath = "hdfs:///tmp/hudi-test/",
| primaryKey = "invoice_id",
| pkOrderingCol = "capture_timestamp",
| enableHiveSync = true
| )
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
com.uber.hoodie.hive.HoodieHiveSyncException: Failed in executing SQL CREATE EXTERNAL TABLE IF NOT EXISTS default.ap_invoices_all_hudi( `_hoodie_commit_time` string, `_hoodie_commit_seqno` string, `_hoodie_record_key` string, `_hoodie_partition_path` string, `_hoodie_file_name` string, `invoice_id` string, `last_update_date` bigint, `last_updated_by` string, `vendor_id` string, `invoice_num` string, `set_of_books_id` string, `invoice_currency_code` string, `payment_currency_code` string, `payment_cross_rate` string, `invoice_amount` string, `vendor_site_id` string, `amount_paid` string, `discount_amount_taken` string, `invoice_date` bigint, `source` string, `invoice_type_lookup_code` string, `description` string, `batch_id` string, `amount_applicable_to_discount` string, `tax_amount` string, `terms_id` string, `terms_date` bigint, `payment_method_lookup_code` string, `pay_group_lookup_code` string, `accts_pay_code_combination_id` string, `payment_status_flag` string, `creation_date` bigint, `created_by` string, `base_amount` string, `vat_code` string, `last_update_login` string, `exclusive_payment_flag` string, `po_header_id` string, `freight_amount` string, `goods_received_date` bigint, `invoice_received_date` bigint, `voucher_num` string, `approved_amount` string, `recurring_payment_id` string, `exchange_rate` string, `exchange_rate_type` string, `exchange_date` bigint, `earliest_settlement_date` bigint, `original_prepayment_amount` string, `doc_sequence_id` string, `doc_sequence_value` string, `doc_category_code` string, `attribute1` string, `attribute2` string, `attribute3` string, `attribute4` string, `attribute5` string, `attribute6` string, `attribute7` string, `attribute8` string, `attribute9` string, `attribute10` string, `attribute11` string, `attribute12` string, `attribute13` string, `attribute14` string, `attribute15` string, `attribute_category` string, `approval_status` string, `approval_description` string, `invoice_distribution_total` string, `posting_status` string, `prepay_flag` string, `authorized_by` string, `cancelled_date` bigint, `cancelled_by` string, `cancelled_amount` string, `temp_cancelled_amount` string, `project_accounting_context` string, `ussgl_transaction_code` string, `ussgl_trx_code_context` string, `project_id` string, `task_id` string, `expenditure_type` string, `expenditure_item_date` bigint, `pa_quantity` string, `expenditure_organization_id` string, `pa_default_dist_ccid` string, `vendor_prepay_amount` string, `payment_amount_total` string, `awt_flag` string, `awt_group_id` string, `reference_1` string, `reference_2` string, `org_id` string, `pre_withholding_amount` string, `global_attribute_category` string, `global_attribute1` string, `global_attribute2` string, `global_attribute3` string, `global_attribute4` string, `global_attribute5` string, `global_attribute6` string, `global_attribute7` string, `global_attribute8` string, `global_attribute9` string, `global_attribute10` string, `global_attribute11` string, `global_attribute12` string, `global_attribute13` string, `global_attribute14` string, `global_attribute15` string, `global_attribute16` string, `global_attribute17` string, `global_attribute18` string, `global_attribute19` string, `global_attribute20` string, `auto_tax_calc_flag` string, `payment_cross_rate_type` string, `payment_cross_rate_date` bigint, `pay_curr_invoice_amount` string, `mrc_base_amount` string, `mrc_exchange_rate` string, `mrc_exchange_rate_type` string, `mrc_exchange_date` string, `mrc_posting_status` string, 
`paid_on_behalf_employee_id` string, `amt_due_ccard_company` string, `amt_due_employee` string, `gl_date` bigint, `award_id` string, `approval_ready_flag` string, `approval_iteration` string, `wfapproval_status` string, `requester_id` string, `validation_request_id` string, `validated_tax_amount` string, `quick_credit` string, `credited_invoice_id` string, `distribution_set_id` string, `application_id` string, `product_table` string, `reference_key1` string, `reference_key2` string, `reference_key3` string, `reference_key4` string, `reference_key5` string, `total_tax_amount` string, `self_assessed_tax_amount` string, `tax_related_invoice_id` string, `trx_business_category` string, `user_defined_fisc_class` string, `taxation_country` string, `document_sub_type` string, `supplier_tax_invoice_number` string, `supplier_tax_invoice_date` bigint, `supplier_tax_exchange_rate` string, `tax_invoice_recording_date` bigint, `tax_invoice_internal_seq` string, `legal_entity_id` string, `historical_flag` string, `force_revalidation_flag` string, `bank_charge_bearer` string, `remittance_message1` string, `remittance_message2` string, `remittance_message3` string, `unique_remittance_identifier` string, `uri_check_digit` string, `settlement_priority` string, `payment_reason_code` string, `payment_reason_comments` string, `payment_method_code` string, `delivery_channel_code` string, `quick_po_header_id` string, `net_of_retainage_flag` string, `release_amount_net_of_tax` string, `control_amount` string, `party_id` string, `party_site_id` string, `pay_proc_trxn_type_code` string, `payment_function` string, `cust_registration_code` string, `cust_registration_number` string, `port_of_entry_code` string, `external_bank_account_id` string, `vendor_contact_id` string, `internal_contact_email` string, `disc_is_inv_less_tax_flag` string, `exclude_freight_from_discount` string, `pay_awt_group_id` string, `original_invoice_amount` string, `dispute_reason` string, `remit_to_supplier_name` string, `remit_to_supplier_id` string, `remit_to_supplier_site` string, `remit_to_supplier_site_id` string, `relationship_id` string, `capture_timestamp` bigint, `integ_key` string, `op_type` string, `updatedby_user` string) PARTITIONED BY ( String) ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' STORED AS INPUTFORMAT 'com.uber.hoodie.hadoop.HoodieInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat' LOCATION 'hdfs:/tmp/hudi-test'
at com.uber.hoodie.hive.HoodieHiveClient.updateHiveSQL(HoodieHiveClient.java:457)
at com.uber.hoodie.hive.HoodieHiveClient.createTable(HoodieHiveClient.java:260)
at com.uber.hoodie.hive.HiveSyncTool.syncSchema(HiveSyncTool.java:129)
at com.uber.hoodie.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:96)
at com.uber.hoodie.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:68)
at com.uber.hoodie.DefaultSource.syncHive(DefaultSource.scala:272)
at com.uber.hoodie.DefaultSource.createRelation(DefaultSource.scala:247)
at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:654)
at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:654)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:654)
at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:273)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:267)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:225)
at CompactApInvoicesAll$.saveToHudiTable(CompactApInvoicesAll.scala:54)
... 56 elided
Caused by: org.apache.hive.service.cli.HiveSQLException: Error while compiling statement: FAILED: ParseException line 1:5717 cannot recognize input near ')' 'ROW' 'FORMAT' in column type
at org.apache.hive.jdbc.Utils.verifySuccess(Utils.java:256)
at org.apache.hive.jdbc.Utils.verifySuccessWithInfo(Utils.java:242)
at org.apache.hive.jdbc.HiveStatement.execute(HiveStatement.java:254)
at org.apache.commons.dbcp.DelegatingStatement.execute(DelegatingStatement.java:264)
at org.apache.commons.dbcp.DelegatingStatement.execute(DelegatingStatement.java:264)
at com.uber.hoodie.hive.HoodieHiveClient.updateHiveSQL(HoodieHiveClient.java:455)
... 82 more
Caused by: org.apache.hive.service.cli.HiveSQLException: Error while compiling statement: FAILED: ParseException line 1:5717 cannot recognize input near ')' 'ROW' 'FORMAT' in column type
at org.apache.hive.service.cli.operation.Operation.toSQLException(Operation.java:380)
at org.apache.hive.service.cli.operation.SQLOperation.prepare(SQLOperation.java:206)
at org.apache.hive.service.cli.operation.SQLOperation.runInternal(SQLOperation.java:290)
at org.apache.hive.service.cli.operation.Operation.run(Operation.java:320)
at org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementInternal(HiveSessionImpl.java:530)
at org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementAsync(HiveSessionImpl.java:517)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:78)
at org.apache.hive.service.cli.session.HiveSessionProxy.access$000(HiveSessionProxy.java:36)
at org.apache.hive.service.cli.session.HiveSessionProxy$1.run(HiveSessionProxy.java:63)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1840)
at org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:59)
at com.sun.proxy.$Proxy41.executeStatementAsync(Unknown Source)
at org.apache.hive.service.cli.CLIService.executeStatementAsync(CLIService.java:310)
at org.apache.hive.service.cli.thrift.ThriftCLIService.ExecuteStatement(ThriftCLIService.java:530)
at org.apache.hive.service.rpc.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1437)
at org.apache.hive.service.rpc.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1422)
at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)
at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
at org.apache.hive.service.auth.TSetIpAddressProcessor.process(TSetIpAddressProcessor.java:56)
at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:286)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.RuntimeException: org.apache.hadoop.hive.ql.parse.ParseException:line 1:5717 cannot recognize input near ')' 'ROW' 'FORMAT' in column type
at org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:211)
at org.apache.hadoop.hive.ql.parse.ParseUtils.parse(ParseUtils.java:77)
at org.apache.hadoop.hive.ql.parse.ParseUtils.parse(ParseUtils.java:70)
at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:468)
at org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:1317)
at org.apache.hadoop.hive.ql.Driver.compileAndRespond(Driver.java:1295)
at org.apache.hive.service.cli.operation.SQLOperation.prepare(SQLOperation.java:204)
... 27 more
CREATE EXTERNAL TABLE IF NOT EXISTS default.ap_invoices_all_hudi( `_hoodie_commit_time` string, `_hoodie_commit_seqno` string, `_hoodie_record_key` string, `_hoodie_partition_path` string, `_hoodie_file_name` string, ...`integ_key` string, `op_type` string, `updatedby_user` string)
PARTITIONED BY ( String)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT 'com.uber.hoodie.hadoop.HoodieInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION 'hdfs:/tmp/hudi-test'
Will look into this tonight.
@AndrewKL : BTW, can you let us know the Hive and Hadoop versions you are using?
Here's some more detailed info: AWS EMR 5.19.0, Spark 2.3.1, Hadoop 2.8.5, apache-hive-2.3.3-amzn-1-src, Glue catalog enabled.
import org.apache.hadoop.fs.FileSystem
import org.apache.spark.sql.{DataFrame, SaveMode}
import com.uber.hoodie.{DataSourceWriteOptions, HoodieDataSourceHelpers}
import com.uber.hoodie.config.HoodieWriteConfig

object CompactApInvoicesAll {
  def saveToHudiTable(
      delta: DataFrame,
      tableName: String,
      tablePath: String,
      primaryKey: String,
      pkOrderingCol: String,
      enableHiveSync: Boolean): Unit = {
    val fs: FileSystem = FileSystem.get(delta.sparkSession.sparkContext.hadoopConfiguration)
    val writer = delta.write.format("com.uber.hoodie") // specify the hoodie source
      // .option("hoodie.insert.shuffle.parallelism", "2")
      // .option("hoodie.upsert.shuffle.parallelism", "2") // TODO: come up with a sane level of parallelism
      // .option(DataSourceWriteOptions.STORAGE_TYPE_OPT_KEY, tableType) // default: copy on write
      .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, primaryKey)
      // .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, "partition") // this data is unpartitioned
      .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, pkOrderingCol)
      .option(HoodieWriteConfig.TABLE_NAME, tableName)
      .option(DataSourceWriteOptions.HIVE_TABLE_OPT_KEY, tableName)
      // .option(DataSourceWriteOptions.HIVE_URL_OPT_KEY, hiveJdbcUrl)
      // .option(DataSourceWriteOptions.HIVE_USER_OPT_KEY, hiveUser)
      // .option(DataSourceWriteOptions.HIVE_PASS_OPT_KEY, hivePass)
      .option(DataSourceWriteOptions.HIVE_SYNC_ENABLED_OPT_KEY, enableHiveSync) // this crashes if enabled, due to unpartitioned data
      .mode(SaveMode.Append)
    writer.save(tablePath)
    val commitTime = HoodieDataSourceHelpers.latestCommit(fs, tablePath)
    println("commit at instant time :" + commitTime)
    writer.save(tablePath)
  }
}
Are unpartitioned tables supported by hudi?
Sorry for the delay.
@arw357 : Thanks for debugging the issue. I have a pending PR ( https://github.com/uber/hudi/pull/515) that fixes it. Please use it if you are planning to try hoodie.
@AndrewKL : Hoodie is supposed to work with both partitioned and non-partitioned hive tables. There were some minor issues with partition-path handling which this PR (https://github.com/uber/hudi/pull/515) addresses. I have also added support in HoodieJavaApp and in integration-test to test non-partitioned table. Please try it out when you get a chance.
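For a non-partitioned dataset, the writer configuration could look roughly like the sketch below (reusing the variables from the saveToHudiTable snippet above). Note the key-generator and partition-extractor class names, and the partition-related hive-sync option keys, are my assumptions based on that PR, so please verify them against it:

// Sketch only: the non-partitioned key generator / partition extractor class names
// and the partition-related option keys are assumptions based on PR 515.
val writer = delta.write.format("com.uber.hoodie")
  .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, primaryKey)
  .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, pkOrderingCol)
  // no PARTITIONPATH_FIELD_OPT_KEY; instead point at a key generator that
  // emits an empty partition path for every record
  .option(DataSourceWriteOptions.KEYGENERATOR_CLASS_OPT_KEY,
    "com.uber.hoodie.NonpartitionedKeyGenerator")
  .option(HoodieWriteConfig.TABLE_NAME, tableName)
  .option(DataSourceWriteOptions.HIVE_SYNC_ENABLED_OPT_KEY, enableHiveSync)
  .option(DataSourceWriteOptions.HIVE_TABLE_OPT_KEY, tableName)
  // tell hive-sync there are no partition fields and how to read partition values
  .option(DataSourceWriteOptions.HIVE_PARTITION_FIELDS_OPT_KEY, "")
  .option(DataSourceWriteOptions.HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY,
    "com.uber.hoodie.hive.NonPartitionedExtractor")
  .mode(SaveMode.Append)
writer.save(tablePath)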
I'll update my code and give it a shot
I'm still in the middle of testing this, but I found another minor bug. Currently the HiveSyncTool requires a partition col.
com.uber.hoodie.com.beust.jcommander.ParameterException: The following options are required: [--user], [--partitioned-by], [--jdbc-url], [--pass]
@AndrewKL non-partitioned tables should be supported now.. (@bvaradar fyi).. close this?
The Hive partition column defaults to the "String" type at table creation time because the partition column's type is looked up in a map whose keys are surrounded by backticks ("`"), so the key is never found.
This is where the partition key type is found by the code : https://github.com/uber/hudi/blob/5cb28e7b1ff09149ca88fc761d0e19369c6c3505/hoodie-hive/src/main/java/com/uber/hoodie/hive/util/SchemaUtil.java#L410
This uses the hiveSchema generated by the convertParquetSchemaToHiveSchema method, which in turn relies on the hiveCompatibleFieldName method and therefore returns backticked keys.
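To illustrate the mismatch with hypothetical names (this is not Hudi's actual code): the schema map is keyed by backticked field names, while the partition field is looked up by its raw name, so the lookup misses and the type silently falls back to the default:

// Minimal sketch of the lookup mismatch described above; names are illustrative.
object PartitionTypeLookup {
  def main(args: Array[String]): Unit = {
    // convertParquetSchemaToHiveSchema wraps every field name in backticks
    // (via hiveCompatibleFieldName), e.g. `datestr` -> string
    val hiveSchema: Map[String, String] = Map("`datestr`" -> "string")

    // ...but the partition field is looked up by its raw name, without backticks,
    // so the key is never found and the type falls back to the default
    val partitionField = "datestr"
    val partitionType = hiveSchema.getOrElse(partitionField, "String")

    println(partitionType) // prints the fallback "String", not the real Hive type
  }
}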