apache / hudi

Upserts, Deletes And Incremental Processing on Big Data.
https://hudi.apache.org/
Apache License 2.0

Hive partition column defaults to String #493

Closed arw357 closed 5 years ago

arw357 commented 6 years ago

At Hive table creation time, the partition column defaults to the "String" type because the partition column's type is looked up in a map whose keys are surrounded by backticks ("`"), so the plain column name is never found.

This is where the partition key type is resolved in the code: https://github.com/uber/hudi/blob/5cb28e7b1ff09149ca88fc761d0e19369c6c3505/hoodie-hive/src/main/java/com/uber/hoodie/hive/util/SchemaUtil.java#L410

That lookup uses the hiveSchema generated by the convertParquetSchemaToHiveSchema method, which in turn calls hiveCompatibleFieldName and therefore returns backticked keys.
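
To make the mismatch concrete, here is a minimal sketch (hypothetical map contents and helper, not the actual SchemaUtil code): the schema map stores backticked field names, the partition column is looked up by its plain name, and the fallback type wins.

// Minimal sketch of the lookup miss; names and values are illustrative only.
object PartitionTypeLookupSketch {
  // convertParquetSchemaToHiveSchema wraps every field name in backticks.
  val hiveSchema: Map[String, String] =
    Map("`datestr`" -> "string", "`invoice_id`" -> "bigint")

  // The partition field is looked up by its plain name, so the key is never found
  // and the default "String" ends up in the CREATE TABLE statement.
  def partitionKeyType(field: String): String =
    hiveSchema.getOrElse(field, "String")

  def main(args: Array[String]): Unit = {
    println(partitionKeyType("datestr"))   // String  <- always falls back to the default
    println(partitionKeyType("`datestr`")) // string  <- only the backticked key matches
  }
}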

vinothchandar commented 6 years ago

@bvaradar :) Hive expert..

AndrewKL commented 5 years ago

I ran into this as well

scala> CompactApInvoicesAll.saveToHudiTable(
     |       deltas,
     |       tableName = "ap_invoices_all_hudi",
     |       tablePath = "hdfs:///tmp/hudi-test/",
     |       primaryKey = "invoice_id",
     |       pkOrderingCol = "capture_timestamp",
     |       enableHiveSync = true
     |     )
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".                
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
com.uber.hoodie.hive.HoodieHiveSyncException: Failed in executing SQL CREATE EXTERNAL TABLE  IF NOT EXISTS default.ap_invoices_all_hudi( `_hoodie_commit_time` string, `_hoodie_commit_seqno` string, `_hoodie_record_key` string, `_hoodie_partition_path` string, `_hoodie_file_name` string, `invoice_id` string, `last_update_date` bigint, `last_updated_by` string, `vendor_id` string, `invoice_num` string, `set_of_books_id` string, `invoice_currency_code` string, `payment_currency_code` string, `payment_cross_rate` string, `invoice_amount` string, `vendor_site_id` string, `amount_paid` string, `discount_amount_taken` string, `invoice_date` bigint, `source` string, `invoice_type_lookup_code` string, `description` string, `batch_id` string, `amount_applicable_to_discount` string, `tax_amount` string, `terms_id` string, `terms_date` bigint, `payment_method_lookup_code` string, `pay_group_lookup_code` string, `accts_pay_code_combination_id` string, `payment_status_flag` string, `creation_date` bigint, `created_by` string, `base_amount` string, `vat_code` string, `last_update_login` string, `exclusive_payment_flag` string, `po_header_id` string, `freight_amount` string, `goods_received_date` bigint, `invoice_received_date` bigint, `voucher_num` string, `approved_amount` string, `recurring_payment_id` string, `exchange_rate` string, `exchange_rate_type` string, `exchange_date` bigint, `earliest_settlement_date` bigint, `original_prepayment_amount` string, `doc_sequence_id` string, `doc_sequence_value` string, `doc_category_code` string, `attribute1` string, `attribute2` string, `attribute3` string, `attribute4` string, `attribute5` string, `attribute6` string, `attribute7` string, `attribute8` string, `attribute9` string, `attribute10` string, `attribute11` string, `attribute12` string, `attribute13` string, `attribute14` string, `attribute15` string, `attribute_category` string, `approval_status` string, `approval_description` string, `invoice_distribution_total` string, `posting_status` string, `prepay_flag` string, `authorized_by` string, `cancelled_date` bigint, `cancelled_by` string, `cancelled_amount` string, `temp_cancelled_amount` string, `project_accounting_context` string, `ussgl_transaction_code` string, `ussgl_trx_code_context` string, `project_id` string, `task_id` string, `expenditure_type` string, `expenditure_item_date` bigint, `pa_quantity` string, `expenditure_organization_id` string, `pa_default_dist_ccid` string, `vendor_prepay_amount` string, `payment_amount_total` string, `awt_flag` string, `awt_group_id` string, `reference_1` string, `reference_2` string, `org_id` string, `pre_withholding_amount` string, `global_attribute_category` string, `global_attribute1` string, `global_attribute2` string, `global_attribute3` string, `global_attribute4` string, `global_attribute5` string, `global_attribute6` string, `global_attribute7` string, `global_attribute8` string, `global_attribute9` string, `global_attribute10` string, `global_attribute11` string, `global_attribute12` string, `global_attribute13` string, `global_attribute14` string, `global_attribute15` string, `global_attribute16` string, `global_attribute17` string, `global_attribute18` string, `global_attribute19` string, `global_attribute20` string, `auto_tax_calc_flag` string, `payment_cross_rate_type` string, `payment_cross_rate_date` bigint, `pay_curr_invoice_amount` string, `mrc_base_amount` string, `mrc_exchange_rate` string, `mrc_exchange_rate_type` string, `mrc_exchange_date` string, `mrc_posting_status` string, 
`paid_on_behalf_employee_id` string, `amt_due_ccard_company` string, `amt_due_employee` string, `gl_date` bigint, `award_id` string, `approval_ready_flag` string, `approval_iteration` string, `wfapproval_status` string, `requester_id` string, `validation_request_id` string, `validated_tax_amount` string, `quick_credit` string, `credited_invoice_id` string, `distribution_set_id` string, `application_id` string, `product_table` string, `reference_key1` string, `reference_key2` string, `reference_key3` string, `reference_key4` string, `reference_key5` string, `total_tax_amount` string, `self_assessed_tax_amount` string, `tax_related_invoice_id` string, `trx_business_category` string, `user_defined_fisc_class` string, `taxation_country` string, `document_sub_type` string, `supplier_tax_invoice_number` string, `supplier_tax_invoice_date` bigint, `supplier_tax_exchange_rate` string, `tax_invoice_recording_date` bigint, `tax_invoice_internal_seq` string, `legal_entity_id` string, `historical_flag` string, `force_revalidation_flag` string, `bank_charge_bearer` string, `remittance_message1` string, `remittance_message2` string, `remittance_message3` string, `unique_remittance_identifier` string, `uri_check_digit` string, `settlement_priority` string, `payment_reason_code` string, `payment_reason_comments` string, `payment_method_code` string, `delivery_channel_code` string, `quick_po_header_id` string, `net_of_retainage_flag` string, `release_amount_net_of_tax` string, `control_amount` string, `party_id` string, `party_site_id` string, `pay_proc_trxn_type_code` string, `payment_function` string, `cust_registration_code` string, `cust_registration_number` string, `port_of_entry_code` string, `external_bank_account_id` string, `vendor_contact_id` string, `internal_contact_email` string, `disc_is_inv_less_tax_flag` string, `exclude_freight_from_discount` string, `pay_awt_group_id` string, `original_invoice_amount` string, `dispute_reason` string, `remit_to_supplier_name` string, `remit_to_supplier_id` string, `remit_to_supplier_site` string, `remit_to_supplier_site_id` string, `relationship_id` string, `capture_timestamp` bigint, `integ_key` string, `op_type` string, `updatedby_user` string) PARTITIONED BY ( String) ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' STORED AS INPUTFORMAT 'com.uber.hoodie.hadoop.HoodieInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat' LOCATION 'hdfs:/tmp/hudi-test'
  at com.uber.hoodie.hive.HoodieHiveClient.updateHiveSQL(HoodieHiveClient.java:457)
  at com.uber.hoodie.hive.HoodieHiveClient.createTable(HoodieHiveClient.java:260)
  at com.uber.hoodie.hive.HiveSyncTool.syncSchema(HiveSyncTool.java:129)
  at com.uber.hoodie.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:96)
  at com.uber.hoodie.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:68)
  at com.uber.hoodie.DefaultSource.syncHive(DefaultSource.scala:272)
  at com.uber.hoodie.DefaultSource.createRelation(DefaultSource.scala:247)
  at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
  at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
  at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
  at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
  at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:654)
  at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:654)
  at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
  at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:654)
  at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:273)
  at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:267)
  at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:225)
  at CompactApInvoicesAll$.saveToHudiTable(CompactApInvoicesAll.scala:54)
  ... 56 elided
Caused by: org.apache.hive.service.cli.HiveSQLException: Error while compiling statement: FAILED: ParseException line 1:5717 cannot recognize input near ')' 'ROW' 'FORMAT' in column type
  at org.apache.hive.jdbc.Utils.verifySuccess(Utils.java:256)
  at org.apache.hive.jdbc.Utils.verifySuccessWithInfo(Utils.java:242)
  at org.apache.hive.jdbc.HiveStatement.execute(HiveStatement.java:254)
  at org.apache.commons.dbcp.DelegatingStatement.execute(DelegatingStatement.java:264)
  at org.apache.commons.dbcp.DelegatingStatement.execute(DelegatingStatement.java:264)
  at com.uber.hoodie.hive.HoodieHiveClient.updateHiveSQL(HoodieHiveClient.java:455)
  ... 82 more
Caused by: org.apache.hive.service.cli.HiveSQLException: Error while compiling statement: FAILED: ParseException line 1:5717 cannot recognize input near ')' 'ROW' 'FORMAT' in column type
  at org.apache.hive.service.cli.operation.Operation.toSQLException(Operation.java:380)
  at org.apache.hive.service.cli.operation.SQLOperation.prepare(SQLOperation.java:206)
  at org.apache.hive.service.cli.operation.SQLOperation.runInternal(SQLOperation.java:290)
  at org.apache.hive.service.cli.operation.Operation.run(Operation.java:320)
  at org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementInternal(HiveSessionImpl.java:530)
  at org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementAsync(HiveSessionImpl.java:517)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:498)
  at org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:78)
  at org.apache.hive.service.cli.session.HiveSessionProxy.access$000(HiveSessionProxy.java:36)
  at org.apache.hive.service.cli.session.HiveSessionProxy$1.run(HiveSessionProxy.java:63)
  at java.security.AccessController.doPrivileged(Native Method)
  at javax.security.auth.Subject.doAs(Subject.java:422)
  at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1840)
  at org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:59)
  at com.sun.proxy.$Proxy41.executeStatementAsync(Unknown Source)
  at org.apache.hive.service.cli.CLIService.executeStatementAsync(CLIService.java:310)
  at org.apache.hive.service.cli.thrift.ThriftCLIService.ExecuteStatement(ThriftCLIService.java:530)
  at org.apache.hive.service.rpc.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1437)
  at org.apache.hive.service.rpc.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1422)
  at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)
  at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
  at org.apache.hive.service.auth.TSetIpAddressProcessor.process(TSetIpAddressProcessor.java:56)
  at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:286)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
  at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.RuntimeException: org.apache.hadoop.hive.ql.parse.ParseException:line 1:5717 cannot recognize input near ')' 'ROW' 'FORMAT' in column type
  at org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:211)
  at org.apache.hadoop.hive.ql.parse.ParseUtils.parse(ParseUtils.java:77)
  at org.apache.hadoop.hive.ql.parse.ParseUtils.parse(ParseUtils.java:70)
  at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:468)
  at org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:1317)
  at org.apache.hadoop.hive.ql.Driver.compileAndRespond(Driver.java:1295)
  at org.apache.hive.service.cli.operation.SQLOperation.prepare(SQLOperation.java:204)
  ... 27 more
CREATE EXTERNAL TABLE  IF NOT EXISTS default.ap_invoices_all_hudi( `_hoodie_commit_time` string, `_hoodie_commit_seqno` string, `_hoodie_record_key` string, `_hoodie_partition_path` string, `_hoodie_file_name` string, ...`integ_key` string, `op_type` string, `updatedby_user` string) 
PARTITIONED BY ( String) 
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' 
STORED AS INPUTFORMAT 'com.uber.hoodie.hadoop.HoodieInputFormat' 
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat' 
LOCATION 'hdfs:/tmp/hudi-test'
bvaradar commented 5 years ago

Will look into this tonight.

bvaradar commented 5 years ago

@AndrewKL : BTW, can you let us know the Hive and Hadoop versions you are using?

AndrewKL commented 5 years ago

Here's some more detailed info:

- AWS EMR 5.19.0
- Spark 2.3.1
- Hadoop 2.8.5
- apache-hive-2.3.3-amzn-1-src
- Glue Catalog enabled

val fs: FileSystem = FileSystem.get(delta.sparkSession.sparkContext.hadoopConfiguration)

    val writer = delta.write.format("com.uber.hoodie") // specify the hoodie source
      //      .option("hoodie.insert.shuffle.parallelism", "2")
      //      .option("hoodie.upsert.shuffle.parallelism", "2")  // TODO: come up with a sane level of parallelism
      //      .option(DataSourceWriteOptions.STORAGE_TYPE_OPT_KEY, tableType)  // default: copy-on-write
      .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, primaryKey)
      //.option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, "partition")  // this data is unpartitioned
      .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, pkOrderingCol)
      .option(HoodieWriteConfig.TABLE_NAME, tableName)
      .option(DataSourceWriteOptions.HIVE_TABLE_OPT_KEY, tableName)
      //      .option(DataSourceWriteOptions.HIVE_URL_OPT_KEY, hiveJdbcUrl)
      //      .option(DataSourceWriteOptions.HIVE_USER_OPT_KEY, hiveUser)
      //      .option(DataSourceWriteOptions.HIVE_PASS_OPT_KEY, hivePass)
      .option(DataSourceWriteOptions.HIVE_SYNC_ENABLED_OPT_KEY, enableHiveSync) // this crashes when enabled, because the data is unpartitioned
      .mode(SaveMode.Append)
    writer.save(tablePath)
    val commitTime = HoodieDataSourceHelpers.latestCommit(fs, tablePath)
    println("commit at instant time :" + commitTime)

    writer.save(tablePath)

AndrewKL commented 5 years ago

Are unpartitioned tables supported by hudi?

bvaradar commented 5 years ago

Sorry for the delay.

@arw357 : Thanks for debugging the issue. I have a pending PR ( https://github.com/uber/hudi/pull/515) that fixes it. Please use it if you are planning to try hoodie.

@AndrewKL : Hoodie is supposed to work with both partitioned and non-partitioned Hive tables. There were some minor issues with partition-path handling, which this PR (https://github.com/uber/hudi/pull/515) also addresses. I have also added support in HoodieJavaApp and in the integration tests to exercise non-partitioned tables. Please try it out when you get a chance.
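
For anyone else hitting this, a minimal sketch of what a non-partitioned write with Hive sync might look like once the PR lands; the key-generator and partition-extractor option keys and class names below are assumptions to verify against PR #515, not confirmed API:

// Sketch only: the keygen/extractor option keys and class names are assumptions, check PR #515.
val writer = delta.write.format("com.uber.hoodie")
  .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, primaryKey)
  .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, pkOrderingCol)
  // Assumed: a key generator that emits an empty partition path for non-partitioned data.
  .option(DataSourceWriteOptions.KEYGENERATOR_CLASS_OPT_KEY,
          "com.uber.hoodie.NonpartitionedKeyGenerator")
  .option(HoodieWriteConfig.TABLE_NAME, tableName)
  .option(DataSourceWriteOptions.HIVE_SYNC_ENABLED_OPT_KEY, "true")
  .option(DataSourceWriteOptions.HIVE_TABLE_OPT_KEY, tableName)
  // Assumed: an extractor that tells Hive sync there are no partition columns.
  .option(DataSourceWriteOptions.HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY,
          "com.uber.hoodie.hive.NonPartitionedExtractor")
  .mode(SaveMode.Append)
writer.save(tablePath)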

AndrewKL commented 5 years ago

I'll update my code and give it a shot

AndrewKL commented 5 years ago

I'm still in the middle of testing this, but I found another minor bug: the HiveSyncTool currently requires a partition column.

com.uber.hoodie.com.beust.jcommander.ParameterException: The following options are required: [--user], [--partitioned-by], [--jdbc-url], [--pass]

vinothchandar commented 5 years ago

@AndrewKL non-partitioned tables should be supported now.. (@bvaradar fyi).. close this?