GoogleCloudDataproc / hive-bigquery-storage-handler

Hive Storage Handler for interoperability between BigQuery and Apache Hive
Apache License 2.0

Records are not inserted to BigQuery table #22

Open vishal2535 opened 2 years ago

vishal2535 commented 2 years ago

Facing an issue with record insertion. Insert statements return a success message, but no records are inserted into the BQ table. At insert time, the handler first writes records to temp files in the GCS bucket that we set at table-creation time. I can see the generated temp files in the bucket, but they are never pushed on to the BQ table. I am using Hive 2.3.9 with the compatible Hadoop version 2.10.1, and testing with MR as the execution engine. MR somehow does not seem to call OutputCommitter.
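For context, the two-phase commit that MR is expected to drive can be sketched as follows. This is an illustrative simulation with hypothetical names, not the handler's actual classes: task output lands in a temp area, and only a job-level commit publishes it to the destination table. If the framework never calls the job commit, the temp files in GCS stay where they are and nothing reaches BigQuery, which matches the symptom described above.

```python
# Minimal sketch (hypothetical names) of the OutputCommitter idea:
# tasks write to a temp location; only the job-level commit publishes output.
class SketchCommitter:
    def __init__(self):
        self.temp_files = []   # stands in for files under mapred.bq.temp.gcs.path
        self.table_rows = []   # stands in for rows visible in the BigQuery table

    def write_record(self, record):
        # A map task writes to a temp file, not to the final table.
        self.temp_files.append(record)

    def commit_job(self):
        # The framework is supposed to call this after all tasks succeed;
        # only then are the temp files loaded into the destination table.
        self.table_rows.extend(self.temp_files)
        self.temp_files.clear()

committer = SketchCommitter()
committer.write_record(("Vishal", 33))
# Without commit_job(), the record sits in the temp area only:
print(committer.table_rows)   # []
committer.commit_job()
print(committer.table_rows)   # [('Vishal', 33)]
```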

The customized configuration files for my project are attached, along with the Hive-generated logs for the INSERT operation: configuration_files.zip, hive.log

Note: joins and SELECT operations are working. Below is the query output for the INSERT operation.

Insert into bq_test values ('Vishal', 33);
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = hduser_20220214125524_7b6db0f8-9792-4acb-b4f6-6e4de208f2fa
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Job running in-process (local Hadoop)
2022-02-14 12:55:27,341 Stage-3 map = 100%, reduce = 0%
Ended Job = job_local758927901_0001
MapReduce Jobs Launched:
Stage-Stage-3: HDFS Read: 7 HDFS Write: 7 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
Time taken: 3.203 seconds
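One quick way to test the "OutputCommitter is never called" suspicion is to filter the log for commit-related messages; in the transcript above there is no "is in the process of committing" or commitJob line at all. A rough, illustrative filter (not part of the storage handler), run here against a few lines from the transcript:

```python
# Rough filter to pull commit-related lines out of a Hive/MR log
# such as hive.log; the sample below is taken from the transcript above.
import re

sample_log = """\
2022-02-14 12:55:27,341 Stage-3 map = 100%, reduce = 0%
Ended Job = job_local758927901_0001
Stage-Stage-3: HDFS Read: 7 HDFS Write: 7 SUCCESS
"""

commit_pattern = re.compile(r"commit", re.IGNORECASE)
commit_lines = [line for line in sample_log.splitlines()
                if commit_pattern.search(line)]

# An empty result suggests the committer hooks were never invoked.
print(commit_lines)   # []
```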
michTalebzadeh commented 2 years ago

Hi

After adding the correct jar file to Hive as below

ADD JAR <LOCATION>/hive-bigquery-storage-handler-1.0-shaded.jar;

and also adding the jar file gcs-connector-hadoop3-2.2.5-shaded.jar (as suggested by Igor) to HADOOP_CLASSPATH, I can insert a row through the Hive table into the GBQ table as below

DROP TABLE IF EXISTS test.testmebq;
CREATE EXTERNAL TABLE test.testmebq (
  ID BIGINT
)
STORED BY 'com.google.cloud.hadoop.io.bigquery.hive.HiveBigQueryStorageHandler'
TBLPROPERTIES (
  'mapred.bq.input.dataset.id'='test',
  'mapred.bq.input.table.id'='testme',
  'mapred.bq.project.id'='xxx',
  'mapred.bq.temp.gcs.path'='gs://tmp_storage_bucket/tmp',
  'mapred.bq.gcs.bucket'='etcbucket/hive/test'
);
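As a small sanity check when debugging setups like this, the TBLPROPERTIES from the DDL above can be restated in one place and compared against the keys the handler is expected to read (the set below is taken from this DDL, not from the handler's source), to rule out a missing or misspelled property:

```python
# Illustrative sanity check: the property keys here are copied from the
# DDL in this thread, not derived from the storage handler's source code.
required = {
    "mapred.bq.input.dataset.id",
    "mapred.bq.input.table.id",
    "mapred.bq.project.id",
    "mapred.bq.temp.gcs.path",
    "mapred.bq.gcs.bucket",
}
tblproperties = {
    "mapred.bq.input.dataset.id": "test",
    "mapred.bq.input.table.id": "testme",
    "mapred.bq.project.id": "xxx",
    "mapred.bq.temp.gcs.path": "gs://tmp_storage_bucket/tmp",
    "mapred.bq.gcs.bucket": "etcbucket/hive/test",
}
missing = required - tblproperties.keys()
print(sorted(missing))   # []
```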

After creating a similar table test.testme in the GBQ dataset, with the id column in lowercase, the insert from Hive goes through with no error:

insert into test.testmebq values(11);
2022-03-07 18:10:17,159 INFO  [main] conf.HiveConf: Using the default value passed in for log id: 3bf1fd70-21ae-4648-8f4f-bcd12eb0d63b
2022-03-07 18:10:17,406 WARN  [3bf1fd70-21ae-4648-8f4f-bcd12eb0d63b main] ql.Driver: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = hduser_20220307181017_90752d3e-bb82-4f5a-94db-49afdf96cabf
Total jobs = 1
2022-03-07 18:10:17,406 INFO  [3bf1fd70-21ae-4648-8f4f-bcd12eb0d63b main] ql.Driver: Starting task [Stage-0:DDL] in serial mode
2022-03-07 18:10:17,420 INFO  [3bf1fd70-21ae-4648-8f4f-bcd12eb0d63b main] avro.AvroSerDe: AvroSerde::initialize(): Preset value of avro.schema.literal == null
2022-03-07 18:10:17,466 INFO  [3bf1fd70-21ae-4648-8f4f-bcd12eb0d63b main] ql.Driver: Starting task [Stage-1:DDL] in serial mode
2022-03-07 18:10:17,822 INFO  [LocalJobRunner Map Task Executor #0] exec.MapOperator: Initializing operator MAP[0]
2022-03-07 18:10:17,822 INFO  [LocalJobRunner Map Task Executor #0] mr.ExecMapper:
<MAP>Id =0
  <Children>null
  <\Children>
  <Parent><\Parent>
<\MAP>
2022-03-07 18:10:18,742 WARN  [3bf1fd70-21ae-4648-8f4f-bcd12eb0d63b main] mapreduce.Counters: Group org.apache.hadoop.mapred.Task$Counter is deprecated. Use org.apache.hadoop.mapreduce.TaskCounter instead
2022-03-07 18:10:18,742 Stage-2 map = 0%,  reduce = 0%
2022-03-07 18:10:19,187 INFO  [LocalJobRunner Map Task Executor #0] exec.FileSinkOperator: RECORDS_OUT_1_test.testmebq:1, RECORDS_OUT_INTERMEDIATE:0, RECORDS_OUT_OPERATOR_FS_5:1,
2022-03-07 18:10:19,187 INFO  [LocalJobRunner Map Task Executor #0] mapred.Task: Task:attempt_local378396178_0007_m_000000_0 is done. And is in the process of committing
2022-03-07 18:10:19,190 INFO  [LocalJobRunner Map Task Executor #0] mapred.Task: Task 'attempt_local378396178_0007_m_000000_0' done.
2022-03-07 18:10:19,190 INFO  [LocalJobRunner Map Task Executor #0] mapred.LocalJobRunner: Finishing task: attempt_local378396178_0007_m_000000_0
2022-03-07 18:10:19,190 INFO  [Thread-1079] mapred.LocalJobRunner: map task executor complete.
2022-03-07 18:10:19,745 Stage-2 map = 100%,  reduce = 0%
Ended Job = job_local378396178_0007
MapReduce Jobs Launched:
Stage-Stage-2:  HDFS Read: 19312718 HDFS Write: 0 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
2022-03-07 18:10:19,747 INFO  [3bf1fd70-21ae-4648-8f4f-bcd12eb0d63b main] ql.Driver: Completed executing command(queryId=hduser_20220307181017_90752d3e-bb82-4f5a-94db-49afdf96cabf); Time taken: 2.341 seconds
OK
2022-03-07 18:10:19,747 INFO  [3bf1fd70-21ae-4648-8f4f-bcd12eb0d63b main] lockmgr.DbTxnManager: Stopped heartbeat for query: hduser_20220307181017_90752d3e-bb82-4f5a-94db-49afdf96cabf
Time taken: 2.607 seconds
2022-03-07 18:10:19,775 INFO  [3bf1fd70-21ae-4648-8f4f-bcd12eb0d63b main] session.SessionState: Resetting thread name to main

However, the inserted row shows up in neither the Hive table nor the GBQ table!

0: jdbc:hive2://rhes75:10099/default> select * from testmebq;
+--------------+
| testmebq.id  |
+--------------+
+--------------+
No rows selected (2.512 seconds)

and neither does it appear in the GBQ table:

SELECT * FROM test.testme
Query complete (0.3 sec elapsed, 0 B processed)
This query returned no results.

In short: no error is thrown, but no rows are added!

michTalebzadeh commented 2 years ago

As I stated, this seems to be an issue with MR as the execution engine. The question now is why OutputCommitter is not being called, probably something to do with the jar, since the job completes with no error but produces no result. Are there any fixes that can be applied?
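One way to pin this down is to record which committer lifecycle hooks actually fire. The sketch below is a plain simulation with hypothetical names, not Hadoop code, but it captures what the transcript above suggests: LocalJobRunner reported the task as "in the process of committing", yet no job-level commit ever appears, so the temp output is never published.

```python
# Debugging sketch (hypothetical names, not Hadoop's API): record each
# lifecycle call so it is obvious which hooks the framework skipped.
class RecordingCommitter:
    def __init__(self):
        self.calls = []

    def setup_job(self):
        self.calls.append("setupJob")

    def commit_task(self):
        self.calls.append("commitTask")

    def commit_job(self):
        self.calls.append("commitJob")

# Simulate what the transcript suggests happened: the task committed,
# but the job-level commit never fired.
c = RecordingCommitter()
c.setup_job()
c.commit_task()
print(c.calls)                 # ['setupJob', 'commitTask']
print("commitJob" in c.calls)  # False
```

If a real run logged the same pattern, the fix would lie in whatever stops the framework from invoking the job-level commit, rather than in the record-writing path.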