apache / seatunnel

SeaTunnel is a next-generation super high-performance, distributed, massive data integration tool.
https://seatunnel.apache.org/
Apache License 2.0

[Usage] kafka to hive #1625

Open better629 opened 2 years ago

better629 commented 2 years ago

Search before asking

Description

Has anyone successfully loaded data from Kafka into Hive? Can you share an example?

SeaTunnel 2.1.0 (compiled), Hive 2.5.7, Spark 2.4.0, Hadoop 2.8.0

I can sometimes load data from Kafka into the Hive table. Hive table file: db_test.db/test_table/user_10.db/part-00000-e7732d9e-e4ff-4499-8c6a-60216ba51fdb-c000.snappy.parquet

But I also get an error:

28899 [streaming-job-executor-0] INFO  hive.metastore  - Mestastore configuration hive.metastore.filter.hook changed from org.apache.hadoop.hive.metastore.DefaultMetaStoreFilterHookImpl to org.apache.hadoop.hive.ql.security.authorization.plugin.AuthorizationMetaStoreFilterHook
Exception in thread "streaming-job-executor-0" java.lang.NoSuchMethodError: com.facebook.fb303.FacebookService$Client.sendBaseOneway(Ljava/lang/String;Lorg/apache/thrift/TBase;)V
    at com.facebook.fb303.FacebookService$Client.send_shutdown(FacebookService.java:436)
    at com.facebook.fb303.FacebookService$Client.shutdown(FacebookService.java:430)
    at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.close(HiveMetaStoreClient.java:492)

But the table test_table does not appear in SHOW TABLES, and the data in the Parquet file is only part of the produced messages.
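As a general diagnostic, not specific to this setup: a runtime NoSuchMethodError usually means a different version of a library was loaded than the one the caller was compiled against. A small shell helper can list the thrift-related jars under a directory so mismatched libthrift/libfb303 versions are easy to spot; the directory paths in the usage comments are assumptions, adjust them to your installation.

```shell
# List libthrift/libfb303 jars under a directory, to spot version mismatches
# behind errors like FacebookService$Client.sendBaseOneway NoSuchMethodError.
list_thrift_jars() {
  find "$1" -name '*libthrift*.jar' -o -name '*fb303*.jar'
}

# Example usage (paths are assumptions; adjust to your installation):
# list_thrift_jars "$SPARK_HOME/jars"
# list_thrift_jars /path/to/seatunnel/lib
```

If two different libthrift versions (e.g. 0.9.0 and 0.9.3) show up across the Spark and SeaTunnel classpaths, that mismatch is a likely cause.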

Usage Scenario

-

Related issues

-

Are you willing to submit a PR?

Code of Conduct

Hisoka-X commented 2 years ago

Did you add any other dependencies? This error looks like a dependency conflict. Also, please provide your config.

better629 commented 2 years ago

No other dependencies, and the config is:

source {
  kafkaStream {
    topics = "test_topic"
    consumer.bootstrap.servers = "xxx1:9092,xxx2:9092,xxx3:9092"
    consumer.group.id = "seatunnel_group"
  }
}

transform {
  json {
    source_field = "raw_message"
    result_table_name = "res_table"
  }
}

sink {
  Hive {
    source_table_name = "res_table"
    result_table_name = "xxx.seatunnel_test"
    save_mode = "append"
    sink_columns = "id,key"
  }
}

ruanwenjun commented 2 years ago

@better629 Can you read and write Hive data successfully from your Spark environment? It seems this is caused by your Spark job not loading libfb303-0.9.3.jar; that jar should be in your ${spark_home}/jars.

better629 commented 2 years ago

@ruanwenjun The jar exists:

xxx]# find ./ -name "*fb303*"
./jars/libfb303-0.9.3.jar

someorz commented 2 years ago

@better629 I have the same problem. Did you solve it?

better629 commented 2 years ago

@someorz I tried SeaTunnel 1.5.7 and it works, so maybe give that a try. The problem above is still not solved.

someorz commented 2 years ago

@better629 Modify the maven-shade-plugin configuration in the root pom.xml to exclude org.apache.thrift:libthrift.


better629 commented 2 years ago

@someorz How did you find this?

someorz commented 2 years ago

@better629 Analyze the dependencies.
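One way to do this analysis from the command line is Maven's dependency tree; on the real project you would run `mvn dependency:tree -Dincludes=org.apache.thrift:libthrift` from the SeaTunnel source root. The tree excerpt below is a hypothetical illustration of what a version conflict looks like, not output from this repository:

```shell
# Sketch: spotting conflicting libthrift versions in `mvn dependency:tree` output.
# sample_tree is a HYPOTHETICAL excerpt; run the mvn command above for real data.
sample_tree='
+- org.apache.hive:hive-exec:jar:compile
|  \- org.apache.thrift:libthrift:jar:0.9.3:compile
+- org.example:some-connector:jar:compile
|  \- org.apache.thrift:libthrift:jar:0.9.0:compile
'
# Extract the distinct libthrift versions; more than one line means a conflict.
versions=$(printf '%s\n' "$sample_tree" | grep -o 'libthrift:jar:[0-9.]*' | sort -u)
echo "$versions"
```

Here the output lists both `libthrift:jar:0.9.0` and `libthrift:jar:0.9.3`, which is exactly the mismatch behind the NoSuchMethodError.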

ashulin commented 2 years ago

Snipaste_2022-06-07_15-23-30 Hive actually uses libthrift-0.9.3 instead of libthrift-0.9.0, but due to dependency conflicts, the packaged jar is libthrift-0.9.0. You can try to exclude the libthrift-0.9.0 dependency in the seatunnel-connectors-spark-dist module.