StarRocks / starrocks

StarRocks, a Linux Foundation project, is a next-generation sub-second MPP OLAP database for full analytics scenarios, including multi-dimensional analytics, real-time analytics, and ad-hoc queries.
https://starrocks.io
Apache License 2.0

Query on a Hudi external table fails when the Hudi table was created in spark-shell #4404

Closed tiannan-sr closed 2 years ago

tiannan-sr commented 2 years ago

Steps to reproduce the behavior (Required)

  1. Create the Hudi table in spark-shell:
    
    import scala.collection.JavaConversions._
    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.SaveMode._
    import org.apache.spark.sql.types._
    import org.apache.spark.sql.types.DataType
    import org.apache.spark.sql.Row
    import java.sql._
    import org.apache.hudi.common.table.HoodieTableConfig._
    import org.apache.hudi.config.HoodieWriteConfig._
    import org.apache.hudi.config.HoodieStorageConfig._

    val rows = Seq(
      Row(1,true,1,10000L,1.001F,1.0001,Decimal(12345678901234567890.123456789012345678),Date.valueOf("2020-01-01"),Timestamp.valueOf("2020-01-01 00:00:01"),"Top 10 Unsolved Mysteries of Paleontological Dinosaurs, Did You Know?",List(1,2,3,4),"hello".getBytes("utf-8")),
      Row(2,true,2,20000L,2.002F,2.0002,Decimal(20.02),Date.valueOf("2020-02-01"),Timestamp.valueOf("2020-02-01 02:00:01"),"Xi Jinping, general secretary of the Communist Party of China Central Committee, called on Tuesday for unrelenting efforts in exercising full and rigorous governance over the Party, saying that the CPC will continue to show zero tolerance for corruption.",List(2,3,4,5),"nihao".getBytes("utf-8")),
      Row(3,true,3,30000L,3.003F,3.0003,Decimal(34230.02),Date.valueOf("2020-03-01"),Timestamp.valueOf("2020-03-01 03:00:01"),"Nation raises caution on overseas packages",List(3,3,3,4),"hi".getBytes("utf-8")),
      Row(4,true,4,40000L,4.004F,4.0004,Decimal(33453534523520.02435235),Date.valueOf("2020-04-01"),Timestamp.valueOf("2020-04-01 04:00:01"),"Shanghai, Shenzhen register most newly listed firms in 2021",List(4,4,4,4),"hello,world".getBytes("utf-8")),
      Row(5,true,5,50000L,5.555F,5.0005,Decimal(33453555534523520.555502435235),Date.valueOf("2020-05-01"),Timestamp.valueOf("2020-05-01 05:00:01"),"A total of 4,685 companies have been listed on the A-share market in China as of Dec 31, 2021, with 46 percent of them based in Beijing, Shanghai, Shenzhen, Hangzhou, Suzhou, Guangzhou, Ningbo, Nanjing, Wuxi, and Chengdu, said a report of National Business Daily on Wednesday.",List(5,50,500,5000),"hello,555".getBytes("utf-8")),
      Row(6,true,6,60000L,6.006F,6.0006,Decimal(66633453534523520.666602435235),Date.valueOf("2020-06-01"),Timestamp.valueOf("2020-06-01 04:00:01"),"The development of enterprises, as well as the push of local governments, has boosted the listings, the report said. For example, in early 2018, Central China''s Hubei province launched a plan to double its listed companies by including the listing of enterprises into the government''s annual performance assessment.",List(6,60,600,6004),"666hello".getBytes("utf-8")),
      Row(7,true,7,70000L,7.7774F,7.0777,Decimal(777733453534523520.77702435235),Date.valueOf("2020-07-01"),Timestamp.valueOf("2020-07-01 07:00:01"),"Most listed companies in the top 10 cities come from industries such as information technology, electronics, mechanical equipment, medical biology, and electric power equipment, with information technology bearing most listed companies.",List(7,7,7,70),"777".getBytes("utf-8")),
      Row(8,true,8,80000L,8.888F,8.08888,Decimal(88833453534523520.888802435235),Date.valueOf("2020-08-01"),Timestamp.valueOf("2020-08-01 08:00:01"),"Citi report finds MNC mood in China buoyant",List(8,84,884,88884),"8888world".getBytes("utf-8")),
      Row(9,true,9,99990L,9.9999F,9.99999,Decimal(9999933453534523520.999902435235),Date.valueOf("2020-09-01"),Timestamp.valueOf("2020-09-01 09:00:09"),"Xi: China, Russia major champions of multilateralism, global justice",List(99,9,999,9999),"9999,world".getBytes("utf-8")),
      Row(10,false,10,10100000L,10.1001F,10.00010,Decimal(1033453534523520.102435235),Date.valueOf("2020-10-01"),Timestamp.valueOf("2020-10-01 10:10:10"),"Volkswagen deliveries hit by chip shortages",List(10,10,1010,101010),"hello,10".getBytes("utf-8")),
      Row(11,false,11,1110000L,11.0011F,11.1100011,Decimal(11133453534523520.11102435235),Date.valueOf("2020-11-01"),Timestamp.valueOf("2020-11-01 11:11:01"),"Light installations featuring the tiger, the Chinese zodiac animal for the upcoming lunar year, will be set up in the main plaza.",List(11,111,1111,11111),"hello,111111".getBytes("utf-8")),
      Row(12,false,12,120000L,12.0012F,12.00012,Decimal(121233453534523520.121202435235),Date.valueOf("2020-12-01"),Timestamp.valueOf("2020-12-01 12:12:12"),"The annual lantern show at Yuyuan Garden, a historic tourist site in Shanghai, will kick off on Tuesday and run through Feb 28.",List(12,1212,121212,121212),"h12d".getBytes("utf-8")),
      Row(13,null,null,null,null,null,null,null,null,null,null,"null".getBytes("utf-8")),
      Row(14,null,null,null,null,null,null,null,null," ",List(14,1414,141414,14141414),"1414".getBytes("utf-8")))

    val schema = StructType(
      StructField("uuid", IntegerType, nullable = false).withComment("comment uuid") ::
      StructField("col_boolean", BooleanType, nullable = true).withComment("comment col_boolean") ::
      StructField("col_int", IntegerType, nullable = true).withComment("comment col_int") ::
      StructField("col_long", LongType, nullable = true).withComment("comment col_long") ::
      StructField("col_float", FloatType, nullable = true).withComment("comment col_float") ::
      StructField("col_double", DoubleType, nullable = true).withComment("comment col_double") ::
      StructField("col_decimal", DecimalType(38,18), nullable = true).withComment("comment col_decimal") ::
      StructField("col_date", DateType, nullable = true).withComment("comment col_date") ::
      StructField("col_timestamp", TimestampType, nullable = true).withComment("comment col_timestamp") ::
      StructField("col_string", StringType, nullable = true).withComment("comment col_string") ::
      StructField("col_array", ArrayType(IntegerType, true), nullable = true).withComment("comment col_array") ::
      StructField("col_binary", BinaryType, nullable = true).withComment("comment col_binary") ::
      Nil)

    val df: DataFrame = spark.createDataFrame(rows, schema)

    df.write.format("hudi").
      option("hoodie.table.name", "hudi_parquet_gzip_spark_shell").
      option("hoodie.datasource.write.precombine.field", "uuid").
      option("hoodie.datasource.write.recordkey.field", "uuid").
      option("hoodie.datasource.write.table.type", "COPY_ON_WRITE").
      option("hoodie.table.base.file.format", "PARQUET").
      option("hoodie.parquet.compression.codec", "GZIP").
      mode(Overwrite).
      saveAsTable("hudi_db.hudi_parquet_gzip_spark_shell")

    spark.sql("select * from hudi_db.hudi_parquet_gzip_spark_shell").show(false)


2. Querying the Hudi table in spark-sql succeeds:

    spark-sql> select * from hudi_db.hudi_parquet_gzip_spark_shell order by uuid;
    20220324163859449 20220324163859449_0_11 1 8363b295-d18e-4ae1-9766-039dc7675f34-0_0-33-1617_20220324163859449.parquet 1 true 1 10000 1.001 1.0001 12345678901234567000.000000000000000000 2020-01-01 2020-01-01 00:00:01 Top 10 Unsolved Mysteries of Paleontological Dinosaurs, Did You Know? [1,2,3,4] hello
    20220324163859449 20220324163859449_0_12 2 8363b295-d18e-4ae1-9766-039dc7675f34-0_0-33-1617_20220324163859449.parquet 2 true 2 20000 2.002 2.0002 20.020000000000000000 2020-02-01 2020-02-01 02:00:01 Xi Jinping, general secretary of the Communist Party of China Central Committee, called on Tuesday for unrelenting efforts in exercising full and rigorous governance over the Party, saying that the CPC will continue to show zero tolerance for corruption. [2,3,4,5] nihao
    20220324163859449 20220324163859449_0_3 3 8363b295-d18e-4ae1-9766-039dc7675f34-0_0-33-1617_20220324163859449.parquet 3 true 3 30000 3.003 3.0003 34230.020000000000000000 2020-03-01 2020-03-01 03:00:01 Nation raises caution on overseas packages [3,3,3,4] hi
    20220324163859449 20220324163859449_0_5 4 8363b295-d18e-4ae1-9766-039dc7675f34-0_0-33-1617_20220324163859449.parquet 4 true 4 40000 4.004 4.0004 33453534523520.023000000000000000 2020-04-01 2020-04-01 04:00:01 Shanghai, Shenzhen register most newly listed firms in 2021 [4,4,4,4] hello,world
    20220324163859449 20220324163859449_0_7 5 8363b295-d18e-4ae1-9766-039dc7675f34-0_0-33-1617_20220324163859449.parquet 5 true 5 50000 5.555 5.0005 33453555534523520.000000000000000000 2020-05-01 2020-05-01 05:00:01 A total of 4,685 companies have been listed on the A-share market in China as of Dec 31, 2021, with 46 percent of them based in Beijing, Shanghai, Shenzhen, Hangzhou, Suzhou, Guangzhou, Ningbo, Nanjing, Wuxi, and Chengdu, said a report of National Business Daily on Wednesday. [5,50,500,5000] hello,555
    20220324163859449 20220324163859449_0_1 6 8363b295-d18e-4ae1-9766-039dc7675f34-0_0-33-1617_20220324163859449.parquet 6 true 6 60000 6.006 6.0006 66633453534523520.000000000000000000 2020-06-01 2020-06-01 04:00:01 The development of enterprises, as well as the push of local governments, has boosted the listings, the report said. For example, in early 2018, Central China''s Hubei province launched a plan to double its listed companies by including the listing of enterprises into the government''s annual performance assessment. [6,60,600,6004] 666hello
    20220324163859449 20220324163859449_0_10 7 8363b295-d18e-4ae1-9766-039dc7675f34-0_0-33-1617_20220324163859449.parquet 7 true 7 70000 7.7774 7.0777 777733453534523520.000000000000000000 2020-07-01 2020-07-01 07:00:01 Most listed companies in the top 10 cities come from industries such as information technology, electronics, mechanical equipment, medical biology, and electric power equipment, with information technology bearing most listed companies. [7,7,7,70] 777
    20220324163859449 20220324163859449_0_2 8 8363b295-d18e-4ae1-9766-039dc7675f34-0_0-33-1617_20220324163859449.parquet 8 true 8 80000 8.888 8.08888 88833453534523520.000000000000000000 2020-08-01 2020-08-01 08:00:01 Citi report finds MNC mood in China buoyant [8,84,884,88884] 8888world
    20220324163859449 20220324163859449_0_13 9 8363b295-d18e-4ae1-9766-039dc7675f34-0_0-33-1617_20220324163859449.parquet 9 true 9 99990 9.9999 9.99999 9999933453534523000.000000000000000000 2020-09-01 2020-09-01 09:00:09 Xi: China, Russia major champions of multilateralism, global justice [99,9,999,9999] 9999,world
    20220324163859449 20220324163859449_0_14 10 8363b295-d18e-4ae1-9766-039dc7675f34-0_0-33-1617_20220324163859449.parquet 10 false 10 10100000 10.1001 10.0001 1033453534523520.100000000000000000 2020-10-01 2020-10-01 10:10:10 Volkswagen deliveries hit by chip shortages [10,10,1010,101010] hello,10
    20220324163859449 20220324163859449_0_4 11 8363b295-d18e-4ae1-9766-039dc7675f34-0_0-33-1617_20220324163859449.parquet 11 false 11 1110000 11.0011 11.1100011 11133453534523520.000000000000000000 2020-11-01 2020-11-01 11:11:01 Light installations featuring the tiger, the Chinese zodiac animal for the upcoming lunar year, will be set up in the main plaza. [11,111,1111,11111] hello,111111
    20220324163859449 20220324163859449_0_6 12 8363b295-d18e-4ae1-9766-039dc7675f34-0_0-33-1617_20220324163859449.parquet 12 false 12 120000 12.0012 12.00012 121233453534523520.000000000000000000 2020-12-01 2020-12-01 12:12:12 The annual lantern show at Yuyuan Garden, a historic tourist site in Shanghai, will kick off on Tuesday and run through Feb 28. [12,1212,121212,121212] h12d
    20220324163859449 20220324163859449_0_8 13 8363b295-d18e-4ae1-9766-039dc7675f34-0_0-33-1617_20220324163859449.parquet 13 NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL null
    20220324163859449 20220324163859449_0_9 14 8363b295-d18e-4ae1-9766-039dc7675f34-0_0-33-1617_20220324163859449.parquet 14 NULL NULL NULL NULL NULL NULL NULL NULL [14,1414,141414,14141414] 1414
    Time taken: 1.139 seconds, Fetched 14 row(s)


3. Run desc formatted on the Hudi table:

    spark-sql> desc formatted hudi_db.hudi_parquet_gzip_spark_shell;
    uuid int comment uuid
    col_boolean boolean comment col_boolean
    col_int int comment col_int
    col_long bigint comment col_long
    col_float float comment col_float
    col_double double comment col_double
    col_decimal decimal(38,18) comment col_decimal
    col_date date comment col_date
    col_timestamp timestamp comment col_timestamp
    col_string string comment col_string
    col_array array<int> comment col_array
    col_binary binary comment col_binary

    # Detailed Table Information
    Database hudi_db
    Table hudi_parquet_gzip_spark_shell
    Owner root
    Created Time Thu Mar 24 16:39:32 CST 2022
    Last Access UNKNOWN
    Created By Spark 3.1.2
    Type MANAGED
    Provider hudi
    Statistics 439454 bytes
    Location hdfs://emr-header-1.cluster-49155:9000/user/hive/warehouse/hudi_db.db/hudi_parquet_gzip_spark_shell
    Serde Library org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
    InputFormat org.apache.hadoop.mapred.SequenceFileInputFormat
    OutputFormat org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
    Storage Properties [hoodie.datasource.write.precombine.field=uuid, hoodie.table.base.file.format=PARQUET, hoodie.parquet.compression.codec=GZIP, hoodie.datasource.write.recordkey.field=uuid, hoodie.table.name=hudi_parquet_gzip_spark_shell, hoodie.datasource.write.table.type=COPY_ON_WRITE]
    Time taken: 0.092 seconds, Fetched 28 row(s)


4. Create the Hudi external table in StarRocks and query it:

    create external table ex_hudi_tbl_parquet_gzip (
      uuid int,
      col_boolean boolean,
      col_int int,
      col_long bigint,
      col_float float,
      col_double double,
      col_decimal decimal(38,18),
      col_date date,
      col_string string,
      col_array array<int>,
      col_binary varchar(200)
    ) ENGINE=hudi
    properties (
      "resource" = "hudi_emr_tn",
      "table" = "hudi_parquet_gzip_spark_shell",
      "database" = "hudi_db"
    );

    mysql> select * from ex_hudi_tbl_parquet_gzip;
    ERROR 1064 (HY000): com.starrocks.common.DdlException: get partition detail failed: com.starrocks.common.DdlException: get hive partition meta data failed: unsupported file format [org.apache.hadoop.mapred.SequenceFileInputFormat]


### Expected behavior (Required)
The query returns the correct result.

### Real behavior (Required)
The query returns an error.

### StarRocks version (Required)
 - You can get the StarRocks version by executing SQL `select current_version()`

    mysql> select current_version();
    +------------------------+
    | current_version()      |
    +------------------------+
    | QA_TEST_MASTER ac2c40e |
    +------------------------+
    1 row in set (0.00 sec)

miomiocat commented 2 years ago

I will fix it.