apache / doris

Apache Doris is an easy-to-use, high performance and unified analytics database.
https://doris.apache.org
Apache License 2.0
[Enhancement] support gzip/bzip2 for hive catalog #22339

Closed. alanredsheep closed this issue 1 year ago

alanredsheep commented 1 year ago

Description

Doris version: 1.2.6-rc03
Hive version: 1.1.0

For textfile tables, DataX hdfs-writer supports only gzip/bzip2 compression. See the docs: https://github.com/alibaba/DataX/blob/master/hdfswriter/doc/hdfswriter.md

But Doris cannot read a Hive textfile table compressed with gzip. Here is the error message:

ERROR 1105 (HY000): errCode = 2, detailMessage = (xx.xx.xx.xx)[INTERNAL_ERROR]Only support csv data in utf8 codec

I use DataX to load data from MySQL into Hive, so there are many gzip-compressed textfile tables in Hive. I really need the Hive catalog to support gzip/bzip2 compression.

Solution

This issue is similar to the one in https://github.com/apache/doris/pull/19387. The solution may be similar too.

alanredsheep commented 1 year ago

@dutyu

dutyu commented 1 year ago

Can you paste more detailed logs for this error? You can find the error log in fe.warn.log.

alanredsheep commented 1 year ago

@dutyu Here is the error message in fe.warn.log.

2023-07-31 14:16:44,233 WARN (thrift-server-pool-5|3008) [Coordinator.updateFragmentExecStatus():1737] one instance report fail, query_id=24a4dd194fb54409-93d758621bb108f8 instance_id=24a4dd194fb54409-93d758621bb108f9, error message: (xx.xx.xx.xx)[INTERNAL_ERROR]Only support csv data in utf8 codec
2023-07-31 14:16:44,234 WARN (thrift-server-pool-5|3008) [Coordinator.updateStatus():875] one instance report fail throw updateStatus(), need cancel. job id: -1, query id: 24a4dd194fb54409-93d758621bb108f8, instance id: 24a4dd194fb54409-93d758621bb108f9, error message: (xx.xx.xx.xx)[INTERNAL_ERROR]Only support csv data in utf8 codec
2023-07-31 14:16:44,237 WARN (mysql-nio-pool-126|20311) [Coordinator.getNext():895] get next fail, need cancel. query id: 24a4dd194fb54409-93d758621bb108f8
2023-07-31 14:16:44,237 WARN (mysql-nio-pool-126|20311) [Coordinator.getNext():915] query failed: (xx.xx.xx.xx)[INTERNAL_ERROR]Only support csv data in utf8 codec
2023-07-31 14:16:44,237 WARN (mysql-nio-pool-126|20311) [StmtExecutor.sendResult():1212] cancel fragment query_id:24a4dd194fb54409-93d758621bb108f8 cause errCode = 2, detailMessage = (xx.xx.xx.xx)[INTERNAL_ERROR]Only support csv data in utf8 codec
2023-07-31 14:16:44,237 WARN (mysql-nio-pool-126|20311) [StmtExecutor.execute():589] execute Exception. stmt[553, 24a4dd194fb54409-93d758621bb108f8]
org.apache.doris.common.UserException: errCode = 2, detailMessage = (xx.xx.xx.xx)[INTERNAL_ERROR]Only support csv data in utf8 codec
        at org.apache.doris.qe.Coordinator.getNext(Coordinator.java:922) ~[doris-fe.jar:1.2-SNAPSHOT]
        at org.apache.doris.qe.StmtExecutor.sendResult(StmtExecutor.java:1156) ~[doris-fe.jar:1.2-SNAPSHOT]
        at org.apache.doris.qe.StmtExecutor.handleQueryStmt(StmtExecutor.java:1122) ~[doris-fe.jar:1.2-SNAPSHOT]
        at org.apache.doris.qe.StmtExecutor.execute(StmtExecutor.java:522) ~[doris-fe.jar:1.2-SNAPSHOT]
        at org.apache.doris.qe.StmtExecutor.execute(StmtExecutor.java:409) ~[doris-fe.jar:1.2-SNAPSHOT]
        at org.apache.doris.qe.ConnectProcessor.handleQuery(ConnectProcessor.java:330) ~[doris-fe.jar:1.2-SNAPSHOT]
        at org.apache.doris.qe.ConnectProcessor.dispatch(ConnectProcessor.java:473) ~[doris-fe.jar:1.2-SNAPSHOT]
        at org.apache.doris.qe.ConnectProcessor.processOnce(ConnectProcessor.java:700) ~[doris-fe.jar:1.2-SNAPSHOT]
        at org.apache.doris.mysql.ReadListener.lambda$handleEvent$0(ReadListener.java:52) ~[doris-fe.jar:1.2-SNAPSHOT]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_362]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_362]
        at java.lang.Thread.run(Thread.java:750) ~[?:1.8.0_362]

dutyu commented 1 year ago

OK, I will look at this problem later.

dutyu commented 1 year ago

I've read the source code. Doris uses the file's suffix to determine its file type: it recognizes a file as gzip only when the suffix is '.gz'. So please check the file's suffix.
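
The suffix-based detection described above can be sketched roughly as follows. This is only an illustration, not Doris's actual code: the function name `infer_compression` and the lookup table are hypothetical, and only the '.gz' rule comes from this thread (the '.bz2' entry is an assumption).

```python
import os

# Hypothetical suffix table. Only the ".gz" rule is stated in this thread;
# the ".bz2" entry is an assumption added for illustration.
SUFFIX_TO_COMPRESSION = {
    ".gz": "gzip",
    ".bz2": "bzip2",
}


def infer_compression(path: str) -> str:
    """Guess the compression type from the file suffix alone."""
    _, suffix = os.path.splitext(path)
    # Files without a known suffix are treated as uncompressed text.
    return SUFFIX_TO_COMPRESSION.get(suffix.lower(), "plain")
```

Under this scheme a file named `000000_0.gz` would be treated as gzip, while a gzip file written without a suffix would be read as plain text.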

alanredsheep commented 1 year ago

@dutyu I've checked the Hive files. The suffix is '.gz', but Doris 1.2.6 still cannot read this table.

[two images attached]

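
Besides the suffix, a file's content can be checked directly: gzip streams begin with the magic bytes 0x1f 0x8b (RFC 1952). The sketch below is a generic diagnostic, not part of Doris, and the helper names are hypothetical. It also shows that gzip output is not valid UTF-8, which may be why a reader that treats such a file as plain CSV reports a codec error (an assumption, not confirmed in this thread).

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip member (RFC 1952)


def looks_like_gzip(data: bytes) -> bool:
    """Check the leading magic bytes instead of trusting the file name."""
    return data[:2] == GZIP_MAGIC


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Made-up CSV rows, compressed in memory for the demonstration.
sample = gzip.compress("1,alice\n2,bob\n".encode("utf-8"))

print(looks_like_gzip(sample))                  # gzip header present
print(is_valid_utf8(sample))                    # raw compressed bytes
print(is_valid_utf8(gzip.decompress(sample)))   # decompressed CSV text
```

For a file stored in HDFS, one option is to copy it locally first and pass its leading bytes to `looks_like_gzip`.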
dutyu commented 1 year ago

Hello, you can add me on WeChat: 24663660

alanredsheep commented 1 year ago

It seems this enhancement has been added in Doris 2.0.