ClickHouse / clickhouse-jdbc-bridge

A JDBC proxy from ClickHouse to external databases
Apache License 2.0
167 stars 60 forks source link

Can't select null field with presto jdbc ,because ResultSetMetaData.columnNullable is "unknow" #5

Closed LXHING999 closed 3 years ago

LXHING999 commented 5 years ago

image

2019-05-15 14:37:31,804 [ HTTP Handler-27 ] {QueryHandlerServlet} - query is SELECT "id", "subject_id", "paper_id", "name", "detail", "exam_type", "region_level", "regions", "clazz_levels", "school_level", "school_ids", "apply_start_at", "apply_stop_at", "exam_start_at", "exam_stop_at", "correct_stop_at", "result_issue_at", "duration_minutes", "submit_after_minutes", "question_sort", "status", "agent_name", "agent_code", "file_url", "file_name", "creator_id", "created_at", "updated_at", "group_id", "teacher_id", "deleted_at", "papers", "ranks", "dt" FROM "dw_livecast"."bak_bdm_mongo_exam" 2019-05-15 14:37:32,351 [ HTTP Handler-27 ] {QueryHandlerServlet} - java.lang.NullPointerException at com.google.common.base.Preconditions.checkNotNull(Preconditions.java:212) at ru.yandex.clickhouse.util.ClickHouseRowBinaryStream.writeString(ClickHouseRowBinaryStream.java:72) at ru.yandex.clickhouse.jdbcbridge.db.clickhouse.ClickHouseConverter.lambda$static$13(ClickHouseConverter.java:58) at ru.yandex.clickhouse.jdbcbridge.db.clickhouse.ClickHouseFieldSerializer.serialize(ClickHouseFieldSerializer.java:34) at ru.yandex.clickhouse.jdbcbridge.db.clickhouse.ClickHouseRowSerializer.serialize(ClickHouseRowSerializer.java:22) at ru.yandex.clickhouse.jdbcbridge.servlet.QueryHandlerServlet.doPost(QueryHandlerServlet.java:60) at javax.servlet.http.HttpServlet.service(HttpServlet.java:707) at javax.servlet.http.HttpServlet.service(HttpServlet.java:790) at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:865) at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1655) at ru.yandex.clickhouse.jdbcbridge.servlet.RequestLogger.doFilter(RequestLogger.java:32) at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1642) at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:533) at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:205) at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:473) at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:144) at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132) at org.eclipse.jetty.server.Server.handle(Server.java:503) at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:364) at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:260) at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:305) at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:103) at org.eclipse.jetty.io.ChannelEndPoint$2.run(ChannelEndPoint.java:118) at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.runTask(EatWhatYouKill.java:333) at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:310) at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:168) at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.produce(EatWhatYouKill.java:132) at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:765) at org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:683) at java.lang.Thread.run(Thread.java:748)

alex-krash commented 5 years ago

@LXHING999 , could you, please, provide: 1.) DDL for table in Hive/Presto 2.) Version of presto JDBC

LXHING999 commented 5 years ago

presto : presto-jdbc-310.jar ddl: CREATE TABLE dw_livecast.bak_bdm_mongo_exam( id string, subject_id string, paper_id string, name string, detail string, exam_type string, region_level string, regions string, clazz_levels string, school_level string, school_ids string, apply_start_at string, apply_stop_at string, exam_start_at string, exam_stop_at string, correct_stop_at string, result_issue_at string, duration_minutes string, submit_after_minutes string, question_sort string, status string, agent_name string, agent_code string, file_url string, file_name string, creator_id string, created_at string, updated_at string, group_id string, teacher_id string, deleted_at string, papers string, ranks string, dt string) ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.orc.OrcSerde' STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat' LOCATION 'hdfs://M7-10-6-0-87:8020/user/hive/warehouse/dw_livecast.db/bak_bdm_mongo_exam' TBLPROPERTIES ( 'COLUMN_STATS_ACCURATE'='true', 'numFiles'='42', 'numRows'='84883593', 'rawDataSize'='256133063360', 'totalSize'='2311198166', 'transient_lastDdlTime'='1557899913')

LXHING999 commented 5 years ago

i fix it by rewrite ClickHouseRowBinaryStream.java , commit Preconditions.checkNotNull(string); here is code : public void writeString(String string) throws IOException { // Preconditions.checkNotNull(string); if (null==string){ byte[] bytes = "".getBytes(StreamUtils.UTF_8); writeUnsignedLeb128(bytes.length); out.write(bytes); }else { byte[] bytes = string.getBytes(StreamUtils.UTF_8); writeUnsignedLeb128(bytes.length); out.write(bytes); } }

LXHING999 commented 5 years ago

use CREATE TABLE IF NOT EXISTS gdm_user_student_purchase_course_history_daily ENGINE = MergeTree order by tuple() as SELECT * FROM jdbc('jdbc:presto://10.7.8.83:8921/hive?user=1', 'dw_livecast', 'gdm_user_student_purchase_course_history_daily'); is slow ! 90 million row need 40 mins; CREATE TABLE dw_livecast.gdm_user_student_purchase_course_history_daily( student_id string COMMENT '学生ID', low_course_first_pay_datetime string COMMENT '低价课首次购买时间', low_course_first_order_id string COMMENT '低价课首次购买订单ID', low_course_latest_pay_datetime string COMMENT '低价课最近一次购买时间', low_course_latest_order_id string COMMENT '低价课最近一次购买订单ID', low_course_buy_count string COMMENT '低价课购买数量', normal_course_first_pay_datetime string COMMENT '正价课首次购买时间', normal_course_first_order_id string COMMENT '正价课首次购买订单ID', normal_course_latest_pay_datetime string COMMENT '正价课最近一次购买时间', normal_course_latest_order_id string COMMENT '正价课最近一次购买订单ID', normal_course_buy_count string COMMENT '正价课购买数量', course_first_pay_datetime string COMMENT '所有类型课首次购买时间', course_first_order_id string COMMENT '所有类型课首次购买订单ID', course_latest_pay_datetime string COMMENT '所有类型课最近一次购买时间', course_latest_order_id string COMMENT '所有类型课最近一次购买订单ID', course_buy_count string COMMENT '所有类型课购买数量') PARTITIONED BY ( dt string) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' LOCATION 'hdfs://M7-10-6-0-87:8020/user/hive/warehouse/dw_livecast.db/gdm_user_student_purchase_course_history_daily' TBLPROPERTIES ( 'transient_lastDdlTime'='1552451362')

alex-krash commented 5 years ago

@LXHING999 , and if just select from the table from cli clickhouse-client command?

SELECT * FROM jdbc('jdbc:presto://10.7.8.83:8921/hive?user=1', 'dw_livecast', 'gdm_user_student_purchase_course_history_daily') FORMAT Null

How much time does it takes?

zhicwu commented 3 years ago

Could you try lastest code on master branch? It should work.

zhicwu commented 3 years ago

Feel free to reopen if you still see the issue. I'm also interested in the performance issue, will it be better if you issue multiple queries each taking care of a partition of the dataset?