Open alberttwong opened 1 year ago
In Beeline you can see that I can run show tables and query data in the tables.
atwong@Alberts-MBP docker % docker exec -it adhoc-2 /bin/bash
root@adhoc-2:/opt# beeline -u jdbc:hive2://hiveserver:10000 \
> --hiveconf hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat \
> --hiveconf hive.stats.autogather=false
Connecting to jdbc:hive2://hiveserver:10000
23/05/13 02:21:33 INFO jdbc.Utils: Supplied authorities: hiveserver:10000
23/05/13 02:21:33 INFO jdbc.Utils: Resolved authority: hiveserver:10000
23/05/13 02:21:33 INFO jdbc.HiveConnection: Will try to open client transport with JDBC Uri: jdbc:hive2://hiveserver:10000
Connected to: Apache Hive (version 2.3.3)
Driver: Hive JDBC (version 1.2.1.spark2)
Transaction isolation: TRANSACTION_REPEATABLE_READ
Beeline version 1.2.1.spark2 by Apache Hive
0: jdbc:hive2://hiveserver:10000> show tables;
+---------------------+--+
| tab_name |
+---------------------+--+
| stock_ticks_cow |
| stock_ticks_mor_ro |
| stock_ticks_mor_rt |
+---------------------+--+
3 rows selected (0.081 seconds)
0: jdbc:hive2://hiveserver:10000> select count(*) from stock_ticks_cow
0: jdbc:hive2://hiveserver:10000> ;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
+------+--+
| _c0 |
+------+--+
| 197 |
+------+--+
1 row selected (3.117 seconds)
Showing the output in StarRocks:
StarRocks > set catalog hudi_catalog_hms;
Query OK, 0 rows affected (0.01 sec)
StarRocks > use default;
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A
Database changed
StarRocks > show tables;
+--------------------+
| Tables_in_default |
+--------------------+
| stock_ticks_cow |
| stock_ticks_mor_ro |
| stock_ticks_mor_rt |
+--------------------+
3 rows in set (0.02 sec)
StarRocks > select count(*) from stock_ticks_cow;
ERROR 1064 (HY000): com.google.common.util.concurrent.UncheckedExecutionException: java.lang.NullPointerException: null value in entry: date=2018-08-31=null
This issue is due to StarRocks relying on metadata from the Hive metastore for Hudi queries. We will fix this later. Currently, there is a workaround: after Hudi syncs to HMS using run_sync_tool.sh, use Hive or Spark to run msck repair table stock_ticks_cow; then querying stock_ticks_cow in StarRocks will work. If it does not, please run refresh external table stock_ticks_cow in StarRocks.
Just to confirm: this is an issue with our Apache Hudi integration and not something Hudi would fix.
Confirmed with StarRocks engineering that it is an issue on StarRocks' side.
It worked when I applied the repair and refresh. I also tried it again and I only needed to do the refresh command to make it work.
atwong@Alberts-MacBook-Pro docker % docker exec -it adhoc-2 /bin/bash
root@adhoc-2:/opt# beeline -u jdbc:hive2://hiveserver:10000 \
> --hiveconf hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat \
> --hiveconf hive.stats.autogather=false
Connecting to jdbc:hive2://hiveserver:10000
23/05/17 16:26:50 INFO jdbc.Utils: Supplied authorities: hiveserver:10000
23/05/17 16:26:50 INFO jdbc.Utils: Resolved authority: hiveserver:10000
23/05/17 16:26:50 INFO jdbc.HiveConnection: Will try to open client transport with JDBC Uri: jdbc:hive2://hiveserver:10000
Connected to: Apache Hive (version 2.3.3)
Driver: Hive JDBC (version 1.2.1.spark2)
Transaction isolation: TRANSACTION_REPEATABLE_READ
Beeline version 1.2.1.spark2 by Apache Hive
0: jdbc:hive2://hiveserver:10000> msck repair table stock_ticks_cow;
No rows affected (0.129 seconds)
0: jdbc:hive2://hiveserver:10000>
StarRocks > select count(*) from stock_ticks_cow;
ERROR 1064 (HY000): com.google.common.util.concurrent.UncheckedExecutionException: java.lang.NullPointerException: null value in entry: date=2018-08-31=null
StarRocks > refresh external table stock_ticks_cow;
Query OK, 0 rows affected (0.11 sec)
StarRocks > select count(*) from stock_ticks_cow;
+----------+
| count(*) |
+----------+
| 197 |
+----------+
1 row in set (1.23 sec)
We have marked this issue as stale because it has been inactive for 6 months. If this issue is still relevant, removing the stale label or adding a comment will keep it active. Otherwise, we'll close it in 10 days to keep the issue queue tidy. Thank you for your contribution to StarRocks!
I think this needs to be reopened; trying today with the 3.2.2 allin1 image and Hudi 0.14.1, I'm seeing:
StarRocks > select count(*) from stock_ticks_cow;
ERROR 1064 (HY000): com.google.common.util.concurrent.UncheckedExecutionException: java.lang.NullPointerException: null value in entry: date=2018-08-31=null
Issue a refresh and try the query:
StarRocks > refresh external table stock_ticks_cow;
Query OK, 0 rows affected (0.14 sec)
StarRocks > select count(*) from stock_ticks_cow;
+----------+
| count(*) |
+----------+
| 197 |
+----------+
1 row in set (1.46 sec)
The main cause of this issue is that we use HMS to get metadata, and in Hudi the user needs to enable metadata sync to HMS when Spark or Flink modifies a Hudi table. That is, the user needs to set hoodie.datasource.hive_sync.enable to true; it is false by default (https://hudi.apache.org/docs/configurations/). @DanRoscigno Dan, could you please check it?
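For reference, the relevant sync options from the linked Hudi configurations page would look roughly like the fragment below. The database, table, and metastore URI values are placeholders for this Docker setup, not verified against the demo compose file:

```properties
# Sync Hudi table metadata to the Hive metastore on every write
hoodie.datasource.hive_sync.enable=true
hoodie.datasource.hive_sync.mode=hms
hoodie.datasource.hive_sync.database=default
hoodie.datasource.hive_sync.table=stock_ticks_cow
hoodie.datasource.hive_sync.metastore.uris=thrift://hivemetastore:9083
```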
I will check, thanks so much!
@wangsimo0 Can you show me how to configure this? I am using the Docker Demo at https://hudi.apache.org/docs/docker_demo/
I tried these env vars in the containers:
HIVE_SITE_CONF_hive_sync_enable=true
HIVE_SITE_CONF_hive_sync_db=default
HIVE_SITE_CONF_hive_sync_mode=hms
But no change; I still have to run refresh external table <tablename>;
@DanRoscigno By "still have to refresh", do you mean you get an error after the select? That is possible if you are querying new partitions.

The core reason is the same as with selecting from Hive: we cache metadata in StarRocks, so if the Hudi table is being ingested or updated, StarRocks cannot see the latest information, and we do not yet support refreshing the metadata cache periodically, so there is no way for us to know about metadata updates. An error may therefore occur when a user queries new partitions, because StarRocks has no cache entries for those partitions. Also, if an old partition is updated, StarRocks will return the old data because of the cache. This is absolutely not user-friendly.

We are planning to use the Hudi SDK to get Hudi metadata and solve this problem completely; however, we don't have sufficient manpower and are seeking community developers who are interested in working on this with us. For now, unfortunately, we do have this limitation.
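The caching behavior described above can be sketched as follows. This is purely illustrative Python, not StarRocks code; the class and its methods are invented to model the symptom (errors on new partitions, stale data on updated ones, fixed by an explicit refresh):

```python
class PartitionMetaCache:
    """Toy model of a metadata cache sitting in front of HMS (illustrative only)."""

    def __init__(self, hms):
        self.hms = hms          # source of truth, e.g. the Hive metastore
        self.cache = None       # snapshot of partition -> metadata

    def query(self, partition):
        if self.cache is None:
            self.cache = dict(self.hms)   # populate once, then reuse
        if partition not in self.cache:
            # A partition written after the snapshot has no cache entry,
            # which surfaces to the user as a query error.
            raise KeyError(f"unknown partition: {partition}")
        return self.cache[partition]      # may be stale if HMS changed

    def refresh(self):
        self.cache = None                 # next query re-reads the metastore


hms = {"date=2018-08-31": "v1"}
cache = PartitionMetaCache(hms)
assert cache.query("date=2018-08-31") == "v1"

hms["date=2018-09-01"] = "v1"             # new partition lands in HMS
try:
    cache.query("date=2018-09-01")        # fails: the cache predates it
except KeyError:
    pass

cache.refresh()                           # REFRESH EXTERNAL TABLE analogue
assert cache.query("date=2018-09-01") == "v1"
```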
After a refresh everything seems fine. Note that with OneHouse Hudi @alberttwong did not have to refresh, everything worked on the first try. I will update the docs to include a refresh of each table. Please let me know when this changes and I will remove the refresh from the docs. Thanks for explaining this to me and your help @wangsimo0
We have marked this issue as stale because it has been inactive for 6 months. If this issue is still relevant, removing the stale label or adding a comment will keep it active. Otherwise, we'll close it in 10 days to keep the issue queue tidy. Thank you for your contribution to StarRocks!
I hit the same exception when using StarRocks to read AWS Glue external tables.
org.jkiss.dbeaver.model.sql.DBSQLException: SQL Error [1064] [42000]: com.google.common.util.concurrent.UncheckedExecutionException: java.lang.NullPointerException: null value in entry: dt=2024-07-09/hour=11=null
at org.jkiss.dbeaver.model.impl.jdbc.exec.JDBCStatementImpl.executeStatement(JDBCStatementImpl.java:133)
at org.jkiss.dbeaver.ui.editors.sql.execute.SQLQueryJob.executeStatement(SQLQueryJob.java:615)
at org.jkiss.dbeaver.ui.editors.sql.execute.SQLQueryJob.lambda$2(SQLQueryJob.java:506)
at org.jkiss.dbeaver.model.exec.DBExecUtils.tryExecuteRecover(DBExecUtils.java:192)
at org.jkiss.dbeaver.ui.editors.sql.execute.SQLQueryJob.executeSingleQuery(SQLQueryJob.java:525)
at org.jkiss.dbeaver.ui.editors.sql.execute.SQLQueryJob.extractData(SQLQueryJob.java:977)
at org.jkiss.dbeaver.ui.editors.sql.SQLEditor$QueryResultsContainer.readData(SQLEditor.java:4176)
at org.jkiss.dbeaver.ui.controls.resultset.ResultSetJobDataRead.lambda$0(ResultSetJobDataRead.java:123)
at org.jkiss.dbeaver.model.exec.DBExecUtils.tryExecuteRecover(DBExecUtils.java:192)
at org.jkiss.dbeaver.ui.controls.resultset.ResultSetJobDataRead.run(ResultSetJobDataRead.java:121)
at org.jkiss.dbeaver.ui.controls.resultset.ResultSetViewer$ResultSetDataPumpJob.run(ResultSetViewer.java:5160)
at org.jkiss.dbeaver.model.runtime.AbstractJob.run(AbstractJob.java:117)
at org.eclipse.core.internal.jobs.Worker.run(Worker.java:63)
Caused by: java.sql.SQLSyntaxErrorException: com.google.common.util.concurrent.UncheckedExecutionException: java.lang.NullPointerException: null value in entry: dt=2024-07-09/hour=11=null
at com.mysql.cj.jdbc.exceptions.SQLError.createSQLException(SQLError.java:121)
at com.mysql.cj.jdbc.exceptions.SQLExceptionsMapping.translateException(SQLExceptionsMapping.java:122)
at com.mysql.cj.jdbc.StatementImpl.executeInternal(StatementImpl.java:770)
at com.mysql.cj.jdbc.StatementImpl.execute(StatementImpl.java:653)
at org.jkiss.dbeaver.model.impl.jdbc.exec.JDBCStatementImpl.execute(JDBCStatementImpl.java:330)
at org.jkiss.dbeaver.model.impl.jdbc.exec.JDBCStatementImpl.executeStatement(JDBCStatementImpl.java:131)
... 12 more
Also, I could only refresh the external table by specifying partitions. When I removed the PARTITION clause and tried to refresh the whole table, I still got the exception above.
Debugged the issue locally. It seems to be related to max_hive_partitions_per_rpc. When using the default value of 5000, partitions = client.hiveClient.getPartitionsByNames(dbName, tblName, partitionNames) (in HiveMetaClient.java) couldn't get all partitions each time. I guess it's either because we have too many partitions or because the metadata for each partition is too large. After reducing the value to 100, everything worked well.
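The batching that max_hive_partitions_per_rpc controls can be sketched like this. This is illustrative Python, not the actual HiveMetaClient.java logic; fetch_batch stands in for getPartitionsByNames, and fake_rpc is an invented stub:

```python
def fetch_partitions(partition_names, fetch_batch, max_per_rpc=100):
    """Fetch partition metadata in chunks of at most max_per_rpc names,
    so no single RPC request/response grows too large."""
    results = {}
    for i in range(0, len(partition_names), max_per_rpc):
        batch = partition_names[i:i + max_per_rpc]
        results.update(fetch_batch(batch))   # one RPC per batch
    return results


# Fake RPC that returns metadata for every requested partition name.
def fake_rpc(names):
    return {n: f"meta:{n}" for n in names}

parts = [f"dt=2024-07-09/hour={h:02d}" for h in range(24)]
out = fetch_partitions(parts, fake_rpc, max_per_rpc=5)
assert len(out) == 24                        # nothing dropped across batches
assert out["dt=2024-07-09/hour=11"] == "meta:dt=2024-07-09/hour=11"
```

With a smaller batch size, each RPC carries less metadata, which matches the observation that lowering the value from 5000 to 100 avoided the missing-partition entries.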
Instructions:
1. Follow the Hudi Docker Demo quickstart: https://hudi.apache.org/docs/docker_demo
2. Modify docker-compose_hadoop284_hive233_spark244_mac_aarch64.yml to include StarRocks in the Hudi docker compose. You also need to apply https://github.com/apache/hudi/issues/8700 if it hasn't been merged yet.
3. Do all the steps in the Hudi docker compose quickstart. When you can run show tables in Beeline, the tables are ready and StarRocks should be able to connect.
4. Log in to the StarRocks container within the Hudi docker compose, then execute the SQL commands.