Open nicholasxu opened 6 months ago
@nicholasxu It may be due to caching, I guess. Can you restart Hive and see if you can query the data using select * from table?
Thx, I restarted all Hive services and set hive.stats.autogather=false, but it still returns nothing.
@nicholasxu Can you please also try listing all the column names once instead of select *?
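For example, an explicit projection instead of `select *` (column names taken from the t1 DDL later in this issue; the `_rt` suffix is the Hive-synced real-time table):

```sql
-- Explicit column list instead of select *,
-- using the columns from the t1 DDL in this issue:
SELECT mid, uuid, name, age, ts, part FROM t1_rt;
```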
- select all column names
With a 'sort by' clause, it always generates a Tez job. I don't know if that affects the result.
@nicholasxu That may well be the reason. With Hive, a simple select * doesn't trigger a Tez job. You can try adding the condition WHERE 1 = 1, which should trigger one.
The Flink job writes log files directly. Did compaction happen? Maybe no parquet files have been generated yet and select * reads directly from parquet.
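A quick way to test the suggestion above in Hive (`hive.fetch.task.conversion` is a standard Hive setting; whether it changes the outcome here is an assumption):

```sql
-- A trivially true predicate may stop Hive's fetch-task optimization
-- from answering the query without launching a Tez job:
SELECT * FROM t1_rt WHERE 1 = 1;

-- Disabling fetch-task conversion outright forces a job even for simple selects:
SET hive.fetch.task.conversion=none;
SELECT * FROM t1_rt;
```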
The condition 'WHERE 1 = 1' doesn't trigger a job either, and yep, no compaction has happened. What I want to know is: isn't it abnormal that the RT table (snapshot and incremental queries) only reads from parquet?
It may be more related to how Hive works and when it submits a Tez job. We know the underlying data is there for sure, since we get it once a Tez job runs.
In the ideal scenario, the _rt table should read it from the log files.
https://www.yuque.com/yuzhao-my9fz/kb/kgv2rb (Hive 3.1.0 compatibility). Its
LOG_FILE_PATTERN
may be out of date; see the latest LOG_FILE_PATTERN in the Hudi code.
That's logical.
Thx, that may be the cause!
I have another question about Flink async compaction in the same process: does it need additional configuration in the DDL, or is it supported by default?
Supported by default.
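For reference, a sketch of the relevant Hudi Flink table options (option names from the Hudi Flink configuration; they only tune the defaults, none are required to enable async compaction):

```sql
-- Async compaction is on by default for Flink MOR writers;
-- these options tune when it triggers, they do not enable it:
CREATE TABLE t1 (
  mid BIGINT PRIMARY KEY NOT ENFORCED,
  uuid VARCHAR(20),
  name VARCHAR(10),
  age INT,
  ts BIGINT,
  part INT
) PARTITIONED BY (part) WITH (
  'connector' = 'hudi',
  'table.type' = 'MERGE_ON_READ',
  'compaction.async.enabled' = 'true',  -- already the default
  'compaction.delta_commits' = '5'      -- compact after this many delta commits
);
```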
@danny0405 I tried it; it is indeed supported by default. I'm confused about why the log files aren't deleted after compaction, though!
In my opinion, log files should be deleted immediately after compaction to reduce the number of files. Is that right?
@nicholasxu They are deleted as part of the cleaning process. We do need them for point-in-time queries.
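How long the old log files are retained before the cleaner removes them can be tuned from the table options; a sketch assuming the clean.* options from the Hudi Flink configuration (the values shown are illustrative, not the defaults):

```sql
-- In the table's WITH (...) clause:
'clean.async.enabled' = 'true',   -- run cleaning asynchronously (the default)
'clean.retain_commits' = '10'     -- file slices older than the last 10 commits
                                  -- become eligible for deletion
```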
Ok, thx!
Describe the problem you faced
I use Flink to write a Hudi MOR table. Flink reads the table normally, but both the RO table and the RT table return nothing when queried with Hive.
To Reproduce
Steps to reproduce the behavior:
1. Create the catalog and switch to it:
```sql
CREATE CATALOG hudi_hive_catalog WITH (
  'type' = 'hudi',
  'catalog.path' = 'cosn://bigdata-xxx/user/hive/warehouse',
  'hive.conf.dir' = '/usr/local/service/hive/conf',
  'mode' = 'hms',
  'table.external' = 'true',
  'default-database' = 'hudi_default'
);
USE CATALOG hudi_hive_catalog;
```
2. Create the MOR table:
```sql
CREATE TABLE t1 (
  mid BIGINT PRIMARY KEY NOT ENFORCED,
  uuid VARCHAR(20),
  name VARCHAR(10),
  age INT,
  ts BIGINT,
  part INT
) PARTITIONED BY (part) WITH (
  'connector' = 'hudi',
  'path' = 'cosn://bigdata-xxx/user/hive/warehouse/hudi_default.db/t1',
  'table.type' = 'MERGE_ON_READ',
  'hive_sync.enable' = 'true',
  'hive_sync.mode' = 'hms',
  'hive_sync.metastore.uris' = 'thrift://xxx:9083'
);
```
3. Insert some data with Flink:
```sql
INSERT INTO t1 VALUES
  (1,'334e26e9-8355-45cc-97c6-c31daf0df330','nick', 18,1695159649087,20230108),
  (2,'334e26e9-8355-45cc-97c6-c31daf0df330','jack', 18,1695159649087,20230109);
```
4. Read the data with Flink and get the right records: `SELECT * FROM t1;`
![image](https://github.com/apache/hudi/assets/12593964/30cb73e2-22a1-414f-8035-b52f5ad9e6ac)
5. Read the data with a plain `select *` in Hive and get nothing:
```sql
select * from t1_rt;
select * from t1_ro;
```
![image](https://github.com/apache/hudi/assets/12593964/da205ace-79a8-4a98-98f2-5d37ccf91c56)
6. Read the data with an 'order by' clause in Hive and get the right results:
```sql
select * from t1_rt order by mid;
select * from t1_ro order by mid;
```
![image](https://github.com/apache/hudi/assets/12593964/eeab5e22-8ef9-4112-8a4e-2389489a2373)
8. Hudi files on COS:
![image](https://github.com/apache/hudi/assets/12593964/8871477d-7713-499f-ac47-2b8644f46220)
9. Testing a COW table works fine.
Expected behavior
Reading nothing from the RO table may be OK, because the table only has a log file and no parquet base files, but reading nothing from the RT table is confusing. Your help is appreciated.
Environment Description
Hudi version: 0.14.1
Spark version: 3.2.2
Hive version: 3.1.3
Hadoop version: 3.2.2
Storage (HDFS/S3/GCS..): COS on Tencent Cloud
Running on Docker? (yes/no):