apache / hudi

Upserts, Deletes And Incremental Processing on Big Data.
https://hudi.apache.org/
Apache License 2.0

[SUPPORT] Trino returns 0 rows when reading Hudi tables written by Flink 1.16 #8870

Open · Riddle4045 opened this issue 1 year ago

Riddle4045 commented 1 year ago


**Describe the problem you faced**

TL;DR: Trino returns 0 records from the Hudi table even though I can see data in the object store.

I am writing Hudi tables to ABFS; here is the reduced code:

        DataStream<RowData> fares = env.addSource(new TaxiFareGenerator()).map(
                event -> GenericRowData.of(
                        event.getRideId(),
                        event.getDriverId(),
                        event.getTaxiId(),
                        event.getStartTime(),
                        event.getTip(),
                        event.getTolls(),
                        event.getTotalFare()//,
                        //    event.getPaymentType()
                ));

        String targetTable = "TaxiFare";
        String outputPath = String.join("/", basePath, "hudi4");
        Map<String, String> options = new HashMap<>();

        options.put(FlinkOptions.PATH.key(), outputPath);
        options.put(FlinkOptions.TABLE_TYPE.key(), HoodieTableType.MERGE_ON_READ.name());

        HoodiePipeline.Builder builder = HoodiePipeline.builder(targetTable)
                .column("rideId BIGINT")
                .column("driverId BIGINT")
                .column("taxiId BIGINT")
                .column("startTime BIGINT")
                .column("tip FLOAT")
                .column("tolls FLOAT")
                .column("totalFare FLOAT")
                .pk("driverId")
                .options(options);

        builder.sink(fares, false); // bounded = false: run as an unbounded streaming sink
        env.execute("Hudi Table");
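One detail that is easy to miss and not shown in the snippet: the Flink Hudi writer finalizes delta commits when a checkpoint completes, so checkpointing has to be enabled on the environment before env.execute() for any data to be committed at all. A minimal sketch (the 30-second interval is an arbitrary example value):

        // Hudi's Flink writer commits on checkpoint completion; without this,
        // no delta commits are finalized and readers see an empty table.
        env.enableCheckpointing(30_000L); // interval in ms (example value)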

I sync these tables to HMS using the Hudi Hive sync tool (HiveSyncTool):

2023-06-01T13:15:09,757 INFO [main] org.apache.hudi.hive.HiveSyncTool - Sync complete for **hudi5_ro**
2023-06-01T13:15:09,757 INFO [main] org.apache.hudi.hive.HiveSyncTool - Trying to sync hoodie table hudi5_rt with base path abfs://flink@****.dfs.core.windows.net/flink/click_events/hudi4 of type MERGE_ON_READ
2023-06-01T13:15:11,977 INFO [main] org.apache.hudi.hive.HiveSyncTool - Sync table hudi5_rt for the first time.
2023-06-01T13:15:17,712 INFO [main] org.apache.hudi.hive.HiveSyncTool - Last commit time synced was found to be null
2023-06-01T13:15:17,712 INFO [main] org.apache.hudi.hive.HiveSyncTool - Sync all partitions given the last commit time synced is empty or before the start of the active timeline. Listing all partitions in abfs://flink@****.dfs.core.windows.net/flink/click_events/hudi4, file system: AzureBlobFileSystem{uri=abfs://flink@****.dfs.core.windows.net, user='ispatw', primaryUserGroup='ispatw'}
2023-06-01T13:15:24,755 INFO [main] org.apache.hudi.hive.HiveSyncTool - Sync complete for **hudi5_rt**
2023-06-01T13:15:24,761 INFO [main] org.apache.hadoop.hive.metastore.HiveMetaStoreClient - Closed a connection to metastore, current connections: 0
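For reference, the same sync can alternatively be configured on the Flink job itself so it runs after each commit, instead of invoking the standalone tool. A rough sketch that would slot into the options map in the snippet above; the constant names follow Hudi's FlinkOptions class and the metastore URI is a placeholder, so verify both against the Hudi version in use:

        // Hypothetical in-job Hive sync setup; values are placeholders.
        options.put(FlinkOptions.HIVE_SYNC_ENABLED.key(), "true");
        options.put(FlinkOptions.HIVE_SYNC_MODE.key(), "hms"); // sync via the metastore client
        options.put(FlinkOptions.HIVE_SYNC_METASTORE_URIS.key(), "thrift://<metastore-host>:9083");
        options.put(FlinkOptions.HIVE_SYNC_DB.key(), "default");
        options.put(FlinkOptions.HIVE_SYNC_TABLE.key(), "taxifare");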

I can see data streaming into the ABFS location (screenshot attached).

When I try to query it using Trino, my tables return no records (screenshot attached).

**Expected behavior**

Trino queries against the synced tables should return the records that were written to ABFS.


Riddle4045 commented 1 year ago

Possibly related to https://github.com/apache/hudi/issues/8038. @codope, could you please help me understand how to configure the table for read-optimized queries? Or is this something the Hudi sync tool should handle out of the box? I'm not sure why I'm not seeing any rows back.
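Some background that may help frame the question: for a MERGE_ON_READ table the sync registers two views, <table>_ro (read optimized, serving only compacted Parquet base files) and <table>_rt (real time, merging base files with log files). That matches the hudi5_ro and hudi5_rt entries in the sync log above, and it means the _ro view is expected to stay empty until the first compaction produces base files. A minimal sketch of checking both views through the Trino JDBC driver; the coordinator host, catalog, and schema are hypothetical, and the io.trino:trino-jdbc dependency is assumed:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class TrinoReadCheck {
        public static void main(String[] args) throws Exception {
            // Hypothetical connection string; adjust host/catalog/schema.
            String url = "jdbc:trino://trino-coordinator:8080/hive/default";
            try (Connection conn = DriverManager.getConnection(url, "trino-user", null);
                 Statement stmt = conn.createStatement()) {
                // _ro reads only compacted base files; on a fresh MOR table it
                // returns zero rows until compaction runs. _rt merges in logs.
                for (String table : new String[] {"hudi5_ro", "hudi5_rt"}) {
                    try (ResultSet rs = stmt.executeQuery("SELECT count(*) FROM " + table)) {
                        rs.next();
                        System.out.println(table + ": " + rs.getLong(1) + " rows");
                    }
                }
            }
        }
    }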

danny0405 commented 1 year ago

Compaction is executed asynchronously by default, every 5 delta commits on the table. Did you have a chance to check whether any Parquet files exist yet?
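For reference, those defaults correspond to the following Flink options, shown as a sketch that would slot into the options map in the original snippet; constant names follow Hudi's FlinkOptions class, so verify them against the version in use:

        // Defaults made explicit: async compaction on, triggered every 5 delta commits.
        options.put(FlinkOptions.COMPACTION_ASYNC_ENABLED.key(), "true");
        options.put(FlinkOptions.COMPACTION_DELTA_COMMITS.key(), "5");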

Riddle4045 commented 1 year ago

> Compaction is executed asynchronously by default, every 5 delta commits on the table. Did you have a chance to check whether any Parquet files exist yet?

@danny0405 No, there were 6 delta commits in total and no compaction. Is there a setting to toggle it? Maybe it's turned off by default in Flink? I can also share the .hoodie folder if it helps you understand what's going on.

danny0405 commented 1 year ago

Can you show me the running job's DAG from the Flink web UI? That would help a lot.

Riddle4045 commented 1 year ago

@danny0405 I tried a repro, and it looks like it compacts now after 5 commits! I guess my cluster might have been flaky during the last attempt. I can also read the tables now! (screenshot attached)

danny0405 commented 1 year ago

That's cool. Feel free to give us feedback if you run into any new issues.