gamblewin opened this issue 1 year ago
One more question: does Hudi have any Flink API for bulk insert? The example on the official website inserts data into the Hudi table row by row. What if I want to split the source stream into windows and, when each window closes, bulk insert all of that window's data into the Hudi table?
For now, the only way I can think of to bulk insert is to use the executeSql() method of StreamTableEnvironment and execute a SQL statement built by concatenating the SQL string, as in the sketch below.
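A minimal sketch of that concatenation idea (illustration only; bufferedRows is a hypothetical stand-in for the rows collected from a closed window, and sTableEnv is the StreamTableEnvironment set up later in this thread):

import java.util.List;
import java.util.StringJoiner;

// rows buffered from one window, already rendered as SQL value tuples (hypothetical data)
List<String> bufferedRows = List.of("(1, 'a', NOW())", "(2, 'b', NOW())");
StringJoiner values = new StringJoiner(", ");
bufferedRows.forEach(values::add);
// flush the whole window with a single multi-row INSERT
sTableEnv.executeSql("insert into dept values " + values);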
It seems you are using a batch query; can you check whether the data is committed to the table (by checking for new commit meta files in the .hoodie folder)?
We can enable bulk_insert mode for Flink with the option write.operation = 'BULK_INSERT'; bulk_insert only works in batch execution mode. A minimal sketch of that setup follows.
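For example, a rough sketch of that combination, reusing the dept table and path that come up later in this thread (illustration only, not a complete program):

import org.apache.flink.api.common.RuntimeExecutionMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setRuntimeMode(RuntimeExecutionMode.BATCH); // bulk_insert only works in batch execution mode
StreamTableEnvironment tEnv = StreamTableEnvironment.create(env);

// declare the write operation in the table's WITH clause
tEnv.executeSql(
    "create table dept(\n"
    + " dept_id BIGINT PRIMARY KEY NOT ENFORCED,\n"
    + " dept_name varchar(10),\n"
    + " ts timestamp(3)\n"
    + ") with (\n"
    + " 'connector' = 'hudi',\n"
    + " 'path' = 'hdfs://localhost:9000/hudi/dept',\n"
    + " 'table.type' = 'MERGE_ON_READ',\n"
    + " 'write.operation' = 'bulk_insert'\n"
    + ")");
tEnv.executeSql("insert into dept values (1, 'a', NOW()), (2, 'b', NOW())");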
@danny0405 Thanks for replying.
Data is committed into the table, but it cannot be queried with sTableEnv.sqlQuery("select * from dept").
If I use the SQL way, i.e. inserting multiple rows in one SQL statement and executing it, is that a bulk insert or not?
StreamExecutionEnvironment sEnv = StreamExecutionEnvironment.getExecutionEnvironment();
sEnv.setRuntimeMode(RuntimeExecutionMode.BATCH); // set execution mode to batch
StreamTableEnvironment sTableEnv = StreamTableEnvironment.create(sEnv);
sEnv.setParallelism(1);
sEnv.enableCheckpointing(3000);
// SQL way: insert multiple rows in one statement without explicitly configuring the write operation as bulk_insert
sTableEnv.executeSql("insert into dept values (1, 'a', NOW()), (2, 'b', NOW())");
If the above SQL way is not bulk insert, is there any way I can bulk insert data using SQL? I know that for query SQL we can add hint options to set configurations, but when I tried adding options to the insert SQL, it did not work.
insert into dept values
(1, 'a', NOW()),
(2, 'b', NOW())
/*+
options (
'write.operation' = 'bulk_insert'
)*/
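A note on Flink hint syntax (not from the original thread): dynamic table option hints attach to a table reference rather than to the end of a statement, so a placement like the following may be worth trying (sketch only, reusing sTableEnv from the snippet above):

// same statement, but with the OPTIONS hint placed right after the table name
sTableEnv.executeSql(
    "insert into dept /*+ OPTIONS('write.operation' = 'bulk_insert') */ values\n"
    + "(1, 'a', NOW()),\n"
    + "(2, 'b', NOW())");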
I think what you really mean is using the streaming API to bulk insert data. In my understanding, bulk insert means inserting a batch of data at a time, but in the following code the source data is an unbounded stream, so how does the sink function split the source data into different batches?
Map<String, String> options = new HashMap<>();
// other option configurations ......
options.put("write.operation", "bulk_insert");
DataStream<RowData> dataStream = sEnv.addSource(...);
HoodiePipeline.Builder builder = HoodiePipeline.builder("dept")
    .column(...)
    .options(options);
builder.sink(dataStream, false);
You should define the bulk_insert option while initializing the table with SQL:
String createTableSql = "create table dept(\n" +
    " dept_id BIGINT PRIMARY KEY NOT ENFORCED,\n" +
    " dept_name varchar(10),\n" +
    " ts timestamp(3)\n" +
    ")\n" +
    "with (\n" +
    " 'connector' = 'hudi',\n" +
    " 'path' = 'hdfs://localhost:9000/hudi/dept',\n" +
    " 'table.type' = 'MERGE_ON_READ',\n" +
    " 'write.operation' = 'bulk_insert'\n" + // the bulk_insert write operation declared at table creation
    ")";
It's weird that you can't query the data; is there any exception thrown?
@danny0405 Thanks. I have reviewed the documentation on the Hudi website regarding bulk insert, and it states that bulk insert "implements a sort-based data writing algorithm". Does that mean bulk insert and batch insert are actually not the same concept? For example, if I insert 100 records into a Hudi table in one SQL statement, bulk insert does not optimize the performance of the batch insertion itself; it simply applies a sorting operation during the data writing process?
I checked the web UI, there's no exception. (Screenshots attached: job graph, exception page.)
@gamblewin From the runtime DAG, it seems you are using a MOR table with the upsert operation. For bulk_insert, it is expected to be executed in Flink batch runtime mode, and write.operation should be set to bulk_insert. A rough end-to-end sketch follows.
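Putting those pieces together with the HoodiePipeline code shown earlier, a minimal sketch might look like the following (the RowData source is left as a placeholder just like in the earlier snippet; the column list, primary key, path, and table type are taken from the dept DDL in this thread, and the job name is made up):

import org.apache.flink.api.common.RuntimeExecutionMode;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.data.RowData;
import org.apache.hudi.util.HoodiePipeline;

import java.util.HashMap;
import java.util.Map;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setRuntimeMode(RuntimeExecutionMode.BATCH); // bulk_insert expects batch runtime mode

Map<String, String> options = new HashMap<>();
options.put("path", "hdfs://localhost:9000/hudi/dept");
options.put("table.type", "MERGE_ON_READ");
options.put("write.operation", "bulk_insert"); // instead of the default upsert

DataStream<RowData> dataStream = ...; // a *bounded* RowData source (placeholder, as in the snippet above)

HoodiePipeline.Builder builder = HoodiePipeline.builder("dept")
    .column("dept_id BIGINT")
    .column("dept_name varchar(10)")
    .column("ts timestamp(3)")
    .pk("dept_id")
    .options(options);
builder.sink(dataStream, true); // second argument: the input stream is bounded
env.execute("bulk insert dept");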
Describe the problem you faced
I'm trying to use the Flink Table API sqlQuery to read data from a Hudi table, but it's not working. Am I doing it wrong, or does Hudi not support querying data this way?
Code
Environment Description
Hudi version: 1.12.0
Hadoop version: 3.1.3
Flink version: 1.13.6