ballerina-platform/ballerina-library

Proposal: Introduce a new API to execute batch queries into multiple batches #4141

kalaiyarasiganeshalingam commented 1 year ago

Summary

The existing batch execution API has no way to split the input queries into multiple batches of a given size, which can lead to memory limitation issues when executing very large data sets. This proposal therefore introduces an API that executes the given queries in multiple batches.

Goals

Enable executing a large set of batch queries in multiple batches of a configurable size, so that memory usage stays bounded.

Motivation

As mentioned in the summary, the existing API can run into memory limitations when executing large batches and offers no way to avoid them. Splitting the queries into multiple batches of a bounded size overcomes this; the motivation is therefore to avoid memory limitation issues when executing batch data.

Description

API Definition:

```ballerina
# Executes the given set of SQL queries by splitting them into multiple batches.
# Only the metadata of the execution is returned (not the results from the query).
# If one of the commands in the batch fails, an `sql:BatchExecuteError` is returned.
# However, the driver may or may not continue to process the remaining commands
# in the batch after a failure.
#
# + sqlQueries - The SQL query with multiple sets of parameters
# + batchSize - The maximum number of queries executed per batch. The default is 1000.
# + return - Metadata of the query execution as an `sql:ExecutionResult[]` or an `sql:Error`
remote isolated function executeBatches(ParameterizedQuery[] sqlQueries, int batchSize = 1000)
        returns ExecutionResult[]|Error;
```
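
A minimal usage sketch of the proposed API (the MySQL client, the `Employee` table, and the sample data below are illustrative assumptions, not part of the proposal):

```ballerina
import ballerina/sql;
import ballerinax/mysql;
import ballerinax/mysql.driver as _;

public function main() returns error? {
    // Illustrative connection parameters; any sql-compatible client would do.
    mysql:Client dbClient = check new (host = "localhost", user = "root",
            password = "password", database = "company");

    // A large set of rows to insert; in practice this could be many thousands.
    var employees = [
        {firstName: "Ann", lastName: "Frost"},
        {firstName: "Ben", lastName: "Cole"}
    ];

    // One parameterized query per row.
    sql:ParameterizedQuery[] insertQueries =
        from var emp in employees
        select `INSERT INTO Employee (first_name, last_name)
                VALUES (${emp.firstName}, ${emp.lastName})`;

    // Proposed API: execute in batches of at most 500 queries each.
    sql:ExecutionResult[] results =
        check dbClient->executeBatches(insertQueries, batchSize = 500);

    check dbClient.close();
}
```

With `batchSize = 500`, the queries are sent to the database in chunks of at most 500, so memory use stays bounded regardless of how many rows are inserted.
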
niveathika commented 1 year ago

Can't we have `batchSize` as a client configuration?

kalaiyarasiganeshalingam commented 1 year ago

We can do that. However, the same number of queries can occupy very different amounts of memory depending on the data types and values involved, and the appropriate batch size depends on that data size. If we configure it as a client configuration, we can't vary it per call according to the data.
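
To illustrate that point with a hypothetical fragment (`dbClient`, `documentInserts`, and `auditLogInserts` are assumed, not from the proposal): rows carrying large payloads warrant a much smaller batch than narrow rows, so the size is chosen per call:

```ballerina
// Hypothetical query arrays prepared elsewhere: documentInserts carries
// large BLOB payloads, auditLogInserts has a few small columns.

// Large rows: keep batches small to bound memory use per round trip.
sql:ExecutionResult[] docResults =
    check dbClient->executeBatches(documentInserts, batchSize = 50);

// Narrow rows: much larger batches are safe for the same memory budget.
sql:ExecutionResult[] auditResults =
    check dbClient->executeBatches(auditLogInserts, batchSize = 5000);
```
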

daneshk commented 1 year ago

There are two use cases we need to consider:

1. The user wants the same batch size for all `batchExecute` calls within the application.
   - In this case, having it as a client configuration is easier: the user configures it in one place, and it applies everywhere (see the sketch after this list).
2. The user wants a different batch size for each `batchExecute` call within the application.
   - In this case, having it as a function parameter is useful.
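
One way to get the ergonomics of the first option without a library change is an application-level `configurable` variable funneled through a single helper; this sketch is not part of the proposal and assumes the proposed `executeBatches` API:

```ballerina
import ballerina/sql;
import ballerinax/mysql;

// Application-wide batch size, overridable via Config.toml.
configurable int defaultBatchSize = 1000;

// Every call site goes through one helper, so the configured size
// applies everywhere while the per-call parameter is still used.
function persistAll(mysql:Client dbClient, sql:ParameterizedQuery[] queries)
        returns sql:ExecutionResult[]|sql:Error {
    return dbClient->executeBatches(queries, batchSize = defaultBatchSize);
}
```
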

Both are valid cases. @kalaiyarasiganeshalingam, can you check how the Spring Framework handles it if the batch size is not specified? Do they go with a hard-coded default batch size, or get it from somewhere else, such as a config?

[1] https://docs.spring.io/spring-framework/docs/current/javadoc-api/org/springframework/jdbc/core/JdbcTemplate.html#batchUpdate%28java.lang.String,java.util.Collection,int,org.springframework.jdbc.core.ParameterizedPreparedStatementSetter%29

kalaiyarasiganeshalingam commented 1 year ago

According to the Spring Framework [1][2], the batch size is configured on the method call. If `batchUpdate()` is called without a batch size, the batch size defaults to the length of `batchArgs`.

[1] https://docs.spring.io/spring-framework/docs/3.0.0.M4/reference/html/ch12s04.html
[2] https://mkyong.com/spring/spring-jdbctemplate-batchupdate-example/
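
For comparison, under the proposed API that Spring default (batch size = number of statements) would correspond to passing the array length explicitly, which degenerates to a single batch, i.e. the behaviour of the existing `batchExecute`. A fragment reusing `dbClient` and `insertQueries` from the earlier sketch:

```ballerina
// Equivalent of Spring's default: a single batch containing every query.
sql:ExecutionResult[] results =
    check dbClient->executeBatches(insertQueries, batchSize = insertQueries.length());
```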