ClickHouse / clickhouse-jdbc-bridge

A JDBC proxy from ClickHouse to external databases
Apache License 2.0

batch insert - too slow #123

Closed mikeTWC1984 closed 2 years ago

mikeTWC1984 commented 2 years ago

I was testing an insert query like the one below (for a table using the JDBC engine).

insert into jdbc_test(a, b)
select number*3, number*5 from system.numbers n  limit 5000

I was using the mysql and mssql drivers. In both cases the results were very slow - 200-300 rows per second (using a tiny table with 2 int columns). That is about the same speed you would get inserting rows one by one with auto commit on. rewriteBatchedStatements=true was set. I did a few more tests using plain Java against mysql: with auto commit off I get 7-8K rows per second even when inserting rows one by one, and with the batch API this goes up to 80K per second.

So it sounds like the bridge is writing data in very small batches (or not using batches at all?). Regarding auto commit - it is true by default, but during batch writing I think it is more logical to start a transaction, both to avoid partial inserts (on a crash) and to speed up inserts with smaller batches.

Here is a quick and dirty snippet for testing. It assumes there is a database "test" and a table "test" with (a int, b int) columns.


import java.sql.DriverManager;
import java.sql.SQLException;

/* usage

// copy 2000 rows 1 by 1, with auto commit
java --source 11 -cp "./mysql-connector-java-8.0.26.jar" jt.java 2000

// copy 50K rows in 1000 row batches with auto commit (after each batch insertion)
java --source 11 -cp "./mysql-connector-java-8.0.26.jar" jt.java 50000 1000

// copy 50K rows in 1000 row batches, single transaction
java --source 11 -cp "./mysql-connector-java-8.0.26.jar" jt.java 50000 1000 false

*/

class Jt {
    public static void main(String[] args) {

        var rowsToCopy = args.length > 0 ? Integer.parseInt(args[0]) : 1000;
        var batchSize = args.length > 1 ? Integer.parseInt(args[1]) : 1;
        boolean autoCommit = args.length > 2 ? Boolean.parseBoolean(args[2]) : true;

        try {

            long start = System.currentTimeMillis();

            var conn = DriverManager.getConnection(
                    "jdbc:mysql://localhost/test?user=root&password=root&rewriteBatchedStatements=true");

            conn.setAutoCommit(autoCommit);

            var stmt = conn.prepareStatement("insert into test.test values(?,?)");

            if (batchSize == 1) {
                // No batching: one INSERT round trip per row.
                System.out.println("No batching");
                for (int i = 1; i <= rowsToCopy; i++) {
                    stmt.setInt(1, i * 3);
                    stmt.setInt(2, i * 5);
                    stmt.executeUpdate();
                }
            } else {
                System.out.println("batch size: " + batchSize);
                for (int i = 1; i <= rowsToCopy; i++) {
                    stmt.setInt(1, i * 3);
                    stmt.setInt(2, i * 5);
                    stmt.addBatch();
                    // Flush each full batch to the server.
                    if (i % batchSize == 0)
                        stmt.executeBatch();
                }
                // Flush the remainder (no-op if the last batch was full).
                stmt.executeBatch();
            }
            if (!autoCommit)
                conn.commit();

            long finish = System.currentTimeMillis();
            long dur = (finish - start);
            System.out.println(String.format("copied %d rows in %,d ms (%d rows per sec). Autocommit: %b", rowsToCopy,
                    dur, rowsToCopy * 1000 / dur, autoCommit));

            // Sanity check: total row count in the target table.
            var s = conn.createStatement();
            var rs = s.executeQuery("select sum(1) from test.test");
            rs.next();
            System.out.println(String.format("Current row count: %d", rs.getInt(1)));
            conn.close();

        } catch (SQLException ex) {
            // handle any errors
            System.out.println("SQLException: " + ex.getMessage());
            System.out.println("SQLState: " + ex.getSQLState());
            System.out.println("VendorError: " + ex.getErrorCode());
        }

    }
}
zhicwu commented 2 years ago

Thanks for spending time on this. There's overhead but it should not be that slow.

Anyway, below is what I got (using the quick start to spin up the test environment locally):

I think I can optimize batch insertion a bit by disabling auto commit and then committing batch by batch. However, in your case, perhaps it relates to the network between ClickHouse and the JDBC bridge, or between the JDBC bridge and MySQL?
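
For reference, a minimal sketch of that commit-per-batch idea (table name and connection details are placeholders taken from the snippet above, not the bridge's actual code): a crash loses at most the current uncommitted batch instead of leaving an arbitrary partial insert.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;

class BatchCommit {
    // Disable auto commit, then commit after each executeBatch(), so every
    // committed batch is durable and an unfinished batch rolls back cleanly.
    static void copyRows(Connection conn, int rows, int batchSize) throws SQLException {
        conn.setAutoCommit(false);
        try (PreparedStatement stmt = conn.prepareStatement("insert into test.test values(?,?)")) {
            for (int i = 1; i <= rows; i++) {
                stmt.setInt(1, i * 3);
                stmt.setInt(2, i * 5);
                stmt.addBatch();
                if (i % batchSize == 0) {
                    stmt.executeBatch();
                    conn.commit(); // durability point after each full batch
                }
            }
            stmt.executeBatch(); // flush the remainder, if any
            conn.commit();
        }
    }

    public static void main(String[] args) throws SQLException {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:mysql://localhost/test?user=root&password=root&rewriteBatchedStatements=true")) {
            copyRows(conn, 50_000, 1_000);
        }
    }
}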

zhicwu commented 2 years ago

On a side note, in case you need to transfer massive data across DCs/regions:

mikeTWC1984 commented 2 years ago

OK, thanks for the update and tips! I was doing all this in Docker on my local machine too, so those numbers looked very suspicious. I tried your docker compose setup and it works as expected now. I'll try to figure out what was wrong before. I'm planning to use it with mssql/oracle and will continue experimenting. Regarding auto commit - I would mainly be concerned about partial inserts (rather than performance). Say you run a long insert and kill the process in the middle: it will require manual clean-up afterwards.

mikeTWC1984 commented 2 years ago

I'm having some issues with Oracle:

1) The default test query is "select 1", but Oracle needs "select 1 from dual". I see it can be configured, but it would probably make more sense to use the Connection.isValid method to test connections by default (see the sketch below).

2) The JDBC bridge quotes table and schema names when you create a jdbc table. It turns out Oracle treats tableName and "tableName" as different tables. Is there an option to avoid this quoting? I believe DB2/Informix also treat quotes the same way. I think quoting should be disabled by default, since the user can add quotes as needed while creating the table.
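
A minimal sketch of that check (connection details are placeholders): JDBC 4's Connection.isValid needs no vendor-specific SQL, so the same code works for Oracle, MySQL, MSSQL, DB2, etc.

import java.sql.Connection;
import java.sql.DriverManager;

class ValidityCheck {
    public static void main(String[] args) throws Exception {
        // Placeholder Oracle connection details.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:oracle:thin:@//localhost:1521/XEPDB1", "SYSTEM", "123456")) {
            // isValid pings the server without issuing any dialect-specific query.
            System.out.println("connection valid: " + conn.isValid(5)); // 5s timeout
        }
    }
}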

I also tested mssql and had no issues there. For batch inserts it also needs an extra property (useBulkCopyForBatchInsert=true) to get reasonable performance, for example:
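
(Host and credentials below are placeholders; the property switches mssql-jdbc's batched inserts to the SQL Server bulk copy API.)

import java.sql.Connection;
import java.sql.DriverManager;

class MssqlBulk {
    public static void main(String[] args) throws Exception {
        // useBulkCopyForBatchInsert=true makes executeBatch() use bulk copy
        // under the hood, which is much faster than row-by-row inserts.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:sqlserver://localhost:1433;databaseName=test;"
                        + "user=sa;password=changeMe;encrypt=false;"
                        + "useBulkCopyForBatchInsert=true")) {
            System.out.println("connected: " + conn.isValid(5));
        }
    }
}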

zhicwu commented 2 years ago

Thanks for sharing your findings. Is there a Docker image for Oracle that I can use to reproduce the issue? I'll need to debug the code to understand what the cause could be.

mikeTWC1984 commented 2 years ago

driver: https://repo1.maven.org/maven2/com/oracle/database/jdbc/ojdbc8/21.3.0.0/ojdbc8-21.3.0.0.jar

docker: docker run -d -p 1521:1521 -e ORACLE_PASSWORD=123456 gvenzl/oracle-xe

jdbc url: jdbc:oracle:thin:@//localhost:1521/XEPDB1 user: SYSTEM pass: 123456

test ddl


CREATE TABLE system.test (a int, b int);
INSERT INTO system.test SELECT 1, 2 FROM dual;

CREATE TABLE system."test" (a int, b int);
INSERT INTO system."test" SELECT 3, 4 FROM dual;

SELECT * FROM system.test;
SELECT * FROM system."test";
mikeTWC1984 commented 2 years ago

Speaking of Docker - I see the jdbc bridge image is using Java 8. Oracle/mssql release drivers for specific JDK versions, so the newer ones are not compatible with JDK 8. Can you make a separate image version for Java 11, or is it not compatible?

mikeTWC1984 commented 2 years ago

OK, I realized quoting might not be an issue after all. If you do not quote the table name, Oracle just converts it to all uppercase, but if you use quotes it becomes case sensitive. Basically TEST and test are different tables (not test and "test" as such).

mikeTWC1984 commented 2 years ago

OK, I can confirm that as long as you keep column/table names in uppercase (for Oracle) there is no issue. Batch insert also works well without tuning any extra JDBC parameters, although I'd probably set a bigger default batch_size. I'm going to close this issue then. There are 2 other items I mentioned here - auto commit and connection testing (replacing "select 1" with the Connection.isValid method); if you think those are somewhat reasonable I will open another issue.

zhicwu commented 2 years ago

Please be aware that increasing batch_size comes at a cost: the larger the batch_size, the higher the chance the JDBC bridge will run out of memory, because it has to hold the whole batch in memory before sending it over to the target database. I'd suggest setting a reasonable number by considering row size (column count and the size of each column etc.), concurrency, SLA, and the JDBC bridge's memory configuration together. On a side note, fetch_size has a similar issue, but just for queries.
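
As a rough back-of-envelope illustration (all numbers assumed, not measured): with batch_size = 10,000, rows averaging 200 bytes, and 10 concurrent insert queries, the bridge would hold roughly 10,000 × 200 B × 10 ≈ 20 MB of raw row data at once - before any per-object JVM overhead, which can easily multiply that figure.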

As to whether to use a validation query like "select 1" or the standard JDBC API Connection.isValid(), that has nothing to do with the JDBC bridge but with HikariCP, the connection pool implementation. However, I do see the headache of tuning configuration for different databases - we should have templates defined in advance so that datasource configuration is just about host, port, database, and credentials. I didn't mention timeouts here, but I hope we can have a better way to configure those as well.
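
For context, HikariCP falls back to Connection.isValid() whenever no connectionTestQuery is configured. A minimal sketch (URL and credentials are placeholders):

import com.zaxxer.hikari.HikariConfig;
import com.zaxxer.hikari.HikariDataSource;

class PoolCheck {
    public static void main(String[] args) {
        HikariConfig config = new HikariConfig();
        config.setJdbcUrl("jdbc:oracle:thin:@//localhost:1521/XEPDB1"); // placeholder
        config.setUsername("SYSTEM");
        config.setPassword("123456");
        // With no connectionTestQuery set, HikariCP validates connections via
        // JDBC4 Connection.isValid(), which needs no vendor-specific SQL.
        // Setting one would pin a vendor-specific query instead, e.g.:
        // config.setConnectionTestQuery("select 1 from dual");
        try (HikariDataSource ds = new HikariDataSource(config)) {
            System.out.println("pool started: " + ds.getPoolName());
        }
    }
}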

Lastly, to recap the issues we discussed in this thread: