MonetDB / MonetDB

This is the official mirror of the MonetDB Mercurial repository. Please note that we do not accept pull requests on GitHub. The regression test results can be found on the MonetDB Testweb: https://www.monetdb.org/testweb/web/status.php. For contributions please see: https://www.monetdb.org/documentation/dev-guide/
http://www.monetdb.org

Insertion is too slow #7151

Closed aashish0074 closed 3 years ago

aashish0074 commented 3 years ago

Hi,

I'm inserting 74500 records in batches through the JDBC driver and it is taking around 15 minutes. Why is the insertion so slow? We don't want to insert data through CSV.

PedroTadim commented 3 years ago

It is difficult to figure out with so many possible causes. A trace of the insert would be a good starting point for figuring out the issue.
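For example, a rough sketch of how such a trace could be obtained from JDBC (assuming MonetDB's TRACE statement prefix; the table, values, and connection settings below are made up):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class TraceInsert {
    public static void main(String[] args) throws Exception {
        // hypothetical connection settings; adjust to your own database
        try (Connection conn = DriverManager.getConnection(
                "jdbc:monetdb://localhost:50000/demo", "monetdb", "monetdb");
             Statement stmt = conn.createStatement()) {
            // prefixing a statement with TRACE asks the server for execution timings
            boolean hasResult = stmt.execute(
                    "TRACE INSERT INTO mytable VALUES (1, 'a', 'b')");
            if (hasResult) {
                try (ResultSet rs = stmt.getResultSet()) {
                    while (rs.next()) {
                        // print whatever the trace returns (typically timing + statement)
                        System.out.println(rs.getObject(1) + "\t" + rs.getObject(2));
                    }
                }
            }
        }
    }
}
```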

yzchang commented 3 years ago
aashish0074 commented 3 years ago

Hi @yzchang

- Inserting records without auto-commit mode (tried inserting all the records at once and in batches of 5000 as well)
- Ubuntu
- MonetDB Database Server Toolkit v11.39.17 (Oct2020-SP5)
- Yes, it is the same machine where I'm running my code with MonetDB

aashish0074 commented 3 years ago

@PedroTadim How exactly can we trace the insertion? I have read some documents and found that bulk insertion is much faster through CSV files. Our use case is that we need to insert approx. 40 million records into MonetDB on an hourly basis.

aashish0074 commented 3 years ago

++ Our use case is that we need to insert approx. 40 million records into multiple tables on an hourly basis.

yzchang commented 3 years ago
aashish0074 commented 3 years ago

@yzchang

- Already tried the trace command for a single row; I'm inserting 74500 records dynamically through my code, so I can't trace them all that way.
- Timing cached reads: 35310 MB in 2.00 seconds = 17686.75 MB/sec; Timing buffered disk reads: 2950 MB in 3.00 seconds = 982.86 MB/sec
- There are no concurrent queries.
- For larger data volumes, is the CSV approach recommended?

yzchang commented 3 years ago

> For larger data volumes, is the CSV approach recommended?

Absolutely.
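As a rough sketch (hypothetical table name, file path, and delimiters; the CSV file must be readable by the MonetDB server), a bulk load via COPY INTO could be issued from JDBC like this:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class CsvBulkLoad {
    public static void main(String[] args) throws Exception {
        // hypothetical connection settings and file path; adjust to your setup
        try (Connection conn = DriverManager.getConnection(
                "jdbc:monetdb://localhost:50000/demo", "monetdb", "monetdb");
             Statement stmt = conn.createStatement()) {
            // COPY INTO loads a server-side CSV file in one bulk operation;
            // the optional record count lets the server pre-allocate space
            stmt.execute(
                "COPY 74500 RECORDS INTO mytable " +
                "FROM '/tmp/mytable.csv' " +
                "USING DELIMITERS ',', '\\n', '\"'");
        }
    }
}
```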

aashish0074 commented 3 years ago

@yzchang Is this slowness with the JDBC approach a known thing, or am I the only one facing it? I have tried to insert the same data into PostgreSQL and it took only 0.9 seconds.

yzchang commented 3 years ago

How long does it take if you do the same insert using mclient? Then we'll know whether it's a JDBC problem or an INSERT INTO problem.

yzchang commented 3 years ago

PS: if you don't want to use CSV loading, you can also consider binary COPY INTO. That's even (much) faster than CSV COPY INTO, but it requires more work on the user side.

njnes commented 3 years ago

How wide is the table you're inserting into (i.e. how many attributes does it have)?

aashish0074 commented 3 years ago

@njnes Only 3 columns

njnes commented 3 years ago

74500 rows in 15 minutes is way too slow. Although we improved a lot on the to-be-released Jul branch, even the older Oct release should do 75k rows in much less time. Could you try with mclient instead of JDBC to see the bare performance? What are the types of these 3 columns (just strings and numbers)?

aashish0074 commented 3 years ago

@njnes One is bigInt and the other 2 are text columns.

For the performance benchmark, how can I run the insertion of 74500 records with mclient? Is there any way to do it, as I'm inserting all the records dynamically through JDBC?

njnes commented 3 years ago

mclient wants a single (large) file (with start transaction/commit). One easy way to create this is to use msqldump -N -t tablename dbname. This will output the table including the CREATE TABLE statement and start transaction/commit. So just removing the CREATE TABLE would give the correct test file.
Then test with a clean database and the recreated empty table.

aashish0074 commented 3 years ago

@njnes We have already tested the large-file (CSV) route with mclient: loading the 74500 records from a CSV file takes less than 1 second, but with JDBC, inserting the same records in batches takes 15 minutes.

mvdvm commented 3 years ago

In order to analyze what is going on in your Java program we need to have the source code which details how you do the inserts using JDBC methods. Can you please send us the relevant source code (including connection, statements, etc.) for analysis? Can you also tell me whether you turned off autocommit mode and do explicit commit control, e.g. after all rows have been inserted?

Also I need to know which version of the MonetDB JDBC driver you are using. The latest version (monetdb-jdbc-3.1.jre8.jar) is available from: https://www.monetdb.org/downloads/Java/

You may also look at https://www.monetdb.org/Documentation/ServerAdministration/LoadingBulkData. It explains the different possibilities for inserting data into MonetDB.

aashish0074 commented 3 years ago

@mvdvm Yes, we are using the latest version of the MonetDB JDBC driver.

Configuration and source code as follows:

1. Application context bean:
<bean id="dataSourceMonet" class="org.springframework.jdbc.datasource.DriverManagerDataSource">
        <property name="url" value="jdbc:monetdb://localhost:54321/aashishTest?so_timeout=600000" />
        <property name="username" value="monetdb" />
        <property name="password" value="monetdb" />
    </bean>
2. Set auto-commit to false:
conn = (MonetConnection) jdbcTemplate.getDataSource().getConnection();
conn.setAutoCommit(false);
3. Source code for insertion:
PreparedStatement ps = conn.prepareStatement(sql);
int batchCount = 0;
for (int i = 0; i < data.size(); i++) {
    Map<String, Object> dataMap = data.get(i);
    int param = 1;
    for (String colName : columnAndDataTypes.keySet()) {
        String originalColName = colName;
        if ("functions".equalsIgnoreCase(colName)) {
            colName = "Function";
        }
        ps.setObject(param, dataMap.get(colName),
                ColumnDataTypes.getDataTypeCode(columnAndDataTypes.get(originalColName)));
        param++;
    }
    ps.addBatch();
    ++batchCount;
    // flush the batch every batchSize rows and once more at the end of the data
    if (batchCount % batchSize == 0 || batchCount == data.size()) {
        ps.executeBatch();
        ps.clearBatch();
    }
}
ps.close();

conn.commit();
4. We have tried to insert in multiple ways; below is another source code example:
jdbcTemplate.batchUpdate(sql, data, batchSize,
        new ParameterizedPreparedStatementSetter<Map<String, Object>>() {
            @Override
            public void setValues(PreparedStatement ps, Map<String, Object> argument)
                    throws SQLException {
                int j = 1;
                for (String colName : columnAndDataTypes.keySet()) {
                    if ("functions".equalsIgnoreCase(colName)) {
                        colName = "Function";
                    }
                    ps.setObject(j, argument.get(colName));
                    j++;
                }
            }
        });

mvdvm commented 3 years ago

Hi,

We have analysed why it is taking so much time and found that there is a problem in the Oct2020 release server when it is used with prepared insert statements, as in your use case.

Luckily we have also found a workaround for Oct2020 releases. Please execute the SQL statement SET sys.optimizer = 'minimal_pipe' before doing the insertions. So in Java: Statement stmt = conn.createStatement(); stmt.execute("SET sys.optimizer = 'minimal_pipe'");

After all the insertions are done you can set the optimizer setting back to its original value via: stmt.execute("SET sys.optimizer = 'default_pipe'"); See: https://www.monetdb.org/Documentation/SQLreference/SQLSyntaxOverview#SET_OPTIMIZER
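Put together with the batch-insert code above, the workaround could look roughly like this (a sketch; `conn` is the already opened java.sql.Connection, and the comment marks where the existing batch code goes):

```java
// sketch of the Oct2020 workaround wrapped around the existing batch insert
try (Statement stmt = conn.createStatement()) {
    // switch to the minimal optimizer pipeline before the prepared-statement inserts
    stmt.execute("SET sys.optimizer = 'minimal_pipe'");

    // ... run the PreparedStatement batches shown earlier and conn.commit() ...

    // restore the default pipeline so later (query) workloads are fully optimized
    stmt.execute("SET sys.optimizer = 'default_pipe'");
}
```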

This optimizer setting should improve the insert performance, especially for a large number of inserted rows. Once you have tested it, let us know if this issue can be closed.

FYI: the issue has already been corrected in the upcoming Jul2021 release (which is not yet released). The new Jul2021 release will also have significantly improved insert performance, especially when sys.optimizer = 'minimal_pipe' is set. So add it to your program for future MonetDB releases as well.

Tip: if your table has table constraints (like a primary key, unique key and/or foreign key), it helps to remove those constraints before you insert large amounts of data into the table and add them back after the inserts are completed. Use SQL: ALTER TABLE ... DROP CONSTRAINT ... and ALTER TABLE ... ADD CONSTRAINT ... PRIMARY KEY (col1) for doing this. See: https://www.monetdb.org/Documentation/SQLreference/SQLSyntaxOverview#ALTER_TABLE_ADD_CONSTRAINT
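For example (hypothetical table, constraint, and column names; `conn` as above), the drop/re-add cycle could look like this:

```java
// sketch: drop a constraint before a large load and re-add it afterwards
try (Statement stmt = conn.createStatement()) {
    stmt.execute("ALTER TABLE mytable DROP CONSTRAINT mytable_pk");

    // ... bulk-insert the data here ...

    stmt.execute("ALTER TABLE mytable ADD CONSTRAINT mytable_pk PRIMARY KEY (col1)");
}
```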

PS: when using monetdb-jdbc-3.1.jre8.jar (or monetdb-jdbc-3.0.jre8.jar) you no longer need to call ps.clearBatch() after each ps.executeBatch(); this has already been corrected in monetdb-jdbc-3.0.jre8.jar and newer versions.

aashish0074 commented 3 years ago

@mvdvm Thanks for the update, we will try this workaround. Can you please let us know when the new Jul2021 release is coming?