ClickHouse / clickhouse-jdbc-bridge

A JDBC proxy from ClickHouse to external databases
Apache License 2.0

Out of memory with default settings while extracting data from 200M rows table in PostgreSQL #161

Open ruslanen opened 2 years ago

ruslanen commented 2 years ago

Hi, I'm having trouble with JDBC Bridge. I am trying to execute a CREATE TABLE query in ClickHouse that reads from a PostgreSQL data source through the jdbc() table function:

CREATE TABLE `16cd958e64e341429b41226d986201d6_etl_intermediate_factonlinesales`
ENGINE = MergeTree
ORDER BY tuple()
AS
SELECT onlinesaleskey,datekey,productkey,promotionkey,customerkey,salesordernumber,salesorderlinenumber,salesquantity,salesamount,returnquantity,returnamount,discountquantity,discountamount,totalcost,unitcost,unitprice
FROM 
(
    SELECT * FROM jdbc('jdbc:postgresql://host:5432/microsoft_contoso_integration_200M?user=postgres&password=&socket_timeout=1000000&max_block_size=1024', 'select * from factonlinesales')
);

but I get the following error in JDBC Bridge:

[2022-10-23 10:44:43] [INFO   ] Executing query(schema=[]):
select * from factonlinesales 
JVMDUMP039I Processing dump event "systhrow", detail "java/lang/OutOfMemoryError" at 2022/10/23 10:46:48 - please wait.
Exception in thread "vertx-blocked-thread-checker" Exception in thread "SimplePauseDetectorThread_0" Exception in thread "vertx-blocked-thread-checker" JVMDUMP007I JVM Requesting Tool dump using 'kill -9 13'
JVMDUMP011I Tool dump created process 103
./docker-entrypoint.sh: line 5:    13 Killed                  java -XX:+UseContainerSupport -XX:+IdleTuningCompactOnIdle -XX:+IdleTuningGcOnIdle -Xdump:none -Xdump:tool:events=systhrow+throw,filter=*OutOfMemoryError,exec="kill -9 %pid" -Djava.util.logging.config.file=$JDBC_BRIDGE_HOME/logging.properties -Dnashorn.args=--language=es6 ${JDBC_BRIDGE_JVM_OPTS:=""} -jar clickhouse-jdbc-bridge-shaded.jar

I tried different values of the max_block_size setting, but it didn't help. Here are my questions:

  1. How can I avoid the OOM? Can I do this without increasing JVM_ARGS?
  2. Should JDBC Bridge use streaming when processing queries against a large data source like mine? (https://github.com/ClickHouse/clickhouse-jdbc-bridge/blob/7faf5868040238251bd02f54eec4a8a50c0731fb/src/main/java/com/clickhouse/jdbcbridge/impl/JdbcDataSource.java#L490)
  3. Which settings do I need to tune in JDBC Bridge?

My environment: Ubuntu 18.04, 12 GB RAM, 8 CPUs

clickhouse/jdbc-bridge:2.1.0, clickhouse/clickhouse-server:22.7.2.15

JDBC Bridge container with default JVM settings.

PostgreSQL datasource table:

select count(*) from factonlinesales

200 000 000

SELECT pg_size_pretty( pg_total_relation_size('factonlinesales') );

36 GB

UnamedRus commented 2 years ago

https://github.com/ClickHouse/clickhouse-jdbc-bridge/blob/master/src/main/java/com/clickhouse/jdbcbridge/impl/JdbcDataSource.java#L464

And from: https://jdbc.postgresql.org/documentation/query/#getting-results-based-on-a-cursor

// make sure autocommit is off
conn.setAutoCommit(false);
Statement st = conn.createStatement();

Looks like for PostgreSQL, in order to use batching of results, you need to disable autocommit.
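
For context, here is a minimal, self-contained sketch of cursor-based fetching with the PostgreSQL JDBC driver (the URL, credentials, and fetch size are placeholders, not values taken from this issue). Per the documentation linked above, the driver only streams results when autocommit is off, the statement is forward-only, the query is a single statement, and a non-zero fetch size is set; otherwise the whole result set is buffered in memory, which is what blows up the bridge here:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class CursorFetchExample {
    public static void main(String[] args) throws Exception {
        // Placeholder connection string; any reachable PostgreSQL instance behaves the same way.
        String url = "jdbc:postgresql://host:5432/microsoft_contoso_integration_200M?user=postgres&password=";
        try (Connection conn = DriverManager.getConnection(url)) {
            conn.setAutoCommit(false);          // cursor mode requires autocommit off
            try (Statement st = conn.createStatement()) {
                st.setFetchSize(10_000);        // pull rows from the server in batches of this size
                try (ResultSet rs = st.executeQuery("select * from factonlinesales")) {
                    long rows = 0;
                    while (rs.next()) {
                        rows++;                 // rows arrive batch by batch instead of all at once
                    }
                    System.out.println("rows read: " + rows);
                }
            }
            conn.commit();                      // end the transaction that kept the cursor open
        }
    }
}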

ruslanen commented 2 years ago

Thank you. It's strange that setAutoCommit can't be configured.

ruslanen commented 2 years ago

The setAutoCommit(false) fix and a rebuild of JDBC Bridge helped, thank you @UnamedRus. It remains only to understand what downsides it can lead to.
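
On the possible downsides: with autocommit off, the read runs inside an explicit transaction, so the connection keeps that transaction and the server-side cursor open until it is committed or rolled back. A hypothetical helper (not the actual JdbcDataSource code) showing the cleanup that becomes the caller's responsibility:

import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public final class StreamingReadHelper {

    // Illustrative only: once autocommit is disabled, committing (or rolling back)
    // and restoring the connection state is up to whoever runs the streaming query.
    static long countRows(Connection conn, String sql, int fetchSize) throws SQLException {
        boolean previousAutoCommit = conn.getAutoCommit();
        conn.setAutoCommit(false);                  // required for cursor-based fetching in PostgreSQL
        try (Statement st = conn.createStatement()) {
            st.setFetchSize(fetchSize);
            long rows = 0;
            try (ResultSet rs = st.executeQuery(sql)) {
                while (rs.next()) {
                    rows++;
                }
            }
            conn.commit();                          // ends the transaction and closes the server-side cursor
            return rows;
        } catch (SQLException e) {
            conn.rollback();                        // avoid leaving a dangling open transaction behind
            throw e;
        } finally {
            conn.setAutoCommit(previousAutoCommit); // restore state for pooled/shared connections
        }
    }
}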

ogirardot commented 1 year ago

Hi @ruslanen, could you go through your configuration and changes? It's almost a year later and I'm facing the same issue, trying to tweak ClickHouse, clickhouse-jdbc-bridge, and the JDBC driver parameters to make it work without any OOM or timeout on a 2M-row table with large CLOBs.

UnamedRus commented 1 year ago

The autocommit fix was a PostgreSQL-specific behavior, and as I can guess from the CLOBs, you are using Oracle, so it may be a different issue.
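
If Oracle is indeed the source, a possible starting point (an assumption on my part, not something verified in this issue) is the standard JDBC fetch-size hint plus the Oracle driver's defaultRowPrefetch connection property; large CLOB columns may still need separate handling, since LOB data is not necessarily prefetched along with the rows:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.Properties;

public class OracleFetchSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.setProperty("user", "app_user");           // placeholder credentials
        props.setProperty("password", "secret");
        props.setProperty("defaultRowPrefetch", "1000"); // Oracle driver: rows fetched per round trip

        String url = "jdbc:oracle:thin:@//host:1521/service"; // placeholder connect string
        try (Connection conn = DriverManager.getConnection(url, props);
             Statement st = conn.createStatement()) {
            st.setFetchSize(1000);                       // standard JDBC hint, honored by the Oracle driver
            try (ResultSet rs = st.executeQuery("select * from some_large_table")) {
                while (rs.next()) {
                    // read rows; CLOB columns may still be materialized per row
                }
            }
        }
    }
}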