clingen-data-model / clinvar-streams


Improve write performance in clinvar combiner #48

Closed theferrit32 closed 2 years ago

theferrit32 commented 2 years ago

I implemented a batch insert query builder in the combiner. It iterates over the messages from kafka, partitioned first by release sentinels, then by table, then into batches of 500, and it increased the write performance to sqlite from ~400 to ~910 messages per second.

sqlite does have a default maximum query size, which means there needs to be a numeric limit on the number of messages included in a single batch insert. However, it may be possible to use the java.sql.PreparedStatement/addBatch functionality to achieve a similar i/o throughput improvement from batching writes.
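A minimal sketch of the multi-row insert approach, in Clojure with plain JDBC interop. The names here (`batch-insert-sql`, `write-batches!`, and the `conn`/`table`/`cols`/`rows` shapes) are illustrative assumptions, not the actual combiner code:

```clojure
(ns clinvar-streams.batch-insert-sketch
  (:require [clojure.string :as string]))

(defn batch-insert-sql
  "Builds one multi-row INSERT statement for a single table and batch of rows."
  [table cols rows]
  (let [row-placeholder (str "(" (string/join ", " (repeat (count cols) "?")) ")")]
    (str "INSERT INTO " table " (" (string/join ", " cols) ") VALUES "
         (string/join ", " (repeat (count rows) row-placeholder)))))

(defn write-batches!
  "Partitions rows into batches of 500 and executes one multi-row INSERT per
  batch, so each statement stays under sqlite's default maximum query size."
  [^java.sql.Connection conn table cols rows]
  (doseq [batch (partition-all 500 rows)]
    (with-open [stmt (.prepareStatement conn (batch-insert-sql table cols batch))]
      ;; Bind every value in the batch; JDBC parameters are 1-indexed.
      (doseq [[i v] (map-indexed vector (apply concat batch))]
        (.setObject stmt (inc i) v))
      (.executeUpdate stmt))))
```

Each batch becomes a single `INSERT INTO ... VALUES (?,...), (?,...), ...` statement, so the number of round trips to sqlite drops by roughly the batch size.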

theferrit32 commented 2 years ago

Increasing the size of the ssd persistent disk from 50GiB to 100GiB also helped, roughly doubling the observed messages-per-second write speed. This crossed one of the performance thresholds in google cloud (64GiB), so that makes sense: https://cloud.google.com/compute/docs/disks/performance#performance_by_disk_size

theferrit32 commented 2 years ago

Will check in the current state of the code and then evaluate PreparedStatement/addBatch https://docs.oracle.com/en/java/javase/11/docs/api/java.sql/java/sql/PreparedStatement.html#addBatch()

Statement/addBatch accepts a new sql statement string on each call, whereas PreparedStatement/addBatch reuses the prepared insert sql and only queues a new set of parameter bindings. Since our stream messages are well grouped by entity type, PreparedStatement/addBatch will work fine: the same insert sql can be reused on each call until the next entity type is seen.
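A sketch of that alternative, under the same assumptions as above. `insert-sql-for` (a function from table name to a parameterized single-row INSERT) and `rows-by-table` (e.g. the output of `partition-by` over the already-grouped stream messages) are hypothetical names for illustration:

```clojure
(defn write-with-add-batch!
  "Executes grouped inserts via PreparedStatement/addBatch, reusing one
  prepared insert statement per run of rows sharing an entity type."
  [^java.sql.Connection conn insert-sql-for rows-by-table]
  ;; rows-by-table is a seq of [table rows] pairs.
  (doseq [[table rows] rows-by-table]
    (with-open [stmt (.prepareStatement conn (insert-sql-for table))]
      (doseq [row rows]
        (doseq [[i v] (map-indexed vector row)]
          (.setObject stmt (inc i) v))
        ;; Queue this row's parameter bindings against the same insert sql.
        (.addBatch stmt))
      ;; Flush all queued rows to the driver in one call.
      (.executeBatch stmt))))
```

This avoids building very large SQL strings (so the sqlite maximum query size is less of a concern), while still amortizing statement preparation and driver round trips across each run of same-typed messages.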