ethlo / my2ch

Scheduled transfer of full MySQL query results to automatically generated, corresponding Clickhouse tables. Extremely useful for extracting data for analysis through ClickHouse.
Apache License 2.0
2 stars 0 forks source link

Gaps in synchronised records #7

Open wodzuu opened 12 months ago

wodzuu commented 12 months ago

Running My2Ch on large data set results in gaps in synchronised records. To reproduce run the tool on the database of size of several hundreds of millions of records and run counts on both MySQL and Clickhouse tables. You will see different results.

Most likely this is caused by the fact that data is not queried from MySQL in any order. Adding "ORDER BY {primary-key}" to the transfer query might solve the problem however it introduces another one: Sorting on SELECTs from view is slow (compared to sorting on raw SELECTS from tables in MySQLs). Is the solution with view really needed?

Another idea is to skip ordering the results in favour of synchronisation of the data in pages of 1mln (1mln creates one page in Clickhouse and insert is atomic).

ethlo commented 12 months ago

I assume the simplest would be to just order by the PK in incremental transfers. It also makes sense to keep the number of transferred rows below the threshold for parts, so we maintain atomicity if something goes wrong.

wodzuu commented 11 months ago

Yes, that would be the simplest although not the fastest. My experiments have shown that doing ... WHERE id >= n*1000000 AND id < (n+1)*1000000 in a loop over n (because we don't really care about the order within this window) gives much better results on large data sets. But I realise this may not be a generic solution for all cases.

ethlo commented 11 months ago

A view is MySQL is nothing more than a live view, so there is no difference in performance using a view or a bigger query when sorting the data. I could add the range transfer as an option though.