I'm currently working on an implementation where a typical resource runs to millions of rows.
At the time I started, xloader had not yet been migrated to CKAN 2.9, so I used datapusher and tried to improve it with HA capabilities, which were merged upstream.
Even with multiple workers it was still too slow for my use case, so I investigated switching to xloader. However, xloader doesn't do type-guessing, which my client requires; thus this PR.
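For context, messytables is what drives the type-guessing that datapusher already does and this PR keeps. A minimal sketch of that kind of call, with a hypothetical local file standing in for the streamed resource:

```python
import messytables

# Hypothetical local file; datapusher actually streams the resource from CKAN.
with open("sample.csv", "rb") as f:
    table_set = messytables.CSVTableSet(f)
    row_set = table_set.tables[0]
    # Guess where the header row sits and what the column names are.
    offset, headers = messytables.headers_guess(row_set.sample)
    row_set.register_processor(messytables.headers_processor(headers))
    row_set.register_processor(messytables.offset_processor(offset + 1))
    # Guess a type (Integer, Decimal, Date, String, ...) for each column.
    types = messytables.type_guess(row_set.sample, strict=True)
    print(list(zip(headers, types)))
```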
I borrowed heavily from xloader :)
It is very fast now, even a tad faster than xloader. I squeezed some more performance out of COPY by using the COPY FREEZE option, obviating the need to maintain WAL during the COPY, and by running a VACUUM ANALYZE immediately afterwards (sketched below).
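For reference, here is a minimal psycopg2 sketch of the COPY FREEZE pattern described above; the table name, file, and DSN are hypothetical, not this PR's actual code. Note that FREEZE requires the target table to be created or truncated in the same transaction as the COPY, and VACUUM cannot run inside a transaction block:

```python
import psycopg2

# Hypothetical DSN; in this PR the write connection would come from COPY_ENGINE_WRITE_URL.
conn = psycopg2.connect("postgresql://ckan:ckan@localhost/datastore")

with conn:  # one transaction: FREEZE needs the truncate and the COPY together
    with conn.cursor() as cur:
        cur.execute('TRUNCATE TABLE "my_resource"')
        with open("sample.csv") as f:
            cur.copy_expert(
                'COPY "my_resource" FROM STDIN WITH (FORMAT CSV, FREEZE, HEADER)', f
            )

# VACUUM must run outside a transaction, so switch to autocommit first.
conn.autocommit = True
with conn.cursor() as cur:
    cur.execute('VACUUM ANALYZE "my_resource"')
conn.close()
```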
@amercader @davidread, I'd appreciate your review/feedback.
I also took the opportunity to fully fix #219. Progress messages are now emitted only every 20 seconds, and include the row count and elapsed time.
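The throttling is straightforward to sketch; the names below are illustrative rather than the PR's actual code:

```python
import time

LOG_INTERVAL = 20  # seconds, per the description above

def with_progress(rows, logger):
    """Pass rows through, logging at most once per LOG_INTERVAL."""
    start = last = time.monotonic()
    count = 0
    for row in rows:
        count += 1
        now = time.monotonic()
        if now - last >= LOG_INTERVAL:
            logger("Copied {:,} rows in {:.1f} seconds".format(count, now - start))
            last = now
        yield row
```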
Here are some benchmarks for the same file, a 100k-row sample of NYC 311 calls, on a VirtualBox VM running Postgres 11 and CKAN 2.9.1 on an 8 GB Ubuntu 16.04 instance:
- Existing datapusher process: 288.17 seconds
- Using COPY: 77.24 seconds, including the VACUUM ANALYZE
Closing this now that there is https://github.com/dathere/datapusher-plus
Resolves #220.
When a queued file's size is greater than `COPY_MODE_SIZE` (bytes) and a `COPY_ENGINE_WRITE_URL` is specified, datapusher uses Postgres COPY, similar to xloader; otherwise, it uses the existing datapusher logic (see the sketch below). The main difference is that we still use messytables to guess the column data types, which xloader currently doesn't do.
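A minimal sketch of that dispatch; the setting names mirror the ones above, while the default threshold and the helper itself are hypothetical:

```python
import os

# Threshold above which COPY mode kicks in; the default here is illustrative.
COPY_MODE_SIZE = int(os.environ.get("COPY_MODE_SIZE", "20000000"))  # bytes
COPY_ENGINE_WRITE_URL = os.environ.get("COPY_ENGINE_WRITE_URL")

def use_copy_mode(file_size):
    # Postgres COPY only for large files and only when a write URL is configured;
    # smaller files keep the existing datapusher (Datastore API) path.
    return bool(COPY_ENGINE_WRITE_URL) and file_size > COPY_MODE_SIZE
```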
Also fixes #219, as the Datastore message is now more informative.