ckan / datapusher

A standalone web service that pushes data files from a CKAN site resources into its DataStore
GNU Affero General Public License v3.0
77 stars 153 forks source link

Datapusher COPY mode #221

Closed jqnatividad closed 2 years ago

jqnatividad commented 3 years ago

Resolves #220.

When a queued file's size is > COPY_MODE_SIZE (bytes) and a COPY_ENGINE_WRITE_URL is specified, datapusher uses Postgres COPY, similar to xloader, otherwise, use existing datapusher logic.

The main difference being, that we still use messytables to guess the column data types, which xloader currently doesn't do.

Also fixes #219, as the Datastore message is now more informative.

datapusher-copy

jqnatividad commented 3 years ago

Currently working on an implementation where a typical resource is in the millions of rows.

At the time I started, xloader was not migrated to 2.9 yet, so I used datapusher and tried to improve it with HA capabilities, which were merged upstream.

Even with multiple workers, it was still too slow for my use case, and investigated switching to xloader. However, xloader doesn't have type-guessing, which my client requires - thus this PR.

I borrowed heavily from xloader, :)

It is very fast now, even a tad faster than xloader - squeezed some more performance out of COPY by using the COPY FREEZE option - obviating the need to maintain WAL logs during COPY, and doing a VACUUM ANALYZE immediately after.

@amercader @davidread appreciate your review/feedback.

jqnatividad commented 3 years ago

Took the opportunity to fully fix #219. Messages are now only emitted every 20 seconds, and also includes rowcount and elapsed time.

Here are some benchmarks for the same file - a sample of NYC 311 calls with 100k rows on a VirtualBox VM running Postgres 11 and CKAN 2.9.1 on an 8gb Ubuntu 16.04 instance .

Existing datapusher process: 288.17 seconds Using COPY: 77.24 seconds, including a VACUUM ANALYZE.

Screenshot from 2021-01-03 17-18-38 Screenshot from 2021-01-03 17-19-36

jqnatividad commented 2 years ago

Closing this now that there is https://github.com/dathere/datapusher-plus