DataPusher currently uses the datastore_create API to push data to the datastore asynchronously in chunks of 250 rows, after guessing the table schema with messytables from the first 1000 rows. If a datatype casting error occurs beyond those first 1000 rows (e.g. an "N/A" in a numeric field), the async push job stops.
What if, instead of using the datastore_create API, we bulk loaded the data with pgloader (pgloader.io)? It is more reliable and far faster, and it handles casting errors gracefully: rather than stopping the load job, it keeps going and reports which rows had problems.
We could still use messytables to guess the schema, use that information to configure the pgloader job, and run it asynchronously.
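A minimal sketch of how that could fit together, assuming a CKAN-style setup: the connection URL, table name, type mapping, and helper names below are placeholders for illustration, not actual DataPusher internals, and the pgloader command file is kept deliberately simple.

```python
# Hypothetical sketch: guess the schema with messytables, render a pgloader
# command file, and shell out to pgloader so bad rows get rejected instead of
# aborting the whole load. Names and the type mapping are assumptions.
import subprocess
import tempfile

from messytables import (CSVTableSet, headers_guess, headers_processor,
                         offset_processor, type_guess)

# Assumed mapping from messytables type names to PostgreSQL column types.
PG_TYPES = {
    'IntegerType': 'bigint',
    'DecimalType': 'numeric',
    'DateType': 'timestamp',
    'BoolType': 'boolean',
}


def guess_columns(csv_path):
    """Return [(header, pg_type), ...] guessed from a sample of the CSV."""
    with open(csv_path, 'rb') as f:
        row_set = CSVTableSet(f).tables[0]
        offset, headers = headers_guess(row_set.sample)
        row_set.register_processor(headers_processor(headers))
        row_set.register_processor(offset_processor(offset + 1))
        types = type_guess(row_set.sample, strict=True)
    return [(h, PG_TYPES.get(type(t).__name__, 'text'))
            for h, t in zip(headers, types)]


def build_load_command(csv_path, table, columns, pg_url):
    """Render a minimal pgloader LOAD CSV command file for the guessed schema."""
    ddl = ', '.join('"%s" %s' % (name, pg_type) for name, pg_type in columns)
    return """
LOAD CSV
     FROM '{csv}'
     INTO {pg_url}?{table}
     WITH skip header = 1,
          fields optionally enclosed by '"',
          fields terminated by ','
     BEFORE LOAD DO
     $$ CREATE TABLE IF NOT EXISTS "{table}" ({ddl}); $$;
""".format(csv=csv_path, pg_url=pg_url, table=table, ddl=ddl)


def push_with_pgloader(csv_path, table, pg_url):
    columns = guess_columns(csv_path)
    with tempfile.NamedTemporaryFile(suffix='.load', mode='w',
                                     delete=False) as cmd:
        cmd.write(build_load_command(csv_path, table, columns, pg_url))
    # pgloader keeps going on cast errors, rejects bad rows to its reject/log
    # files, and prints a summary we can hand back to the site admin.
    result = subprocess.run(['pgloader', cmd.name],
                            capture_output=True, text=True)
    return result.returncode, result.stdout, result.stderr
```

The existing DataPusher job queue could call something like push_with_pgloader() exactly where it currently streams chunks to datastore_create, so the async behaviour stays the same and only the load mechanism changes.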
The report that pgloader produces could even be used to report the result of the DataPusher job back to the admin.
This would at least give the dataset curator some actionable information for fixing the problem, unlike what we have now.