DataPusher currently uses the datastore_create API to push data to the datastore asynchronously in chunks of 250 rows, after guessing the table schema with messytables from the first 1000 rows. If a datatype casting error occurs beyond those first 1000 rows (e.g. an "N/A" in a numeric field), the async push job stops.
What if, instead of using the datastore_create API, we bulk loaded the data with pgloader (pgloader.io)? It is more reliable and far faster, and it handles casting errors gracefully: rather than stopping the load job, it keeps going and reports which rows had problems.
We could still use messytables to guess the schema, use that information to configure the pgloader job, and run it asynchronously.
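A minimal sketch of how that could fit together, assuming a CKAN-style setup: the connection URL, table name, type mapping, and helper names below are placeholders for illustration, not actual DataPusher internals, and the pgloader command file is kept deliberately simple.

```python
# Hypothetical sketch: guess the schema with messytables, render a pgloader
# command file, and shell out to pgloader so bad rows get rejected instead of
# aborting the whole load. Names and the type mapping are assumptions.
import subprocess
import tempfile

from messytables import (CSVTableSet, headers_guess, headers_processor,
                         offset_processor, type_guess)

# Assumed mapping from messytables type names to PostgreSQL column types.
PG_TYPES = {
    'IntegerType': 'bigint',
    'DecimalType': 'numeric',
    'DateType': 'timestamp',
    'BoolType': 'boolean',
}


def guess_columns(csv_path):
    """Return [(header, pg_type), ...] guessed from a sample of the CSV."""
    with open(csv_path, 'rb') as f:
        row_set = CSVTableSet(f).tables[0]
        offset, headers = headers_guess(row_set.sample)
        row_set.register_processor(headers_processor(headers))
        row_set.register_processor(offset_processor(offset + 1))
        types = type_guess(row_set.sample, strict=True)
    return [(h, PG_TYPES.get(type(t).__name__, 'text'))
            for h, t in zip(headers, types)]


def build_load_command(csv_path, table, columns, pg_url):
    """Render a minimal pgloader LOAD CSV command file for the guessed schema."""
    ddl = ', '.join('"%s" %s' % (name, pg_type) for name, pg_type in columns)
    return """
LOAD CSV
     FROM '{csv}'
     INTO {pg_url}?{table}
     WITH skip header = 1,
          fields optionally enclosed by '"',
          fields terminated by ','
     BEFORE LOAD DO
     $$ CREATE TABLE IF NOT EXISTS "{table}" ({ddl}); $$;
""".format(csv=csv_path, pg_url=pg_url, table=table, ddl=ddl)


def push_with_pgloader(csv_path, table, pg_url):
    columns = guess_columns(csv_path)
    with tempfile.NamedTemporaryFile(suffix='.load', mode='w',
                                     delete=False) as cmd:
        cmd.write(build_load_command(csv_path, table, columns, pg_url))
    # pgloader keeps going on cast errors, rejects bad rows to its reject/log
    # files, and prints a summary we can hand back to the site admin.
    result = subprocess.run(['pgloader', cmd.name],
                            capture_output=True, text=True)
    return result.returncode, result.stdout, result.stderr
```

The existing DataPusher job queue could call something like push_with_pgloader() exactly where it currently streams chunks to datastore_create, so the async behaviour stays the same and only the load mechanism changes.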
The report that pgloader produces could even be used to report the result of the DataPusher job back to the admin.
This would at least give the dataset curator some actionable information for fixing the problem, unlike what we have now.