Non-intuitive messytables behavior

ThrawnCA commented 4 years ago

In testing XLoader and the Data Dictionary recently, I noticed that after setting just one column type to 'timestamp' on a new resource and re-submitting it, the whole Data Dictionary was populated with appropriate types. Checking the Datastore log, I realised that this was because our non-ISO dates (dd/mm/yyyy) had caused a direct PostgreSQL COPY to fail, so it fell back to using messytables - which meant that it guessed all of our types.

Although this behaviour is actually rather useful, it seems rather inconsistent that you get full type-guessing by providing invalid input, yet no such guessing if your data is entirely valid. It's also unfortunate that it presumably replaces any Data Dictionary overrides you may have entered manually.

So, there are two issues:

If there are any existing Data Dictionary overrides, then messytables probably should not be permitted to overwrite them.
How easily could the use of messytables be made configurable? It occurs to me that it would be useful to have at least a setting to use messytables on first creation of a resource, while attempting a fast COPY for further updates. I'm sure that there are also a lot of potential XLoader users (possibly even my own workplace) who would prefer the maximum-backwards-compatibility option of always using messytables, despite the more modest speed gains compared to DataPusher - and as a side benefit, this wouldn't result in the log warning when PostgreSQL COPY fails (which can confuse non-technical users).

ThrawnCA commented 4 years ago

Looks like #36 already addressed the first point.

davidread commented 4 years ago

Good points. This is all an excellent area to improve, it's not something me or my client are actively pursuing, so I'm happy for you to do this.

Definitely when messytables loads it should respect the Data Dictionary.

On detecting types in general use, I'm not keen to rely too much on the messytables load because it does a tiresome conversion to chunks of JSON and then INSERT statements - requires a lot of CPU and it runs on the same machine as CKAN, which was not necessarily the case with DataPusher's separate queue.

However we might still use messytables' type detection ability. I think it scans the first chunk of the CSV and makes a simple decision - I can't remember what the basis is. We could try setting the column types on that basis (the datastore_create) and try the COPY, and if that fails revert to string columns and redo the COPY?

A better alternative might be to still load initially as strings, then use messytables or some other heuristic to guess types, and then see if postgres itself can successfully cast a column to that type. If it can, we set the Data Dictionary for next time. And if next time it fails, then maybe it reverts to strings, COPY and then try the casting thing again.

Dunno what's best, but maybe that sparks some ideas, or a basis for experimentation?

ThrawnCA commented 4 years ago

Well, we have a bunch of data that uses dd/MM/yyyy dates, and will continue to do so for the foreseeable future, which our clients want to have parsed and presented via the API as dates. So continuing to use messytables for those will probably be necessary for us.

(Incidentally, it's further counter-intuitive that messytables will set the Data Dictionary for those fields to 'timestamp', but then they'll fail PostgreSQL COPY because it can't recognise them as timestamps.)

I really think that if nothing else, it would be good to have a "compatibility mode" to just keep using messytables and behaving more like the DataPusher. You're right that it's potentially less performant than a direct COPY, but sometimes it's what people would prefer, especially since it means that they don't have to change their processes. And if someone knows that their data will fail COPY and will fall back every time, then it's actually faster to just use messytables immediately.

requires a lot of CPU and it runs on the same machine as CKAN, which was not necessarily the case with DataPusher's separate queue.

While that is true, it may not matter much in cases where CKAN is in a load-balanced (and possibly auto-scaled) cluster (which ours is).

ThrawnCA commented 4 years ago

If it can, we set the Data Dictionary for next time.

Hmm. This sounds like a potential enhancement to the new paster migrate_types command, attempting to optimise the Data Dictionary for the capabilities of COPY and messytables.

I can think of several possible scenarios:

Data Dictionary overrides exist. Test them with a dry-run COPY.
- COPY succeeds. Do nothing.
- COPY fails. Flag resource for messytables-only loads? Revert to strings?
Data Dictionary overrides do not exist. Guess potential overrides using a dry-run messytables.
- COPY succeeds with all overrides in place. Apply overrides permanently.
- One or more overrides make COPY fail. Populate all overrides from messytables and flag for messytables-only loads in future? Keep using strings? Populate only the successful overrides?
Resource is already flagged for messytables-only loads. Re-test COPY anyway?

This would likely be quite slow, but it's an interesting idea.

ThrawnCA commented 2 years ago

On detecting types in general use, I'm not keen to rely too much on the messytables load because it does a tiresome conversion to chunks of JSON and then INSERT statements - requires a lot of CPU and it runs on the same machine as CKAN, which was not necessarily the case with DataPusher's separate queue.

It's actually still not necessarily the case. You totally can run XLoader on a separate machine; we've been doing that on www.data.qld.gov.au for years now.

davidread commented 2 years ago

@ThrawnCA I'm afraid I'm no longer involved, so I'll leave this with you and the CKAN community - best wishes

ckan / ckanext-xloader

Non-intuitive messytables behavior #90