ckan / ckanext-xloader

Express Loader - quickly load data into DataStore. A replacement for DataPusher.
GNU Affero General Public License v3.0
46 stars 50 forks source link

Non-intuitive messytables behavior #90

Open ThrawnCA opened 4 years ago

ThrawnCA commented 4 years ago

In testing XLoader and the Data Dictionary recently, I noticed that after setting just one column type to 'timestamp' on a new resource and re-submitting it, the whole Data Dictionary was populated with appropriate types. Checking the Datastore log, I realised that this was because our non-ISO dates (dd/mm/yyyy) had caused a direct PostgreSQL COPY to fail, so it fell back to using messytables - which meant that it guessed all of our types.

Although this behaviour is actually rather useful, it seems rather inconsistent that you get full type-guessing by providing invalid input, yet no such guessing if your data is entirely valid. It's also unfortunate that it presumably replaces any Data Dictionary overrides you may have entered manually.

So, there are two issues:

ThrawnCA commented 4 years ago

Looks like #36 already addressed the first point.

davidread commented 4 years ago

Good points. This is all an excellent area to improve, it's not something me or my client are actively pursuing, so I'm happy for you to do this.

Definitely when messytables loads it should respect the Data Dictionary.

On detecting types in general use, I'm not keen to rely too much on the messytables load because it does a tiresome conversion to chunks of JSON and then INSERT statements - requires a lot of CPU and it runs on the same machine as CKAN, which was not necessarily the case with DataPusher's separate queue.

However we might still use messytables' type detection ability. I think it scans the first chunk of the CSV and makes a simple decision - I can't remember what the basis is. We could try setting the column types on that basis (the datastore_create) and try the COPY, and if that fails revert to string columns and redo the COPY?

A better alternative might be to still load initially as strings, then use messytables or some other heuristic to guess types, and then see if postgres itself can successfully cast a column to that type. If it can, we set the Data Dictionary for next time. And if next time it fails, then maybe it reverts to strings, COPY and then try the casting thing again.

Dunno what's best, but maybe that sparks some ideas, or a basis for experimentation?

ThrawnCA commented 4 years ago

Well, we have a bunch of data that uses dd/MM/yyyy dates, and will continue to do so for the foreseeable future, which our clients want to have parsed and presented via the API as dates. So continuing to use messytables for those will probably be necessary for us.

(Incidentally, it's further counter-intuitive that messytables will set the Data Dictionary for those fields to 'timestamp', but then they'll fail PostgreSQL COPY because it can't recognise them as timestamps.)

I really think that if nothing else, it would be good to have a "compatibility mode" to just keep using messytables and behaving more like the DataPusher. You're right that it's potentially less performant than a direct COPY, but sometimes it's what people would prefer, especially since it means that they don't have to change their processes. And if someone knows that their data will fail COPY and will fall back every time, then it's actually faster to just use messytables immediately.

requires a lot of CPU and it runs on the same machine as CKAN, which was not necessarily the case with DataPusher's separate queue.

While that is true, it may not matter much in cases where CKAN is in a load-balanced (and possibly auto-scaled) cluster (which ours is).

ThrawnCA commented 4 years ago

If it can, we set the Data Dictionary for next time.

Hmm. This sounds like a potential enhancement to the new paster migrate_types command, attempting to optimise the Data Dictionary for the capabilities of COPY and messytables.

I can think of several possible scenarios:

This would likely be quite slow, but it's an interesting idea.

ThrawnCA commented 2 years ago

On detecting types in general use, I'm not keen to rely too much on the messytables load because it does a tiresome conversion to chunks of JSON and then INSERT statements - requires a lot of CPU and it runs on the same machine as CKAN, which was not necessarily the case with DataPusher's separate queue.

It's actually still not necessarily the case. You totally can run XLoader on a separate machine; we've been doing that on www.data.qld.gov.au for years now.

davidread commented 2 years ago

@ThrawnCA I'm afraid I'm no longer involved, so I'll leave this with you and the CKAN community - best wishes