Datapusher enhancements

rossjones commented 9 years ago

Some user stories for datapusher that might be useful to have:

As a system administrator, I want clarity on where to see the datapusher status, so that I am not in the dark about failures.

As a system administrator, I want datapusher to be even easier to deploy, so that I don’t have to change config in .py files.

As a developer, I want datapusher to better handle ‘errors’, so that it doesn’t crash on extra header rows.

As a user, I want to see a status page showing if/why the datapusher failed, so that I can fix my data and try again.

As a developer, I want datapusher to handle all sheets in an XLS, so that I don’t miss out on data.

Aaron-M commented 9 years ago

Couple of additions/comments on use cases:

"As a developer I want datapusher to better handle ‘errors’ So that it doesn’t crash with extra header rows" and so that it correctly identifies and handles the format of data columns (especially not identifying values as date-time when they are not; see issues #1963 and #1964).
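To make the type-detection point concrete, here is a minimal sketch of a conservative rule: a column is only treated as a timestamp if every non-empty value parses against a short list of strict formats. This is not how datapusher currently guesses types; the function names, the format list, and the `"timestamp"`/`"text"` labels are all assumptions made up for illustration.

```python
from datetime import datetime

# Hypothetical list of accepted formats; a real deployment would pick its own.
STRICT_FORMATS = ("%Y-%m-%d", "%Y-%m-%dT%H:%M:%S", "%Y-%m-%d %H:%M:%S")


def looks_like_datetime(value):
    """Return True only if the value matches one of the strict formats."""
    for fmt in STRICT_FORMATS:
        try:
            datetime.strptime(value, fmt)
            return True
        except ValueError:
            continue
    return False


def guess_column_type(values):
    """Guess 'timestamp' only when every non-empty cell parses as a date-time.

    Anything else falls back to 'text', so a stray number or code such as
    '1963' in the column prevents it from being treated as a date-time.
    """
    non_empty = [v.strip() for v in values if v and v.strip()]
    if not non_empty:
        return "text"
    if all(looks_like_datetime(v) for v in non_empty):
        return "timestamp"
    return "text"
```

The design choice here is to prefer a false "text" over a false "timestamp": a column wrongly stored as text can still be loaded, whereas a wrong date-time guess tends to make the load fail.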

"As a developer I want datapusher to handle all sheets in an XLS So that I don’t miss out on data" BUT there may be cases where some sheets are unwanted or unsuitable, and I want to be able to exclude them.

On the last one, regarding all sheets in XLS (and XLSX): I actually get our users to save each worksheet as a CSV or TSV file and upload them separately, so they do go into the datastore. This gives us a non-proprietary version of the data stored alongside the Excel version. For the Excel file I add a sheet for metadata, and a 'contents' sheet as the last worksheet, in which I create an index of the worksheets the Excel file contains, so anyone previewing the file can see there are multiple sheets (and refer to the CSV versions to view them). So I'm somewhat on the fence as to the merit of pushing in all Excel sheets (or multiple but not all). Possibly a config option to toggle that on or off would be useful.
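A rough sketch of what such a toggle might look like, assuming two hypothetical settings: `push_all_sheets` (on/off) and `exclude_sheets` (names to skip, such as the metadata and 'contents' sheets). Neither option exists in datapusher as far as this thread shows; the names are invented for illustration only.

```python
def select_sheets(sheet_names, push_all_sheets=False, exclude_sheets=()):
    """Pick which worksheets to push, given two hypothetical config options.

    sheet_names     -- worksheet names in workbook order
    push_all_sheets -- if False, only the first non-excluded sheet is pushed
    exclude_sheets  -- names to skip, compared case-insensitively
    """
    excluded = {name.strip().lower() for name in exclude_sheets}
    candidates = [
        name for name in sheet_names if name.strip().lower() not in excluded
    ]
    if push_all_sheets:
        return candidates
    # Toggle off: push only the first remaining sheet.
    return candidates[:1]
```

For a workbook with sheets `Data`, `Metadata` and `Contents`, calling `select_sheets(names, True, ["metadata", "contents"])` would leave only `Data` to be pushed.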

We are very close to finalising v1 of a tool, an add-in for Excel that does some QA on the data, helps record metadata about the project and descriptions of the fields, creates a 'contents' worksheet, and then posts to our CKAN repository (as just the Excel file, as tab-delimited text (recommended), or both). We will make it available for others to use, and for us this functionality would negate the need for each Excel worksheet to be 'datapushed'. We are aiming for the end of Feb for this.

Starl3n commented 9 years ago

Just a quick note that Link Digital should be able to take this one on and get it done for CKAN 2.4.

Or, at least get a number of improvements made related to the idea :)

rufuspollock commented 9 years ago

My 2c on this one is that we should probably get this out of CKAN core and really decouple as much as we can. Good ETL is hard and comes close to full "AI" when you are trying to guess where the data starts and ends in some CSV. I'd therefore vote more in terms of things like #18 (ckan import app) which are standalone and integrate with CKAN over a relevant API. Of course, I do get the desire for a seamless UX so we do need to think how we effectively "delegate out" from CKAN (or integrate the app in).

ageara commented 9 years ago

As a system administrator I would like to be able to schedule a revisit interval for externally hosted / linked (not Filestore) data resources so that those in the DataStore do not go stale.
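As a sketch of the check behind such a schedule, assuming each linked resource carries a hypothetical `revisit_interval` and a `last_pushed` timestamp (neither is an existing CKAN or datapusher field as far as this thread shows), a periodic job could decide which resources to re-push like this:

```python
from datetime import datetime, timedelta


def is_revisit_due(last_pushed, revisit_interval, now):
    """Decide whether a linked resource should be fetched and pushed again.

    last_pushed      -- datetime of the last successful push, or None
    revisit_interval -- timedelta set by the administrator, or None to disable
    now              -- current datetime, passed in so the check is testable
    """
    if revisit_interval is None:
        # No schedule configured for this resource.
        return False
    if last_pushed is None:
        # Never pushed before, so it is due immediately.
        return True
    return now - last_pushed >= revisit_interval
```

A cron job or background task would call this for each externally hosted resource and queue a datapusher job for those that are due.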

Datapusher enhancements #124