OpenDataServices / flatten-tool

Tools for generating CSV and other flat versions of the structured data
http://flatten-tool.readthedocs.io/en/latest/
MIT License
101 stars 18 forks source link

Pick up column titles from the nth line #130

Open timgdavies opened 7 years ago

timgdavies commented 7 years ago

In a spreadsheet it is sometimes useful to provide information above the column titles.

For example:

(1) To provide meta-data

License CC-BY
Downloaded from http://www.example.com/
Attributions Example Coop, Ordnance Survey
id grant_title grantee/name
01 Test Grant AnyCharity

(2) To provide additional field documentation

Identifier Grant Title Recipient
Identifiers should be unique Titles should be no more than 10 words long, and easy to understand for non-experts Include the name the recipient is known as, their legal name can be included elsewhere
id grant_title grantee/name
01 Test Grant AnyCharity

If, like HXL, we allow that the column headers can be given in any row (or any of the first N rows), and that only data below the column header row will be used, we could build much more flexible spreadsheets.

For example, the field title row could be hidden, but there to mark-up an existing tabular data sheet.

We have discussed whether this behaviour should be based on:

My preference is for auto-detection.

We discussed auto-detection being an option that needs to be explicitly turned on.

andylolz commented 7 years ago

Ooh, +1 for doing it the HXL way! This is particularly handy where there are deeply nested elements, so the column headers required by flatten-tool are not very human readable.

dalepotter commented 7 years ago

+1 for HXL too.

And like a field containing a short documentation string also sounds useful.

stevieflow commented 7 years ago

+1

johnadamsDFID commented 7 years ago

+1 Excellent idea.

dalepotter commented 7 years ago

As part of the iati.core python library (which aims to better support the the way IATI developers work with the IATI Standard and IATI data), I'm currently working on a user story which may be relevant to the related issue of outputting docs relating to each IATI XPath element.

As a IATI Tech Team Developer I want to build documentation sites that automatically extract definitions and such like from the Schemas and related places, so that I don’t need to write documentation from scratch and manually update it each time something changes.

We're imagining that implementation might be done by something like iati.core.Schema.get_documentation_string(xpath) This would return the English language documentation string.

Happy to provide more detail on the possible implementation of this, if this might be helpful...?

stevieflow commented 7 years ago

Some additional insight on this

@rory09 and I recently developed a template for gathering IATI relevant data from across several country offices of an organisation.

One aspect that helped users (we found) was to keep the xpath details at the top of the document, and slightly shaded out (to imply that they need not be touched!):

sheet

We found this worked well, in terms of results back. To be clear - we didnt provide a template with the path immediately above data cells, but this might be a story worth considering in light of the above

that only data below the column header row will be used

IN our model here, this would represent a challenge

timgdavies commented 7 years ago

It might be useful to think about the options that Open Refine gives when uploading a file

image

I.e. in this case we want to:

But in other cases might want the other combination of options.

Another design pattern to consider might be finding a way to mark rows in the first column as not-to-parse (e.g. hidden column A with # in for 'comment columns')

stevieflow commented 7 years ago

Yes In discussion with @Bjwebb we talked about utilising the meta tab to try and (manually) indicate this. This might be useful in terms of us trying different approaches, rather than trying (initially) to build features of UI, etc...

kindly commented 7 years ago

@Bjwebb and I had a further discussion about this. We have excluded auto-detection as an option as we think there are too many things that could go wrong with this. We have 3 options remaining, summarising the above and our discussion.

1. Saying how many (and what) lines to skip in the Metatab.

This means we will will have some kind of special options in the metatab that say how many rows to skip or if we skip row(s) after the heading line.

Pros

Cons

2. Let the first column in the spreadsheet be a comment to show the line is skipped.

For example:

#
# can put anything here
# Some Human Readable Title
actual heading line here
data starts from here

Pros

Cons

3. Let the first line be for configuration starting with a #.

If the first cell A1 is a # the rest of the first line is for configuring the sheet. We can have various configuration options on this line i.e ignore sheet, skip lines, skip after headings, vertical orientation.

For example:

# skip-line skip-afterheader
can put anything here
actual heading line here
Some Human Readable Title
data starts from here

Pros

Cons

timgdavies commented 7 years ago

Is there an option 4?

4. Skip the row when the first character of the row is # and use the first non-comment row as the header

#
# can put anything here
# Some Human Readable Title
actual heading line here
data starts from here
# I can also put a comment in the last line - - -
# E.g. the downloaded date - - -

Pros

Cons

kindly commented 7 years ago

@timgdavies we excluded options where the "data" not just the "headings" has limits to what can expressed. So 4 was excluded for that reason. We could have it that you are allowed comments continuously at the top but not again after that (i.e in the footer), but this means you would not be able to have the human readable headings after the normal headings. Also having # at the start of only the first human readable heading seems a bit odd.

My preference is for 3, even though it the most painful to implement.

We can have a way for the metatab to define the defaults for all sheets in its top hash line.

Common cases for this # line would be "#, ignore" or "#, vertical". Also, it solves the issue of how express metadata at the top (first point on this ticket) by having "#, metadata-lines-3", so we know the first three lines are lines of metadata explicitly. This way the metadata can parsed (not just skipped), which solves the metatab issue for a single csv file.

I suppose it is fairly ugly but instead of comments its a bit like a #! line at the top of files in order to know how to parse them.

timgdavies commented 7 years ago

Ok - you're selling me on option 3.

I've just been looking at:

W3C Tabular data dialect descriptions which provide a set of possible 'commands' (skipRows)

http://w3c.github.io/csvw/syntax/#tabular-data-embedding-annotations is also particularly useful as a reference source for us I think.

kindly commented 7 years ago

@timgdavies They look good references especially for naming conventions. i.e

# skipLines 1 headerRows 2
can put anything here
actual heading line here
Some Human Readable Title
data starts from here

I suppose we could also allow any number of continuous # lines at the start (like option 4) for comments, but keep the first line special for "commands".

stevieflow commented 7 years ago

My thinking is that on the example I provided above: https://github.com/OpenDataServices/flatten-tool/issues/130#issuecomment-316770891 - there would need to be another row for configuration. With styling and such, we can make it clear to the user to not touch this - but it could be a risk, but minor

@rory09 - any thoughts?

kindly commented 7 years ago

@stevieflow Just for clarity I hope that you will not need to have that line at the top of every sheet. The aim is that you can define the defaults for all sheets in the Metatab sheet.

rorads commented 7 years ago

I have a few thoughts on the issues mentioned above. I'm going to post the thought which is well formed now, and then have a think about the multiple heading rows discussion.

Commenting out columns and workbook sheets

I've often (and often seen publishers) put an extra column in my data for notes/quick tests using formulas, or put a whole extra sheet in some IATI/360 data to house values for calculations, pivot tables for quick reference, or even charts for overviewing data.

See the red tabs in this image:

image

For either a column or tab which a user doesn't want to convert, but which they don't want to delete every time they convert their data, I propose the user should be able to 'comment' out that column or sheet with by adding two underscores at the start, so "Org ID Index" would become "__Org ID Index" and a notes column would be "__notes".

N.B. I would have gone with a hashtag, but Tim rightly pointed out this would run counter to the precedent set by HXL, possibly resulting in confusion.

rorads commented 6 years ago

I can see the appeal of option 3, but I have some worries.

Convention inconsistencies between CSV and Excel files

IATI version must be be declared in the metatab in Excel, but there can't be a metatab for CSV files, so there would need to be very clear documentation, and even then it could be confusing / would mean that people would have to be careful in transferring data from csv to excel.

Users/publishers having to program their 2d data

The runaway success of HXL comes from its ability to take existing, easily consumed datasets and simply allowing a publisher to ad a new row in a predictable place and as long as they do that one step correctly, and as long as their data makes sense, they'll produce useful data.

This isn't to say that I think the existence of a command row at the top isn't a good idea - for superusers like myself it's great, but I think we should be very opinionated and somewhat protective of our users in setting very clear and canonical defaults, which allow the 90% of users who just want to keep their headings but add one row below/above them like in HXL.

One of the huge benefits of flatten-tool in my use is that it can be used to convert lightly marked up data, which makes sense as stand alone. Although it's not hard to apply an auto-filter or pivot table to a custom range, for example, it creates a small amount of friction which could stop people exploring the really powerful features of excel and/or IATI.

rorads commented 6 years ago

Wrong button. ![oh dear](facepalm.png)

rorads commented 6 years ago

So with my above call for defaults in mind, we could present users with the following kind of template in the case of Excel (with a meta-tab + IATI version included elsewhere in the workbook):

A B C D E
1 Activity ID Publisher ID Publisher Type Code Publisher Name ...
2 iati-identifier reporting-org/@ref reporting-org/@type reporting-org/narrative ...
3 GB-GOV-3-12345 GB-GOV-3 10 UK - Foreign & Commonwealth Office ...
4 GB-GOV-3-12345 GB-GOV-3 10 UK - Foreign & Commonwealth Office ...
5 GB-GOV-3-12345 GB-GOV-3 10 UK - Foreign & Commonwealth Office ...

Then, for a CSV, it could be something like:

# version 2.02 skip_rows 1 header_rows 2
Activity ID Publisher ID Publisher Type Code Publisher Name ...
iati-identifier reporting-org/@ref reporting-org/@type reporting-org/narrative ...
GB-GOV-3-12345 GB-GOV-3 10 UK - Foreign & Commonwealth Office ...
GB-GOV-3-12345 GB-GOV-3 10 UK - Foreign & Commonwealth Office ...
GB-GOV-3-12345 GB-GOV-3 10 UK - Foreign & Commonwealth Office ...
kindly commented 6 years ago

@rory09

Thanks for your feedback. I have had the same concerns over auto-filtering, pivoting and other functions. HXL has this issue too.

I am happy to support HXL but the standard itself "as far as I can see" does not give us a clear way to say if the sheet is HXL formatted to begin with and flattentool has to know that. Any autodetection we do will be flawed and would be have to be replicated by anyone else consuming the spreadsheets.

I want to start working on this and there is not really a consensus on a solution.

For the time being I think I will go for solution 1 initially but with the syntax discussed in solution 3:

This way we would only allow the # line at the top of the Metatab page and anything configured there applies to all (non-metatab) sheets. Hardly anyone apart from the person making the spreadsheet template would be likely to touch that line. So a metatab page would look like:

# skipLines 1 headerRows 2
license CC
version 1.1
extensions

This keeps all the "data" sheets clean (no # at the top) and uniform. This makes for easier data manipulation of those sheets. It does limit us to not having per sheet formatting but that can be added later if it is seen as a need. Next we could have an exception that for single sheet csv or spreadsheets, we would allow the # line as well, but that can be discussed later too.

kindly commented 6 years ago

Also I think we should consider supporting HXL in a way that says every sheet has it. i.e by having a Metatab sheet like the following:

# hxl
license CC
version 1.1
extensions

This would mean for every sheet (apart from the metatab) after every heading line there must be the HXL # tag lines. The only thing we would need to discuss is how we represent what flatten-tool needs (the path to the field) within those # tags.

kindly commented 6 years ago

@rory09 I missed the comment about the defaults but they fit into the outline of the plan in my comment.

There will be two types of defaults. Defaults set in flattentool and defaults expressed in the Metatab. To begin with I am going with the defaults set in the Metatab but it should fairly easy to have a way to send a list of defaults to the command line tool. Cove can set a list of defaults for a particular standard that way too.

Bjwebb commented 6 years ago

Our work so far on this is deployed at http://iati.cove.opendataservices.coop/ Only XLSX is supported, but its possible to ignore a whole sheet, or have skipLines or headerRows.

Some example files: https://github.com/OpenDataServices/flatten-tool/blob/master/flattentool/tests/fixtures/xlsx/commands_defaulted.xlsx?raw=true https://github.com/OpenDataServices/flatten-tool/blob/master/flattentool/tests/fixtures/xlsx/commands_ignore.xlsx?raw=true https://github.com/OpenDataServices/flatten-tool/blob/master/flattentool/tests/fixtures/xlsx/commands_in_file.xlsx?raw=true https://github.com/OpenDataServices/flatten-tool/blob/master/flattentool/tests/fixtures/xlsx/commands_in_metatab.xlsx?raw=true

Next steps are to review this and decide whether CSV support is a priority.

stevieflow commented 6 years ago

Amazing - thanks @Bjwebb

@rory09 - please can you review?

I guess we will also need to plan some documentation around this. @Bjwebb : this could be an IATI CoVE sphnix setup?

Bjwebb commented 6 years ago

Yes, https://github.com/OpenDataServices/cove/issues/814 is the issue about writing docs.

jpmckinney commented 5 years ago

Is this closed by #167 and #216? Not sure what the remaining issues are, if not.