Pick up column titles from the nth line

timgdavies commented 8 years ago

In a spreadsheet it is sometimes useful to provide information above the column titles.

For example:

(1) To provide meta-data


License	CC-BY
Downloaded from	http://www.example.com/
Attributions	Example Coop, Ordnance Survey

id	grant_title	grantee/name
01	Test Grant	AnyCharity

(2) To provide additional field documentation

Identifier	Grant Title	Recipient
Identifiers should be unique	Titles should be no more than 10 words long, and easy to understand for non-experts	Include the name the recipient is known as, their legal name can be included elsewhere
id	grant_title	grantee/name
01	Test Grant	AnyCharity

If, like HXL, we allow that the column headers can be given in any row (or any of the first N rows), and that only data below the column header row will be used, we could build much more flexible spreadsheets.

For example, the field title row could be hidden, but there to mark-up an existing tabular data sheet.

We have discussed whether this behaviour should be based on:

Specifying the row to start from
Auto-detection of header row

My preference is for auto-detection.

We discussed auto-detection being an option that needs to be explicitly turned on.

andylolz commented 7 years ago

Ooh, +1 for doing it the HXL way! This is particularly handy where there are deeply nested elements, so the column headers required by flatten-tool are not very human readable.

dalepotter commented 7 years ago

+1 for HXL too.

And like a field containing a short documentation string also sounds useful.

stevieflow commented 7 years ago

+1

johnadamsDFID commented 7 years ago

+1 Excellent idea.

dalepotter commented 7 years ago

As part of the iati.core python library (which aims to better support the the way IATI developers work with the IATI Standard and IATI data), I'm currently working on a user story which may be relevant to the related issue of outputting docs relating to each IATI XPath element.

As a IATI Tech Team Developer I want to build documentation sites that automatically extract definitions and such like from the Schemas and related places, so that I don’t need to write documentation from scratch and manually update it each time something changes.

We're imagining that implementation might be done by something like iati.core.Schema.get_documentation_string(xpath) This would return the English language documentation string.

Happy to provide more detail on the possible implementation of this, if this might be helpful...?

stevieflow commented 7 years ago

Some additional insight on this

@rory09 and I recently developed a template for gathering IATI relevant data from across several country offices of an organisation.

One aspect that helped users (we found) was to keep the xpath details at the top of the document, and slightly shaded out (to imply that they need not be touched!):

sheet

We found this worked well, in terms of results back. To be clear - we didnt provide a template with the path immediately above data cells, but this might be a story worth considering in light of the above

that only data below the column header row will be used

IN our model here, this would represent a challenge

timgdavies commented 7 years ago

It might be useful to think about the options that Open Refine gives when uploading a file

I.e. in this case we want to:

Parse 1st row as header
Drop first two rows of the data
Parse everything else

But in other cases might want the other combination of options.

Another design pattern to consider might be finding a way to mark rows in the first column as not-to-parse (e.g. hidden column A with # in for 'comment columns')

stevieflow commented 7 years ago

Yes In discussion with @Bjwebb we talked about utilising the meta tab to try and (manually) indicate this. This might be useful in terms of us trying different approaches, rather than trying (initially) to build features of UI, etc...

kindly commented 7 years ago

@Bjwebb and I had a further discussion about this. We have excluded auto-detection as an option as we think there are too many things that could go wrong with this. We have 3 options remaining, summarising the above and our discussion.

1. Saying how many (and what) lines to skip in the Metatab.

This means we will will have some kind of special options in the metatab that say how many rows to skip or if we skip row(s) after the heading line.

Pros

Easy to implement
No need for any information on individual pages

Cons

Can't configure each page and so awkward for single CSV pages
Choosing variable name is difficult as it should not clash with any fieldname actually in the data.

2. Let the first column in the spreadsheet be a comment to show the line is skipped.

For example:

#
#	can put anything here
#	Some	Human	Readable	Title
actual	heading	line	here
data	starts	from	here

Pros

Fairly easy to implement (some complications around how we recognise if the first column can have a hash in it).
Very flexible skipping of lines can vary per sheet.

Cons

Odd behaviour (skipping first column) for spreadsheet user or data consumer. More common to skip the first few rows.

3. Let the first line be for configuration starting with a #.

If the first cell A1 is a # the rest of the first line is for configuring the sheet. We can have various configuration options on this line i.e ignore sheet, skip lines, skip after headings, vertical orientation.

For example:

#	skip-line	skip-afterheader
can put anything here
actual	heading	line	here
Some	Human	Readable	Title
data	starts	from	here

Pros

Very flexible and can add lots of commands.
Future proof as can add more options in future.

Cons

Not so easy to know what to put in first line, as non standard way of expressing options.

timgdavies commented 7 years ago

Is there an option 4?

4. Skip the row when the first character of the row is # and use the first non-comment row as the header

#
# can put anything here
# Some	Human	Readable	Title
actual	heading	line	here
data	starts	from	here
# I can also put a comment in the last line	-	-	-
# E.g. the downloaded date	-	-	-

Pros

Gets out of the problem of a blank first column;

Cons

Causes problems if you have data where column 1 includes # as the first character

kindly commented 7 years ago

@timgdavies we excluded options where the "data" not just the "headings" has limits to what can expressed. So 4 was excluded for that reason. We could have it that you are allowed comments continuously at the top but not again after that (i.e in the footer), but this means you would not be able to have the human readable headings after the normal headings. Also having # at the start of only the first human readable heading seems a bit odd.

My preference is for 3, even though it the most painful to implement.

We can have a way for the metatab to define the defaults for all sheets in its top hash line.

Common cases for this # line would be "#, ignore" or "#, vertical". Also, it solves the issue of how express metadata at the top (first point on this ticket) by having "#, metadata-lines-3", so we know the first three lines are lines of metadata explicitly. This way the metadata can parsed (not just skipped), which solves the metatab issue for a single csv file.

I suppose it is fairly ugly but instead of comments its a bit like a #! line at the top of files in order to know how to parse them.

timgdavies commented 7 years ago

Ok - you're selling me on option 3.

I've just been looking at:

W3C Tabular data dialect descriptions which provide a set of possible 'commands' (skipRows)

http://w3c.github.io/csvw/syntax/#tabular-data-embedding-annotations is also particularly useful as a reference source for us I think.

kindly commented 7 years ago

@timgdavies They look good references especially for naming conventions. i.e

#	skipLines 1	headerRows 2
can put anything here
actual	heading	line	here
Some	Human	Readable	Title
data	starts	from	here

I suppose we could also allow any number of continuous # lines at the start (like option 4) for comments, but keep the first line special for "commands".

stevieflow commented 7 years ago

My thinking is that on the example I provided above: https://github.com/OpenDataServices/flatten-tool/issues/130#issuecomment-316770891 - there would need to be another row for configuration. With styling and such, we can make it clear to the user to not touch this - but it could be a risk, but minor

@rory09 - any thoughts?

kindly commented 7 years ago

@stevieflow Just for clarity I hope that you will not need to have that line at the top of every sheet. The aim is that you can define the defaults for all sheets in the Metatab sheet.

rorads commented 7 years ago

I have a few thoughts on the issues mentioned above. I'm going to post the thought which is well formed now, and then have a think about the multiple heading rows discussion.

Commenting out columns and workbook sheets

I've often (and often seen publishers) put an extra column in my data for notes/quick tests using formulas, or put a whole extra sheet in some IATI/360 data to house values for calculations, pivot tables for quick reference, or even charts for overviewing data.

See the red tabs in this image:

For either a column or tab which a user doesn't want to convert, but which they don't want to delete every time they convert their data, I propose the user should be able to 'comment' out that column or sheet with by adding two underscores at the start, so "Org ID Index" would become "__Org ID Index" and a notes column would be "__notes".

N.B. I would have gone with a hashtag, but Tim rightly pointed out this would run counter to the precedent set by HXL, possibly resulting in confusion.

rorads commented 7 years ago

I can see the appeal of option 3, but I have some worries.

Convention inconsistencies between CSV and Excel files

IATI version must be be declared in the metatab in Excel, but there can't be a metatab for CSV files, so there would need to be very clear documentation, and even then it could be confusing / would mean that people would have to be careful in transferring data from csv to excel.

Users/publishers having to program their 2d data

The runaway success of HXL comes from its ability to take existing, easily consumed datasets and simply allowing a publisher to ad a new row in a predictable place and as long as they do that one step correctly, and as long as their data makes sense, they'll produce useful data.

This isn't to say that I think the existence of a command row at the top isn't a good idea - for superusers like myself it's great, but I think we should be very opinionated and somewhat protective of our users in setting very clear and canonical defaults, which allow the 90% of users who just want to keep their headings but add one row below/above them like in HXL.

One of the huge benefits of flatten-tool in my use is that it can be used to convert lightly marked up data, which makes sense as stand alone. Although it's not hard to apply an auto-filter or pivot table to a custom range, for example, it creates a small amount of friction which could stop people exploring the really powerful features of excel and/or IATI.

rorads commented 7 years ago

Wrong button. ![oh dear](facepalm.png)

rorads commented 7 years ago

So with my above call for defaults in mind, we could present users with the following kind of template in the case of Excel (with a meta-tab + IATI version included elsewhere in the workbook):

	A	B	C	D	E
1	Activity ID	Publisher ID	Publisher Type Code	Publisher Name	...
2	iati-identifier	reporting-org/@ref	reporting-org/@type	reporting-org/narrative	...
3	GB-GOV-3-12345	GB-GOV-3	10	UK - Foreign & Commonwealth Office	...
4	GB-GOV-3-12345	GB-GOV-3	10	UK - Foreign & Commonwealth Office	...
5	GB-GOV-3-12345	GB-GOV-3	10	UK - Foreign & Commonwealth Office	...

Then, for a CSV, it could be something like:

#	version 2.02	skip_rows 1	header_rows 2
Activity ID	Publisher ID	Publisher Type Code	Publisher Name	...
iati-identifier	reporting-org/@ref	reporting-org/@type	reporting-org/narrative	...
GB-GOV-3-12345	GB-GOV-3	10	UK - Foreign & Commonwealth Office	...
GB-GOV-3-12345	GB-GOV-3	10	UK - Foreign & Commonwealth Office	...
GB-GOV-3-12345	GB-GOV-3	10	UK - Foreign & Commonwealth Office	...

kindly commented 7 years ago

@rory09

Thanks for your feedback. I have had the same concerns over auto-filtering, pivoting and other functions. HXL has this issue too.

I am happy to support HXL but the standard itself "as far as I can see" does not give us a clear way to say if the sheet is HXL formatted to begin with and flattentool has to know that. Any autodetection we do will be flawed and would be have to be replicated by anyone else consuming the spreadsheets.

I want to start working on this and there is not really a consensus on a solution.

For the time being I think I will go for solution 1 initially but with the syntax discussed in solution 3:

This way we would only allow the # line at the top of the Metatab page and anything configured there applies to all (non-metatab) sheets. Hardly anyone apart from the person making the spreadsheet template would be likely to touch that line. So a metatab page would look like:

#	skipLines 1	headerRows 2
license	CC
version	1.1
extensions

This keeps all the "data" sheets clean (no # at the top) and uniform. This makes for easier data manipulation of those sheets. It does limit us to not having per sheet formatting but that can be added later if it is seen as a need. Next we could have an exception that for single sheet csv or spreadsheets, we would allow the # line as well, but that can be discussed later too.

kindly commented 7 years ago

Also I think we should consider supporting HXL in a way that says every sheet has it. i.e by having a Metatab sheet like the following:

#	hxl
license	CC
version	1.1
extensions

This would mean for every sheet (apart from the metatab) after every heading line there must be the HXL # tag lines. The only thing we would need to discuss is how we represent what flatten-tool needs (the path to the field) within those # tags.

kindly commented 7 years ago

@rory09 I missed the comment about the defaults but they fit into the outline of the plan in my comment.

There will be two types of defaults. Defaults set in flattentool and defaults expressed in the Metatab. To begin with I am going with the defaults set in the Metatab but it should fairly easy to have a way to send a list of defaults to the command line tool. Cove can set a list of defaults for a particular standard that way too.

Bjwebb commented 7 years ago

Our work so far on this is deployed at http://iati.cove.opendataservices.coop/ Only XLSX is supported, but its possible to ignore a whole sheet, or have skipLines or headerRows.

Some example files: https://github.com/OpenDataServices/flatten-tool/blob/master/flattentool/tests/fixtures/xlsx/commands_defaulted.xlsx?raw=true https://github.com/OpenDataServices/flatten-tool/blob/master/flattentool/tests/fixtures/xlsx/commands_ignore.xlsx?raw=true https://github.com/OpenDataServices/flatten-tool/blob/master/flattentool/tests/fixtures/xlsx/commands_in_file.xlsx?raw=true https://github.com/OpenDataServices/flatten-tool/blob/master/flattentool/tests/fixtures/xlsx/commands_in_metatab.xlsx?raw=true

Next steps are to review this and decide whether CSV support is a priority.

stevieflow commented 7 years ago

Amazing - thanks @Bjwebb

@rory09 - please can you review?

I guess we will also need to plan some documentation around this. @Bjwebb : this could be an IATI CoVE sphnix setup?

Bjwebb commented 7 years ago

Yes, https://github.com/OpenDataServices/cove/issues/814 is the issue about writing docs.

jpmckinney commented 6 years ago

Is this closed by #167 and #216? Not sure what the remaining issues are, if not.

OpenDataServices / flatten-tool