Add option: choose which arrays should be expanded into tables

rhiaro commented 5 years ago

When flattening, multiple sheets/csv files might be generated if there are nested objects in the JSON. The user doesn't always want extra sheets, but in some cases may want the objects all in the main sheet with their full JSON path as column headings, like what rollup does if you include it in a schema.

"Roll up" columns from subsheets into the main sheet if they are specified in a rollUp attribute in the schema.

(rollup docs)

Use cases I know of so far are PPP (one contract) and Dhangadhi (one award, supplier and contract per contracting process) so they can all go in the same sheet.

Following discussions with @kindly and @duncandewhurst, implementation outline:

User input is a file with a list of paths or a single path on the commandline.
Extends existing rollup argument to pass files or paths directly, so the schema doesn't need to be modified.
Only works if there is one object in the array. ie. we will not have duplicate data across multiple rows in the output (consistent with current rollup behaviour).
Column headings will not have array indices in them.

@jpmckinney does that sound like it meets current needs?

jpmckinney commented 5 years ago

Yes!

There are use cases for rolling up multi-item arrays (i.e. preserve array indices in column headings) but these have not been confirmed by actual users.

Sent from mobile

On May 30, 2019, at 3:52 PM, Amy Guy notifications@github.com wrote:

When flattening, multiple sheets/csv files might be generated if there are nested objects in the JSON. The user doesn't always want extra sheets, but in some cases may want the objects all in the main sheet with their full JSON path as column headings, like what rollup does if you include it in a schema.

"Roll up" columns from subsheets into the main sheet if they are specified in a rollUp attribute in the schema.

(rollup docs)

Use cases I know of so far are PPP (one contract) and Dhangadhi (one award, supplier and contract per contracting process) so they can all go in the same sheet.

Following discussions with @kindly and @duncandewhurst, implementation outline:

User input is a file with a list of paths or a single path on the commandline. Extends existing rollup argument to pass files or paths directly, so the schema doesn't need to be modified. Only works if there is one object in the array. ie. we will not have duplicate data across multiple rows in the output (consistent with current rollup behaviour). Column headings will not have array indices in them. @jpmckinney does that sound like it meets current needs?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or mute the thread.

rhiaro commented 5 years ago

Implementation so far works as follows:

--rollup can be passed with no argument, with one or more JSON path strings as arguments, or with a line-separated file of JSON strings attached.
If there's no argument and no schema, or if there is no argument and a schema but the schema has no rollUp attribute, flattening succeeds but no rolling up takes place. (TODO: add a warning)
If there's an argument and a schema with rollUp, the rollup fields from the schema take precedence; the direct or file inputs are ignored.
If --use-titles is also passed but there's no schema, there's a warning, and flattening succeeds with the JSON paths as column headings.

The default behaviour of flattening is to include array indices in the column headings (meaning output from flattening can be unflattened again out of the box). Since rollup only works when there's one item, the column headings are always parent/0/child, and the initial spec for rollup said column headings should not have array indices in them. But stripping the /0/ out would be inconsistent, and I think result in a different unflattened output. So I think now we should keep them in.

jpmckinney commented 5 years ago

Sounds good! And agreed to keep the /0/ in. For the third bullet ("If there's an argument and a schema with rollUp, the rollup fields from the schema take precedence; the direct or file inputs are ignored."), I think a warning should be issued if inputs are ignored.

rhiaro commented 5 years ago

Is it okay for rollup input to be restricted to properties on the root object? Eg. if the rolled up object has 5 of its own fields, they will all be included in the main sheet, you can't specify only specific fields of the rolled up object to include by passing them to rollup. That is (in the testing cafe use case), you can pass dishes to get all of the properties of the dish, but not dishes/title to get only the name of the dish included.

I think we don't need to add the possibility to pass more specific JSON paths to rollup because it can be combined with preserve-fields or filter-* to remove unwanted fields at a lower level. You'd pass --rollup=dishes and preserve-fields={a file containing dishes/title} to roll up only the name of the dish.

jpmckinney commented 5 years ago

That makes sense. Let's consider a different scenario:

Let's say there's a dataset where every contract has at most one related process. So, we want to roll up the related process into the contracts sheet. Does the current roll-up feature allow that?

This is an example where the property to roll-up isn't on the root (release) object.

rhiaro commented 5 years ago

The original rollup feature has never worked on non-root objects, it is only possible to rollup into the main sheet, so unfortunately the extended rollup feature also can't do this.

It will be quite challenging to implement, so is it okay if we log this in an issue for future work? (We can pick it up in the new process if it's a priority.)

jpmckinney commented 5 years ago

Sounds good! Please create the new issue.

robredpath commented 5 years ago

flatten-tool - Add option: choose which arrays should be expanded into tables

robredpath commented 5 years ago

@rhiaro the associated Trello card is archived - can this be closed now?

OpenDataServices / flatten-tool

Add option: choose which arrays should be expanded into tables #301