Closed: workergnome closed this issue 6 months ago
I've done this tons of times, in fact I'd be hard pressed to think of a time when I used the humble csv for cheap user-facing data management, and didn't have multi-value fields. For a current example at OKI, we do this extensively on the Open Data Index, where site configuration is powered by Google Sheets.
@rufuspollock ?
@workergnome great point to flag and we've all probably seen this -- be it tags or something else.
I'd imagine we could simply reuse some of the CSV Dialect spec here as an attribute on a given field in a schema: http://specs.frictionlessdata.io/csv-dialect/
I'd strongly invite you (and anyone else) to start by proposing what you'd like to see in this issue.
Then we can start with setting out a pattern as a first step.
Whether we go beyond that to integration into main spec is something we can then review once the pattern is properly road-tested and we have a good sense the extra complexity in the spec is worth the benefit of serving this use case at spec level.
This is very much a common thing, and maybe we should support it to some extent, but it breaks the "tidy" tabular model which I think is implicit in Data Packages (I guess `"type": "array"` did this too, but at least there was no opportunity to do anything more than check whether a field was an array or not).
So we should be clear about the use case.
After splitting a value by the delimiter, we can validate each element by its field type, and by `constraints` if those are specified.

When retrieving data in a row, what is expected? The Python library will (rightly) fail if the value retrieved for a column in a row doesn't match the type. Should it return an array for the field? Or can it somehow be transformed for transport to pandas or R data frames? (Not sure if that's possible.)
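To make the split-then-validate step concrete, here is a minimal sketch. The `delimiter` and `itemType` keys on the field are hypothetical illustrations, not part of the Table Schema spec:

```python
# Hypothetical multi-value parsing: split the raw cell by a declared
# delimiter, then cast each element to the field's item type.
# "delimiter" and "itemType" are illustrative schema keys, not spec.
CASTS = {"integer": int, "number": float, "string": str}

def parse_multivalue(raw, field):
    delimiter = field.get("delimiter", ",")
    cast = CASTS.get(field.get("itemType", "string"), str)
    return [cast(part.strip()) for part in raw.split(delimiter)]

field = {"name": "scores", "type": "array",
         "delimiter": ";", "itemType": "integer"}
print(parse_multivalue("1; 2; 3", field))  # -> [1, 2, 3]
```

Whether the library should then hand the caller that list, or re-serialize it for a consumer like pandas, is exactly the open question above.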
So, this one seems complex to handle nicely.
There are datasets that, plain and simple, have multi-value fields within them.
UNIHAN is an example: http://www.unicode.org/reports/tr38/
Fields which allow multiple values have a Delimiter defined as “space”.
In my situation (https://github.com/cihai/unihan-tabular), I'm trying to get this dataset to squeeze into datapackages.
I offer two options:
Ultimately, my concern is that resorting to an array or object (which in itself needs schema information to be useful) will be a no-go with libraries downstream. For instance, there isn't a universal, portable way to handle arrays and objects across databases (unless they're stored as blobs, and I'm intent on avoiding that).
But what other way is there to express multi-value data (array) and structured fields (object) with tabular data? IMO, by the time array and object are involved, it's not tabular data anymore. It wouldn't fit the concepts in the tidy data paper.
This goes beyond the scope of the issue, but there actually is a way to handle multi-value (array) fields downstream in libraries like https://github.com/frictionlessdata/jsontableschema-sql-py: you'd break the field's values off into a separate table with a key linking back to the row. As for other libraries like pandas, there is a concept of "joining" data frames. Lighter tabular Python libraries don't understand this concept, though, so you'd run into the same problem in the end: the data is not tabular anymore :laughing:
So.. in my situation I still have this one beast of a dataset that just can't be adequately represented in a flat CSV or tabular format.
To add to this—much of the issue that we have, say with CMOA data, is that the relationship is actually many-to-many, so in order to break it out into tables, we'd actually need to create a join table. With fake data:
| work_id | title |
|---|---|
| 1 | Starry Night |
| 2 | Waterlilies |
| 3 | Fake Collaboration |

| person_id | name |
|---|---|
| 1 | Vincent Van Gogh |
| 2 | Claude Monet |

| person_id | work_id |
|---|---|
| 1 | 1 |
| 2 | 2 |
| 1 | 3 |
| 2 | 3 |
This is doable, of course, but it makes the CSVs really, really difficult for someone to use as a CSV. I'd rather have a way to denormalize this data somehow, so it's as compact as possible.
So, the CSV would need a "sub" delimiter. Something like a semicolon (;) for name fields. This would work bi-directionally (importing and exporting the field value).
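For illustration, here is one way to build that denormalized view from the three tables above, collapsing each work's names into a single semicolon-delimited cell. This is just a sketch using pandas, not anything the spec mandates:

```python
import pandas as pd

# The three normalized tables from the fake-data example above
works = pd.DataFrame({"work_id": [1, 2, 3],
                      "title": ["Starry Night", "Waterlilies",
                                "Fake Collaboration"]})
people = pd.DataFrame({"person_id": [1, 2],
                       "name": ["Vincent Van Gogh", "Claude Monet"]})
links = pd.DataFrame({"person_id": [1, 2, 1, 2],
                      "work_id": [1, 2, 3, 3]})

# Join through the link table, then collapse each work's names into
# one semicolon-delimited cell
denorm = (links.merge(people, on="person_id")
               .merge(works, on="work_id")
               .groupby(["work_id", "title"], sort=True)["name"]
               .agg("; ".join)
               .reset_index())
print(denorm)
# work 3 ends up with "Vincent Van Gogh; Claude Monet"
```

Going the other direction (importing) would split on the same delimiter to recover the join table.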
There's nothing magical about list data; every programming language can split items into an array by delimiter, one way or another. The array spec would need a delimiter option.
Libraries plugging into Data Packages, like the SQL one, would break. Libraries intended for purely tabular information with scalar values wouldn't accept list information.
In my circumstances, this kind of data is niche. But it takes it to the next level:
- In most cases, a pure array, such as in @workergnome's situation, would do. This could be conveyed in CSV.
- In other cases, object would fit. But objects themselves need schema information to be useful. Even if libraries didn't support it, it's of value to declare it.
Also, at this point, there's no (trivial) conversion to stay tabular unless prefixing/namespacing columns was introduced. So a field like this:
[{
"kMandarin": {
"zh-Hans": "qiū",
"zh-Hant": "qiū"
}
}]
Could translate to columns like `kMandarin__zh-Hans` and `kMandarin__zh-Hant`.
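As a sketch of that prefixing idea (the `__` separator is an arbitrary choice for illustration, not anything standardized):

```python
# Flatten a nested object into "__"-prefixed column names so the row
# stays tabular. The "__" separator is an arbitrary, assumed choice.
def flatten(obj, prefix=""):
    flat = {}
    for key, value in obj.items():
        name = f"{prefix}__{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, name))
        else:
            flat[name] = value
    return flat

row = {"kMandarin": {"zh-Hans": "qiū", "zh-Hant": "qiū"}}
print(flatten(row))
# -> {'kMandarin__zh-Hans': 'qiū', 'kMandarin__zh-Hant': 'qiū'}
```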
On my project, though, I break tabular-ness and just go to JSON/YAML. This still conforms to data packages, albeit loosely.
In rare cases (2 or 3 fields), there is an "array"/list of objects, such as with this field, "kHanyuPinyin":
| U+5364 | 卤 | 10093.130:xī,lǔ 74609.020:lǔ,xī |
That is, when put into JSON:
[
{
"ucn": "U+5365",
"char": "卥",
"kHanyuPinyin": [
{
"locations": [
"10093.130"
],
"readings": [
"xī",
"lǔ"
]
},
{
"locations": [
"74609.020"
],
"readings": [
"lǔ",
"xī"
]
}
]
}
]
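For what it's worth, a cell in that shape can be turned into the nested structure with a small sketch. This assumes UNIHAN's conventions for this field: space-separated entries, a `:` between locations and readings, and commas within each list:

```python
# Parse a UNIHAN-style kHanyuPinyin cell into a list of objects.
# Assumes entries of the form "locations:readings", space-separated,
# with each side comma-separated (per the UNIHAN field description).
def parse_khanyupinyin(cell):
    entries = []
    for entry in cell.split():
        locations, readings = entry.split(":")
        entries.append({"locations": locations.split(","),
                        "readings": readings.split(",")})
    return entries

print(parse_khanyupinyin("10093.130:xī,lǔ 74609.020:lǔ,xī"))
```

But note this parser is specific to one field's micro-syntax, which is exactly the problem: each such field would need its own ad hoc rules.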
So for that field, it's an array of objects, with an array for each value. Maybe for cases 2 and 3 in those examples I could open an issue and see if other people have similar circumstances. IMO I'm in a niche case until other people come forward with similar needs.
@akariv and @pwalsh any thoughts here - esp given connection with #409 and #410 (table type).
@tony @workergnome this is great input. I'm getting the sense we basically have either a mini-nested CSV again or an actual JSON-style object within.
The common case of multi-value cells is not exactly the same as #409 and #410. In my experience, multi-value cells are about a special syntax for delimiting values, and importantly, that syntax is not simply an array represented as a string.
this:

```
john; jim
```

and not this:

```
["john", "jim"]
```
So perhaps something like `multiValue: true` and `multiValueDelimiter: ";"` on the field level, where the delimiter is the symbol used to split the value.

My suggestion is to amend the `array` type by allowing multiple formats:
- `format: json` means that the value is expected to be a valid JSON string.
- `format: delimited` means that the value is expected to be a delimited string. In this case we will have an extra property, `delimiter`, a string by which we split the original string into its multiple values. Then we split the string by the delimiter and that's it. With this format, a value such as `Radiohead, Portishead, Talking Heads, "Heads, Hands & Feet"` will be parsed incorrectly.
- `format: csv` means we treat the value as a single-line CSV with a separate dialect (and the `delimiter` becomes part of the dialect). More complicated, but handles cases like the above.

Note that #409 and #410 are completely orthogonal to this issue: both the `itemType` constraint and the new `table` type would work just as well with either of the above formats.
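To illustrate the difference between the two formats, here is a sketch using Python's standard `csv` module for the `format: csv` case:

```python
import csv

value = 'Radiohead, Portishead, Talking Heads, "Heads, Hands & Feet"'

# format: delimited -- a naive split breaks the quoted band name
naive = [part.strip() for part in value.split(",")]
print(len(naive))  # 5 items: the quoted name is split in two

# format: csv -- a CSV reader honouring the dialect's quoting rules
parsed = next(csv.reader([value], skipinitialspace=True))
print(parsed)
# -> ['Radiohead', 'Portishead', 'Talking Heads', 'Heads, Hands & Feet']
```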
I think @akariv suggestion is ripe to become a pattern. @workergnome would you be up for submitting a pattern PR for this?
Apologies, I'm about three years into a new job at this point and unlikely to have time to contribute to this in the near future.
@lauragift21 could you look into working this up as a pattern - i can guide you and its good practice.
@workergnome - no problem, thanks for original suggestions and hope the new job is going well!
@rufuspollock Sure I can. I'll take a look and get up to speed with the previous comments.
Thanks, @rufuspollock! Glad to see this moving forward, and will hopefully be able to re-engage once we're further along with our data export infrastructure here at Getty.
I concur that having multi-valued fields in CSV files is quite a common and useful practice for representing straightforward 1-to-n relationships. It would be valuable to me and my colleagues to be able to declare them in a Data Package; right now we rely on the `pattern` constraint to validate them, without being able to rely on `enum`s. So if I can help in any way, don't hesitate to tell me.
The discussion was continued here - https://github.com/frictionlessdata/specs/issues/736
Support was added in `frictionless-py`, and we are considering adding it as part of the specs in v2 (see #736).
See discussion at https://github.com/cmoa/collection/issues/17