frictionlessdata / datapackage

Data Package is a standard consisting of a set of simple yet extensible specifications to describe datasets, data files and tabular data. It is a data definition language (DDL) and data API that facilitates findability, accessibility, interoperability, and reusability (FAIR) of data.
https://datapackage.org
The Unlicense

There should be a way to indicate delimiters for multi-value cells. #381

Closed: workergnome closed this issue 6 months ago

workergnome commented 7 years ago

See discussion at https://github.com/cmoa/collection/issues/17

pwalsh commented 7 years ago

I've done this tons of times; in fact, I'd be hard pressed to think of a time when I used the humble CSV for cheap user-facing data management and didn't have multi-value fields. For a current example at OKI, we do this extensively on the Open Data Index, where site configuration is powered by Google Sheets.

@rufuspollock ?

rufuspollock commented 7 years ago

@workergnome great point to flag and we've all probably seen this -- be it tags or something else.

I'd imagine we could simply reuse some of the CSV Dialect spec here as an attribute on a given field in a schema: http://specs.frictionlessdata.io/csv-dialect/

I'd strongly invite you (and anyone else) to start by proposing what you'd like to see in this issue.

Then we can start with setting out a pattern as a first step.

Whether we go beyond that to integration into the main spec is something we can then review once the pattern is properly road-tested and we have a good sense that the extra complexity in the spec is worth the benefit of serving this use case at spec level.

danfowler commented 7 years ago

This is a very common thing, and we should probably support it to some extent, but it breaks the "tidy" tabular model that I think is implicit in Data Packages (I guess "type": "array" did this too, but at least there was no opportunity to do anything more than check whether a field was an array or not).

So we should be clear about the use case.

Validation

After splitting a value by "delimiter", we can validate each element by its field type if we specify something like the following:
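A minimal sketch of that validation, assuming hypothetical `delimiter` and `itemType` field properties (neither is part of the current spec):

    # Sketch: validate a multi-value cell against a hypothetical field schema.
    # "delimiter" and "itemType" are assumptions here, not current Table Schema.
    field = {"name": "scores", "type": "array", "delimiter": ";", "itemType": "integer"}

    def validate_cell(cell, field):
        items = [item.strip() for item in cell.split(field["delimiter"])]
        if field.get("itemType") == "integer":
            return all(item.lstrip("-").isdigit() for item in items)
        return True  # other item types omitted from this sketch

    print(validate_cell("1; 2; 3", field))    # True
    print(validate_cell("1; two; 3", field))  # False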

Retrieval/Transformation

When retrieving data in a row, what is expected? The Python library will (rightly) fail if the value retrieved for a column doesn't match the field type. Should it return an array for the field? Or can it somehow be transformed for transport to pandas or R data frames? (Not sure if that's possible.)
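For pandas at least, one conceivable route is to return the raw string and split it in a post-processing step; a sketch (not something the library does today):

    import io
    import pandas as pd

    csv = io.StringIO("work_id,artists\n1,Vincent Van Gogh\n3,Vincent Van Gogh;Claude Monet\n")
    df = pd.read_csv(csv)

    # Split each multi-value cell into a list, then explode to one row per value.
    df["artists"] = df["artists"].str.split(";")
    print(df.explode("artists"))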

So, this one seems complex to handle nicely.

tony commented 7 years ago

There are datasets that, plain and simple, have multi-value fields within them.

UNIHAN is an example: http://www.unicode.org/reports/tr38/

Fields which allow multiple values have a Delimiter defined as “space”.

In my situation (https://github.com/cihai/unihan-tabular), I'm trying to squeeze this dataset into Data Packages.

I offer two options:

  1. a purely tabular output that keeps multi-value fields pristine, leaving it up to the user to split them up
  2. a "structured" export where I split up the fields for the user, but this requires using arrays and objects.

Ultimately, my concern is that resorting to arrays and objects (which themselves need schema information to be useful) will be a no-go with libraries downstream. For instance, there isn't a universal, portable way to handle arrays and objects across databases (unless they're stored as blobs, and I'm intent on avoiding that).

tony commented 7 years ago

But what other way is there to express multi-value data (arrays) and structured fields (objects) with tabular data? IMO, by the time arrays and objects are involved, it's not tabular data anymore. It wouldn't fit the concepts in the tidy data paper.

This goes beyond the scope of the issue, but there actually is a way to handle multi-value (array) fields downstream in libraries like https://github.com/frictionlessdata/jsontableschema-sql-py: you'd break the field's values off into a separate table with a key linking back to the row. As for other libraries like pandas, there is the concept of "joining" data frames, but lighter tabular Python libraries don't understand this concept, so you'd end up back at the same problem: the data is not tabular anymore :laughing:

So.. in my situation I still have this one beast of a dataset that just can't be adequately represented in a flat CSV or tabular format.

workergnome commented 7 years ago

To add to this—much of the issue that we have, say with CMOA data, is that the relationship is actually many-to-many, so in order to break it out into tables, we'd actually need to create a join table. With fake data:

| work_id | title |
| --- | --- |
| 1 | Starry Night |
| 2 | Waterlilies |
| 3 | Fake Collaboration |

| person_id | name |
| --- | --- |
| 1 | Vincent Van Gogh |
| 2 | Claude Monet |

| person_id | work_id |
| --- | --- |
| 1 | 1 |
| 2 | 2 |
| 1 | 3 |
| 2 | 3 |

This is doable, of course, but it makes the CSVs really, really difficult for someone to use as a CSV. I'd rather have a way to denormalize this data somehow, so it's as compact as possible.
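For example, assuming a semicolon as the sub-delimiter, the same fake data could collapse into a single table:

| work_id | title | artists |
| --- | --- | --- |
| 1 | Starry Night | Vincent Van Gogh |
| 2 | Waterlilies | Claude Monet |
| 3 | Fake Collaboration | Vincent Van Gogh;Claude Monet |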

tony commented 7 years ago

So, the CSV would need a "sub" delimiter. Something like a semicolon (;) for name fields. This would work bi-directionally (importing and exporting the field value).

There's nothing magical about list data; every programming language can split items into an array by delimiter, one way or another. The array spec would need a delimiter option.
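A sketch of that round trip in Python:

    # Splitting on import, joining on export; the round trip is lossless
    # as long as the delimiter never appears inside a value.
    names = "Vincent Van Gogh;Claude Monet".split(";")
    print(names)            # ['Vincent Van Gogh', 'Claude Monet']
    print(";".join(names))  # Vincent Van Gogh;Claude Monet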

Libraries plugging into Data Packages, like the SQL one, would break. Libraries intended for purely tabular information with scalar values wouldn't accept list data.

In my circumstances, this kind of data is niche. But it takes things to the next level:

  1. In most cases, a pure array, such as in @workergnome's situation, would do. This could be conveyed in CSV.

  2. In other cases, an object would fit. But objects themselves need schema information to be useful. Even if libraries didn't support it, it's of value to declare it.

    Also, at this point, there's no (trivial) conversion to stay tabular unless prefixing/namespacing columns were introduced. So a field like this:

    [{
      "kMandarin": {
         "zh-Hans": "qiū",
         "zh-Hant": "qiū"
       }
     }]

    Could translate to columns like kMandarin__zh-Hans and kMandarin__zh-Hant (see the flattening sketch at the end of this comment).

    On my project, though, I break tabular-ness and just go to JSON/YAML. This still conforms to data packages, albeit loosely.

  3. In rare cases (2 or 3), there is an "array"/list of objects. Such as with this field, "kHanyuPinyin":

    | U+5364 | 卤 | 10093.130:xī,lǔ 74609.020:lǔ,xī |

    That is, when put into JSON:

    [
      {
        "ucn": "U+5365",
        "char": "卥",
        "kHanyuPinyin": [
          {
            "locations": [
              "10093.130"
            ],
            "readings": [
              "xī",
              "lǔ"
            ]
          },
          {
            "locations": [
              "74609.020"
            ],
            "readings": [
              "lǔ",
              "xī"
            ]
          }
        ]
      }
    ]

    So for that field, it's an array of objects, with an array for each value. For cases 2 and 3 above, I could maybe open a separate issue and see if other people have similar circumstances. I think I'm in a niche case until other people come forward with similar needs.
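A sketch of the column-namespacing idea from case 2 above (the `__` separator is just a convention, not a spec feature):

    # Flatten a nested object field into "__"-namespaced columns.
    def flatten(obj, prefix=""):
        columns = {}
        for key, value in obj.items():
            name = f"{prefix}__{key}" if prefix else key
            if isinstance(value, dict):
                columns.update(flatten(value, name))
            else:
                columns[name] = value
        return columns

    print(flatten({"kMandarin": {"zh-Hans": "qiū", "zh-Hant": "qiū"}}))
    # {'kMandarin__zh-Hans': 'qiū', 'kMandarin__zh-Hant': 'qiū'}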

rufuspollock commented 7 years ago

@akariv and @pwalsh, any thoughts here, especially given the connection with #409 and #410 (table type)?

@tony @workergnome this is great input. I'm getting the sense we basically have either a mini nested CSV again, or an actual JSON-style object within the cell.

pwalsh commented 7 years ago

The common case of multi-value cells is not exactly the same as #409 and #410.

In my experience, multi-value cells are about a special syntax for delimiting values, and importantly, that syntax is not simply an array represented as a string.

this:

    john; jim

and not this:

    ["john", "jim"]

akariv commented 7 years ago

My suggestion is to amend the array type by allowing multiple formats:
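Roughly along these lines (a sketch only; the `format` and `delimiter` property names are illustrative assumptions, not an agreed spec):

    # Two representations of the same logical array field.
    field_default = {
        "name": "authors",
        "type": "array",          # cell contains: ["john", "jim"]
    }
    field_delimited = {
        "name": "authors",
        "type": "array",
        "format": "delimited",    # cell contains: john; jim
        "delimiter": ";",
    }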

Note that #409 and #410 are completely orthogonal to this issue - both the itemType constraint and the new table type would work just as well with either of the above formats.

rufuspollock commented 4 years ago

I think @akariv's suggestion is ripe to become a pattern. @workergnome would you be up for submitting a pattern PR for this?

workergnome commented 4 years ago

Apologies, I'm about three years into a new job at this point and unlikely to have time to contribute to this in the near future.

rufuspollock commented 4 years ago

@lauragift21, could you look into working this up as a pattern? I can guide you, and it's good practice.

> Apologies, I'm about three years into a new job at this point and unlikely to have time to contribute to this in the near future.

@workergnome - no problem, thanks for the original suggestions, and hope the new job is going well!

lauragift21 commented 4 years ago

@rufuspollock Sure, I can. I'll take a look and get up to speed on the previous comments.

workergnome commented 4 years ago

Thanks, @rufuspollock! Glad to see this moving forward, and will hopefully be able to re-engage once we're further along with our data export infrastructure here at Getty.

Yomguithereal commented 3 years ago

I concur that having multi-valued fields in CSV files is quite a common and useful practice when representing straightforward 1-to-n relationships. It would be valuable to me and my colleagues to be able to declare them in a datapackage, as right now we rely on the `pattern` constraint to validate them, without being able to rely on enums. So if I can help in any way, don't hesitate to tell me.
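To illustrate that workaround (field name and values made up), the `pattern` constraint has to encode the whole delimited set as one regex:

    import re

    # Validate a semicolon-delimited set with a regex pattern constraint,
    # since enum can't be applied per item.
    field = {
        "name": "colors",
        "type": "string",
        "constraints": {"pattern": "^(red|green|blue)(;(red|green|blue))*$"},
    }

    pattern = re.compile(field["constraints"]["pattern"])
    print(bool(pattern.match("red;blue")))    # True
    print(bool(pattern.match("red;purple")))  # False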

roll commented 6 months ago

The discussion was continued here - https://github.com/frictionlessdata/specs/issues/736

Support was added in frictionless-py, and we are considering adding it as part of the specs in v2 (see #736).