frictionlessdata / datapackage

Data Package is a standard consisting of a set of simple yet extensible specifications to describe datasets, data files and tabular data. It is a data definition language (DDL) and data API that facilitates findability, accessibility, interoperability, and reusability (FAIR) of data.
https://datapackage.org
The Unlicense

Promote the Missing Values per Field pattern to the Table Schema spec? #861

Closed: roll closed this issue 7 months ago

roll commented 9 months ago

Overview

Pattern - https://datapackage.org/patterns/missing-values-per-field/

This pattern is already supported by frictionless-py and, in general, is really easy to implement.
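
For readers unfamiliar with the pattern, here is a minimal sketch of the idea in plain Python. It assumes that a field-level `missingValues` list overrides the schema-level one for that field; the descriptor, the sample data, and the `read_rows` helper are hypothetical and are not taken from frictionless-py.

```python
import csv
import io

# Hypothetical descriptor: a schema-level "missingValues" list plus a
# field-level override on "temperature" (the override is the assumption here).
schema = {
    "missingValues": [""],
    "fields": [
        {"name": "id", "type": "integer"},
        {"name": "temperature", "type": "number", "missingValues": ["-999", "NA"]},
        {"name": "comment", "type": "string"},
    ],
}


def read_rows(text, schema):
    """Yield dict rows, replacing missing-value markers with None per field."""
    default_missing = schema.get("missingValues", [""])
    # A field without its own list falls back to the schema-level list.
    missing_by_field = {
        field["name"]: set(field.get("missingValues", default_missing))
        for field in schema["fields"]
    }
    for row in csv.DictReader(io.StringIO(text)):
        yield {
            name: (None if value in missing_by_field[name] else value)
            for name, value in row.items()
        }


data = "id,temperature,comment\n1,21.5,ok\n2,-999,NA\n3,NA,\n"
rows = list(read_rows(data, schema))
```

In this sketch "NA" is treated as missing only in `temperature`, so a `comment` cell containing the literal text "NA" is kept as a string. Type casting is deliberately left out.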

peterdesmet commented 9 months ago

While probably a useful pattern, I don’t think it will be easy to implement in frictionless-r, since it is not supported in its dependency readr::read_delim() (https://readr.tidyverse.org/reference/read_delim.html), which only supports global missing values.

roll commented 9 months ago

@peterdesmet What do you think in general our strategy should be in cases like this?

Quickly checking the status:

Shall we mark this issue as blocked and create (or watch) issues in the backends, or promote it to the specs anyway, since it is really requested in Python (https://github.com/frictionlessdata/specs/issues/551)?


This problem is obviously broader, as some other features are in the same situation, e.g. lineTerminator in Table Dialect.

khusmann commented 9 months ago

In my efforts to implement enumLabels in frictionless-r, I've been creating wrappers for readr that could also be extended to support field-level missing values. So I think this is pretty feasible for R, although it does require a little extra effort.

I think the larger question here re: implementation compatibility with frictionless features is: To what extent do we incorporate features into frictionless that are not natively available across all backends, and that implementations would have to build adapters for?

If we only support the "lowest common denominator" of the features available across implementations, I think that puts us in an overconstrained spot. Instead, I think the spec should try to reflect an encoding that makes sense for the data, independent of the backend / implementation. Because field-level missingness is something that commonly exists in data, I think it should be included in the spec, and then it should be up to the implementation (e.g. frictionless-py, frictionless-r, etc.) to do its best to make those data available in whatever form the backend (e.g. petl, polars, readr) supports. (Or throw an informative error if the feature has not been implemented or adapted for that backend, or a warning if information is being lost in the conversion.)
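
As a rough illustration of the "informative error or warning" behaviour, an implementation could check a descriptor against what its backend can honor before reading. This is only a sketch under stated assumptions: the feature names, `BACKEND_FEATURES`, `UnsupportedFeatureError`, and `check_schema` are all hypothetical and do not exist in any frictionless library.

```python
import warnings


class UnsupportedFeatureError(NotImplementedError):
    """Raised when a descriptor uses a property the backend cannot honor."""


# Hypothetical capability set of a backend that only knows global missing values.
BACKEND_FEATURES = {"schema.missingValues"}

# Hypothetical descriptor with a field-level "missingValues" property.
example_schema = {
    "missingValues": [""],
    "fields": [
        {"name": "id", "type": "integer"},
        {"name": "temperature", "type": "number", "missingValues": ["-999"]},
    ],
}


def check_schema(schema, features=BACKEND_FEATURES, strict=False):
    """Return the names of fields whose field-level missingValues would be ignored.

    Warns by default; raises UnsupportedFeatureError when strict is True.
    """
    if "field.missingValues" in features:
        return []
    affected = [f["name"] for f in schema.get("fields", []) if "missingValues" in f]
    if not affected:
        return []
    message = (
        "field-level missingValues is not supported by this backend; "
        "affected fields: " + ", ".join(affected)
    )
    if strict:
        raise UnsupportedFeatureError(message)
    warnings.warn(message, UserWarning, stacklevel=2)
    return affected
```

Calling `check_schema(example_schema)` would warn that `temperature` loses information, while `check_schema(example_schema, strict=True)` would refuse to read the data at all. Which default is preferable is a design choice left to each implementation.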

The challenge with this, of course, is that it means we'd have to keep track of support matrices for frictionless props across implementations (I'm imagining something like browser compatibility lists in web standards).
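
One possible shape for such a support matrix, sketched in Python: a mapping from property to per-implementation status that can be rendered as a table. The property key and the `support_table` helper are hypothetical, and the single entry only mirrors what is said in this thread; it is not an authoritative compatibility list.

```python
# Illustrative entry only, based on statements in this thread.
SUPPORT = {
    "field.missingValues": {"frictionless-py": "yes", "frictionless-r": "no"},
}


def support_table(matrix):
    """Render the matrix as a Markdown table, one row per property."""
    implementations = sorted({impl for row in matrix.values() for impl in row})
    lines = ["| property | " + " | ".join(implementations) + " |"]
    lines.append("|" + " --- |" * (len(implementations) + 1))
    for prop in sorted(matrix):
        # Implementations with no recorded status are shown as "unknown".
        cells = [matrix[prop].get(impl, "unknown") for impl in implementations]
        lines.append("| " + prop + " | " + " | ".join(cells) + " |")
    return "\n".join(lines)


table = support_table(SUPPORT)
```

Keeping the matrix as data rather than prose would let each implementation update its own column and let the documentation be generated from it.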

peterdesmet commented 8 months ago

I was hesitant about supporting this (since it’s not straightforward in R), but then I realized frictionless-r already doesn’t support all options when reading data (search for “not supported” in https://docs.ropensci.org/frictionless/reference/read_resource.html). So I think it is OK to add properties without (immediate) support, if they have real use cases and are carefully considered. I can’t assess if that is the case here.

I think a compatibility list is a good idea.