frictionlessdata / datapackage

Data Package is a standard consisting of a set of simple yet extensible specifications to describe datasets, data files and tabular data. It is a data definition language (DDL) and data API that facilitates findability, accessibility, interoperability, and reusability (FAIR) of data.
https://datapackage.org
The Unlicense

Enum constraint with arrays in table schema #549

Closed jungshadow closed 4 years ago

jungshadow commented 6 years ago

I'm using an array field with an enum constraint that consists of an array of strings:

...
{
    "name": "ApplicationRequestStatusType",
    "type": "array",
    "title": "Application Request Status",
    "description": "Specifies the current status of the application. …",
    "constraints": {
        "required": true,
        "enum": [
            "duplicate",
            "invalid",
            "missing-ssn",
            "missing-state-id-number",
            "pending",
            "valid"
        ]
    }
}
...

I'm trying to save a datapackage using Package, but I'm running into validation errors. The error being thrown is: Field "ApplicationRequestStatusType" can't cast value "duplicate" for type "array" with format "default".

From my Gitter conversation with @roll:

@jungshadow It's interesting that the specs say the enum constraint is applicable to an array field type. But as an implementer I'm confused here. It should work now, but not the way you expect: because it uses a general approach (as for other types), every enum item should be an array. And for users I think this behaviour doesn't really make sense if it's not clarified in the specs. @rufuspollock should the specs specify a special approach for treating an enum constraint for arrays/objects? Or perhaps it should just be a different constraint like constraints.itemEnum? It's also related to the typed arrays discussion.

I believe I'm implementing this correctly and it looks like both types of enum constraints (e.g. array of strings and array of arrays) are supported based on the table-schema.json file:

              "constraints": {
                "title": "Constraints",
                "description": "The following constraints apply for `array` fields.",
                "type": "object",
                "properties": {
                  ...
                  },
                  "unique": {
                    ...
                  },
                  "enum": {
                    "oneOf": [
                      {
                        "type": "array",
                        "minItems": 1,
                        "uniqueItems": true,
                        "items": {
                          "type": "string"
                        }
                      },
                      {
                        "type": "array",
                        "minItems": 1,
                        "uniqueItems": true,
                        "items": {
                          "type": "array"
                        }
                      }
                    ]
                  },
                  ...
                }
              },

Is my assessment accurate that my previous snippet is technically correct? My use case is to constrain an array (preferably a set in this case) to a list of potential values (the strings in the example above). Thanks!

roll commented 6 years ago

@jungshadow So the specs say:

enum - The value of the field must exactly match a value in the enum array

Based on this wording, for the array field type the enum constraint should look like:

{
    "name": "ApplicationRequestStatusType",
    "type": "array",
    "title": "Application Request Status",
    "description": "Specifies the current status of the application. …",
    "constraints": {
        "required": true,
        "enum": [
            ["duplicate"],
            ["invalid"],
            ["missing-ssn"],
            ["missing-state-id-number"],
            ["pending"],
            ["valid"]
        ]
    }
}

And valid data values will be:

["duplicate"]
["invalid"]
["missing-ssn"]
["missing-state-id-number"]
["pending"]
["valid"]
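
The exact-match reading above can be sketched in plain Python. This is only an illustration of the semantics, not the tableschema implementation; the helper name satisfies_enum is made up for the example.

```python
# Hypothetical helper: an enum constraint passes only when the WHOLE
# field value equals one of the enum entries.
def satisfies_enum(value, enum):
    return value in enum


enum = [
    ["duplicate"],
    ["invalid"],
    ["missing-ssn"],
    ["missing-state-id-number"],
    ["pending"],
    ["valid"],
]

single = satisfies_enum(["pending"], enum)                   # one listed array
combined = satisfies_enum(["missing-ssn", "pending"], enum)  # not listed as a whole

print(single, combined)
```

Under this reading a single-item array passes, while an array holding two listed strings fails because that exact array is not one of the enum entries.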

Could you share your data package to better understand the use case?

jungshadow commented 6 years ago

Thanks for the update, @roll!

So the specs say:

enum - The value of the field must exactly match a value in the enum array

I admit, though, that the structure is a bit confusing. First, does the file I mentioned above, table-schema.json in tableschema/profiles/, contradict the spec (assuming I'm reading it correctly)? Second, I can see having the strings wrapped in arrays if you'd like to have specific groups of items as potential values. For example (NB: the following assumes that an application request can be a duplicate, can have a missing SSN and be pending, can be generally invalid, can be missing an SSN and a state identifier, can be pending, and can be valid):

{
    "name": "ApplicationRequestStatusType",
    "type": "array",
    "title": "Application Request Status",
    "description": "Specifies the current status of the application. …",
    "constraints": {
        "required": true,
        "enum": [
            ["duplicate"],
            ["missing-ssn", "pending"],
            ["invalid"],
            ["missing-ssn", "missing-state-id-number"],
            ["pending"],
            ["valid"]
        ]
    }
}

(NB: worth noting that the above scenario isn't indicative of real state policy, just a hypothetical example)

If any combination of the enum items is acceptable, wrapping every string value in an array feels cumbersome, but maybe I'm missing a key component of the thought process.

Could you share your data package to better understand the use case?

Certainly. The datapackage is here and the documentation for the various fields is here.

roll commented 6 years ago

Based on this description the type of the field could be a string. And if it's a string the initial enum will work.

The value of ApplicationRequestStatusType must be one of the following:

duplicate
invalid
mismatch-voter-signature
missing-ssn
missing-state-id-number
missing-voter-signature
pending
valid
other

jungshadow commented 6 years ago

@roll I apologize if I'm not being entirely clear (and it looks like I need to update the description in the documentation, so thanks for pointing that out). I want the ability to capture multiple values from the enum, so the array field type is the most appropriate for my use case, but I don't want to limit to only certain combinations of values. My use case above was only to illustrate how I thought wrapping the enum strings in arrays would work, not that I need it to work as such.

roll commented 6 years ago

@jungshadow No worries. I was also thinking that multiple values are probably needed. The problem is that for now the specs don't cover this use case. enum is a constraint on the field value as a whole, and for your use case we need a constraint on each item inside the field value. So it could probably be something like constraints.itemEnum.
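
To make the difference concrete, here is a minimal sketch of what a per-item constraint might check. Note that itemEnum is only a name floated in this thread and is not part of the Table Schema spec; the function name below is likewise hypothetical.

```python
# Hypothetical per-item check: every item of the array must appear in the
# allowed list. "itemEnum" is NOT an existing Table Schema constraint.
def satisfies_item_enum(value, item_enum):
    allowed = set(item_enum)
    return all(item in allowed for item in value)


item_enum = [
    "duplicate",
    "invalid",
    "missing-ssn",
    "missing-state-id-number",
    "pending",
    "valid",
]

ok = satisfies_item_enum(["missing-ssn", "pending"], item_enum)
bad = satisfies_item_enum(["missing-ssn", "unknown"], item_enum)

print(ok, bad)
```

With this sketch any combination of listed strings passes, and an empty array passes too, so a real design would still need required (or a minimum length) to rule that out.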

jungshadow commented 6 years ago

The problem is that for now the specs don't cover this use case.

Bummer. As a stop-gap, if I wrap all the enum strings as arrays, as in your first example above, will that accomplish what I'm looking to do? As an example, will the following row validate (see the ApplicationRequestStatusType value, [missing-ssn,missing-voter-signature])?

'da347b903c8dcb62...','47673a7d20346b72...','','2016-03-14', '2016-03-15','[missing-ssn,missing-voter-signature]','online','untracked','','','2016-10-06','untracked','other','2016-09-16','','mail','','2016-11-08','2016 General Election','55-31000','fips','City Of Green Bay','Wisconsin','United States','military','False'

If so, great! Personally, I'm still advocating for the separation of duties I outlined above (i.e. enum array of strings for any combinations of values and an enum array of arrays for particular combinations of values), but I'm happy with a working solution for now. Thanks for all the help, @roll!

roll commented 6 years ago

@jungshadow No. It will work only for single values in the ApplicationRequestStatusType field. E.g. for the provided row the enum constraint should also contain [missing-ssn,missing-voter-signature].
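
For a sense of what listing every combination would involve, the sketch below generates all non-empty combinations of the nine status values quoted from the documentation earlier in the thread. It is an illustration only (the helper name is made up), and it assumes items always appear in one fixed order, since arrays compare item by item.

```python
from itertools import combinations

# Status values as listed in the documentation quoted earlier in the thread.
statuses = [
    "duplicate",
    "invalid",
    "mismatch-voter-signature",
    "missing-ssn",
    "missing-state-id-number",
    "missing-voter-signature",
    "pending",
    "valid",
    "other",
]


def all_combinations(values):
    """Every non-empty combination, keeping the order given by `values`."""
    return [
        list(combo)
        for size in range(1, len(values) + 1)
        for combo in combinations(values, size)
    ]


enum = all_combinations(statuses)

print(len(enum))                                           # 2**9 - 1 entries
print(["missing-ssn", "missing-voter-signature"] in enum)  # listed order
print(["missing-voter-signature", "missing-ssn"] in enum)  # reversed order
```

Nine values already produce 511 enum entries, and a value whose items come in a different order still fails the exact match, which is why this only works as a stop-gap.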

jungshadow commented 6 years ago

@roll That's a bummer. This project is somewhat of a fact-finding mission. Currently, we're not sure how many or which of the enum values would invalidate a ballot application or a ballot. Is there some way that y'all would consider my use case for inclusion into the spec?

roll commented 6 years ago

@jungshadow As a quick fix off the top of my head, the only idea I have is to store this field as a string and use a pattern constraint like:

name: ApplicationRequestStatusType
type: string
constraints:
  pattern: '^((missing-ssn|missing-voter-signature|...)\|?)+$' # not tested

With data like:

ApplicationRequestStatusType
missing-ssn
missing-ssn|missing-voter-signature
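
Since the pattern above is marked as not tested, here is a quick way to try its shape with Python's re module. The list of values is taken from the documentation quoted earlier in the thread; the stricter variant is an assumption added for comparison, not something proposed in the thread, and Python's regex dialect may differ in details from the one a Table Schema implementation uses.

```python
import re

# Values from the documentation quoted earlier in the thread.
statuses = [
    "duplicate",
    "invalid",
    "mismatch-voter-signature",
    "missing-ssn",
    "missing-state-id-number",
    "missing-voter-signature",
    "pending",
    "valid",
    "other",
]
alternatives = "|".join(re.escape(s) for s in statuses)

# Same shape as the suggested pattern: each value followed by an optional "|".
loose = re.compile(rf"^(({alternatives})\|?)+$")

# Stricter variant (an assumption): "|" is required between values and is
# not allowed at the end.
strict = re.compile(rf"^({alternatives})(\|({alternatives}))*$")

samples = [
    "missing-ssn",
    "missing-ssn|missing-voter-signature",
    "missing-ssn|",        # trailing separator
    "missing-ssnpending",  # two values with no separator
]
for value in samples:
    print(value, bool(loose.fullmatch(value)), bool(strict.fullmatch(value)))
```

Both patterns accept the two sample rows above; the loose shape also accepts a trailing separator and values run together without one, which the stricter variant rejects.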

Regarding possible spec changes, I'm cc'ing @pwalsh and @rufuspollock here.

jungshadow commented 6 years ago

I appreciate the quick fix, @roll! This isn't a knock on the fix and maybe I'm being overly pedantic, but this functionally seems more like an array than a string. Interested in @pwalsh's and @rufuspollock's thoughts on this, too.

akariv commented 6 years ago

This is a perfect application of the 'itemType' property I suggested here: https://github.com/frictionlessdata/specs/issues/409

So - you'd have an array, and you'll be able to state the inner item type (in this case, a string with an enum constraint).

jungshadow commented 6 years ago

@akariv Sounds like that would work perfectly. I noticed @rufuspollock mentioned it may go into 1.1. Is there a rough timeline for it?

Thinking aloud, I'm still wondering about the usefulness of a set type (or constraint) that's distinct from array. The former would allow only distinct objects while the latter would allow repeated objects. My use case would benefit from the former.
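
As a rough illustration of the set-versus-array distinction, the check below treats a value as set-like when no item is repeated, similar in spirit to the uniqueItems keyword that appears in the table-schema.json excerpt earlier in the thread. The function name is hypothetical, and the sketch assumes the items are hashable (plain strings here).

```python
# Hypothetical check: a "set" value is an array with no repeated items.
def is_set_like(value):
    return len(value) == len(set(value))


distinct = is_set_like(["missing-ssn", "pending"])
repeated = is_set_like(["missing-ssn", "missing-ssn"])

print(distinct, repeated)
```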

jungshadow commented 6 years ago

Hi @roll, @rufuspollock, and @pwalsh! Wanted to quickly bump this thread. Any ideas on this or a possible timeline for #409, which may similarly work for my purposes? Thanks!

rufuspollock commented 6 years ago

@jungshadow right now we don't have an ETA on new features for 1.1 but my guess is H2 this year (the real constraint is resourcing work on this). What really helps us is:

jungshadow commented 6 years ago

@rufuspollock I completely understand and I'd be happy to do either of the above, but it would be helpful to have some idea on which implementation to focus. Would it be more helpful to keep it narrowly focused on my use case or should I concentrate on the idea in #409?

jungshadow commented 6 years ago

Considering the discussion in frictionlessdata/tableschema-js#152, I wanted to quickly bump this, again. Is #409 the recommended approach or should we carve out a case for this particular issue? /cc @roll @rufuspollock

roshcagra commented 5 years ago

Any update on adding this?

rufuspollock commented 4 years ago

DUPLICATE / MERGING. Closing in favour of #409 since doing that appropriately would resolve this.