digital-preservation / csv-schema

CSV Schema
http://digital-preservation.github.io/csv-schema
Mozilla Public License 2.0

Ability to make column headers optional #16

Closed lightswitch05 closed 2 years ago

lightswitch05 commented 7 years ago

It appears the @optional directive allows the values for a column to be empty, but there doesn't seem to be a way to make the entire column header optional.

This feature would be very useful when there are many subsets of CSVs. With the ability to define an entire column as optional, you could create a single superset schema that validates each subset of available columns.

For example: I might initially have a v1 CSV defined as:

version 1.1
"First Name":
"Last Name":

Later, it's determined we would like to receive more information (v2), but to stay backwards compatible, v1 CSVs are still accepted:

version 1.1
"First Name":
"Last Name":
"Middle Name": @optional

Now I have a mixture of CSVs - and no single schema that can validate them all. v1 CSVs fail v2 validation because the "Middle Name" column is missing, even though it is marked optional. v2 CSVs fail the v1 schema because they have an extra, unknown column.
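
For concreteness (header rows plus one sample row; the data is illustrative, not taken from a real submission), the two shapes of incoming file are:

"First Name","Last Name"
"John","Doe"

and

"First Name","Last Name","Middle Name"
"John","Doe","Edward"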

Proposed solution:

version 1.2
"First Name":
"Last Name":
"Middle Name": @optionalColumn

Making the entire column definition optional would allow a single schema to validate both v1 and v2 of my CSVs.

DavidUnderdown commented 7 years ago

I can see how this would be useful; it would need a bit of thinking as to how it fits in with the global directives such as @totalColumns - we might also want a new @maxColumns directive there to cater for this scenario.
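
A very rough sketch of how that might read (purely speculative syntax: neither @optionalColumn nor @maxColumns exists in any published version of the language, and the interaction with the existing @totalColumns directive would still need working out):

version 1.2
@maxColumns 3
"First Name":
"Last Name":
"Middle Name": @optionalColumn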

As you'll see, I've tagged this for 1.2, but we don't have the resources right now to work on it actively. If you want to have a go at working up an addition to the draft 1.2 spec, we're happy to receive pull requests.

lightswitch05 commented 7 years ago

Implemented in #17.

adamretter commented 7 years ago

I don't think @optionalColumn can be implemented, as it creates an ambiguity in the parser.

lightswitch05 commented 7 years ago

@adamretter The only way it differs from @optional is in not requiring the column header for an otherwise empty column.
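
A minimal side-by-side sketch of the difference (the @optionalColumn line is the proposal under discussion, not existing syntax; the /* ... */ annotations are explanatory comments only):

"Middle Name": @optional        /* existing: the column must be present in the file, but its values may be empty */
"Middle Name": @optionalColumn  /* proposed: the whole column, header and all, may be absent; if present, its other rules still apply */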

adamretter commented 7 years ago

Yes, I understand. However, it is still ambiguous.

lightswitch05 commented 7 years ago

I have to disagree. Ambiguity is introduced by poorly defined requirements. What is ambiguous about a column being optional? In a schema, if you mark a column as optional, then you accept that your output may not have that column defined. If it is defined, all other validations apply. Since the goal of this project is to define a CSV schema, let's take a look at some other schema projects.

JSON Schema

JSON Schema defaults everything to be optional unless it is specifically marked as required.

Example JSON Schema

{
    "properties": {
        "firstName": {
            "type": "string"
        },
        "lastName": {
            "type": "string"
        },
        "middleName": {
            "type": "string"
        }
    },
    "required": [
        "lastName",
        "firstName"
    ],
    "type": "object"
}

Valid JSON

{
  "firstName": "John",
  "lastName": "Doe" 
}

{
  "firstName": "Donald",
  "lastName": "Trump",
  "middleName": "John"
}

Invalid JSON

{
  "firstName": "John",
  "lastName": "Doe",
  "middleName": null
}

Please note how the JSON Schema allows "middleName" to be omitted - but does NOT allow it to be null. This type of validation is not possible in CSV Schema without @optionalColumn support.

XML schema (XSD)

XSD uses minOccurs="0" to define optional elements.

Example XSD

<?xml version="1.0" encoding="UTF-8" ?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
    <xs:element name="person">
        <xs:complexType>
            <xs:sequence>
                <xs:element name="firstName" type="xs:string"/>
                <xs:element name="lastName" type="xs:string"/>
                <xs:element name="middleName" minOccurs="0" type="NonEmptyString"/>
            </xs:sequence>
        </xs:complexType>
    </xs:element>

    <xs:simpleType name="NonEmptyString">
        <xs:restriction base="xs:string">
            <xs:minLength value="1"/>
        </xs:restriction>
    </xs:simpleType>
</xs:schema>

Valid XML

<person>
    <firstName>Donald</firstName>
    <lastName>Trump</lastName> <!-- middleName is not provided, but it is still valid -->
</person>
<person>
    <firstName>Donald</firstName>
    <lastName>Trump</lastName>
    <middleName>John</middleName>
</person>

Invalid XML

<person>
    <firstName>Donald</firstName>
    <lastName>Trump</lastName>
    <middleName /> <!-- invalid: the element is provided but empty -->
</person>

SQL

SQL is probably the most apt comparison. All columns are nullable by default, and you are not required to supply every column in every insert:

Example SQL table

CREATE TABLE people (
    firstName varchar(255) NOT NULL,
    lastName varchar(255) NOT NULL,
    middleName varchar(255) CHECK (middleName <> '') -- allow null but not empty
);

Valid Inserts

INSERT INTO people (firstName, lastName) values ('Donald', 'Trump');
INSERT INTO people (firstName, lastName, middleName) values ('Donald', 'Trump', null);
INSERT INTO people (firstName, lastName, middleName) values ('Donald', 'Trump', 'John');

Invalid Inserts

INSERT INTO people (firstName, lastName, middleName) values ('Donald', 'Trump', '');

Closing comments

Finally, I recognize this project's goal is to define a CSV as strictly as possible. Providing @optionalColumn does not take away from that goal. The schema language allows for some extremely concise definitions, which is great, but that doesn't make something as generic as notEmpty ambiguous - it is perfectly well defined as "not empty". The same can be said for @optionalColumn.

@optionalColumn gives users the ability to describe two different CSVs that are functionally the same with a single schema. Why can I not represent these two files using a single schema?

"firstName","lastName","middleName"
"Donald","Trump",""

"firstName","lastName"
"Donald","Trump"

adamretter commented 7 years ago

@lightswitch05 The ambiguity comes from matching the columns in the CSV Schema to the columns in the CSV file. If you have one @optionalColumn, that might work (but only in some cases); if you have more than one, it likely becomes impossible.

This has nothing to do with JSON, XML or SQL. Please keep in mind that with those formats the structure of the data is known in advance. With CSV the structure is unknown.

I think it might help if I try to explain how the implementation works, and thereby demonstrate why I think @optionalColumn is unimplementable:

Consider the following Schema:

version 1.2
1:
2:
3: @optionalColumn
4: @optionalColumn
5:
6: @optionalColumn
7:

When the implementation has to match this against a CSV file, columns 5, 6, and 7 are now ambiguous. Actually, anything after column 3 is ambiguous, but I think this example pushes the point further.

The processor has no way of knowing under what circumstances schema column (5) should match which CSV column: it could match CSV column (3), (4), or (5) depending on the shape of the CSV. You might even have a CSV where you think schema column (5) is matching CSV column (4), but due to data errors (which the validator would not highlight in this case) it is actually matching CSV column (3). You are therefore likely to introduce subtle validation errors that were not possible before.
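
Spelling this out with a concrete case (the worked example is mine, not from the original comment): if a file arrives with five data columns c1..c5, the schema above has four required columns (1, 2, 5, 7) and three optional ones (3, 4, 6), so exactly one of the optional columns must be present - but any of the three will do:

CSV column:   c1  c2  c3  c4  c5
match A:       1   2   3   5   7   (schema columns 4 and 6 taken as absent)
match B:       1   2   4   5   7   (schema columns 3 and 6 taken as absent)
match C:       1   2   5   6   7   (schema columns 3 and 4 taken as absent)

The processor has no principled way to choose between them.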

I can see that you might argue you should only use @optionalColumn with named columns in the CSV Schema. However, I would point out here that there is nothing in the CSV Schema spec that says the column labels in the CSV Schema MUST match the CSV file - and for good reason, as you can in fact use named columns with CSVs that have no header row (using names instead of numbers here helps improve the clarity of any validation error messages shown to the user).
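
For example, the following is already a legitimate schema for a file with no header row (@noHeader and notEmpty are existing constructs; the sample row is mine):

version 1.1
@noHeader
"First Name": notEmpty
"Last Name": notEmpty

It happily validates a headerless CSV such as:

"Donald","Trump"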

If you have the case where you want many similar CSV Schemas, I would simply suggest creating your super data model in something more flexible than CSV Schema, e.g. XML (or RDF); you could then run some very simple XSLT to generate the CSV Schemas you need on demand.

Let me know if that makes sense to you. If not, I can try to explain further...

lightswitch05 commented 7 years ago

> This has nothing to do with JSON, XML or SQL. Please keep in mind that with those formats the structure of the data is known in advance. With CSV the structure is unknown.

This is the point of a schema: to validate the structure - which could be particularly useful for CSVs, since the structure has so many unknowns.

> I can see that you might argue you should only use @optionalColumn with named columns in the CSV Schema. However, I would point out here that there is nothing in the CSV Schema spec that says the column labels in the CSV Schema MUST match the CSV file - and for good reason, as you can in fact use named columns with CSVs that have no header row (using names instead of numbers here helps improve the clarity of any validation error messages shown to the user).

I agree that @optionalColumn would be impossible to support without named headers. However, there are already directives that are mutually exclusive: @noHeader is itself mutually exclusive with the @ignoreColumnNameCase directive. I don't believe an @optionalColumn directive that is likewise mutually exclusive with @noHeader is an unreasonable request.
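
Under that reading, a sketch of the constraint might look like this (again, @optionalColumn is hypothetical; @ignoreColumnNameCase and @noHeader are the existing global directives mentioned above):

version 1.2
@ignoreColumnNameCase
"firstName":
"lastName":
"middleName": @optionalColumn

A schema combining @noHeader with @optionalColumn would simply be rejected, just as @noHeader and @ignoreColumnNameCase cannot be combined today.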

> If you have the case where you want many similar CSV Schemas, I would simply suggest creating your super data model in something more flexible than CSV Schema, e.g. XML (or RDF); you could then run some very simple XSLT to generate the CSV Schemas you need on demand.

I would not be requesting this feature if I had any control over the incoming CSVs. If csv-schema only targets users with complete control over their files, then it is useless to me. I am expecting hundreds of different CSV formats, many of which contain optional columns that may or may not be supplied. If the solution is to define a schema for each possible combination, the number of schemas required to validate the incoming files quickly becomes unmanageable.

adamretter commented 7 years ago

> I agree that @optionalColumn would be impossible to support without named headers.

The point is that the named headers in the CSV Schema are just labels; they do not imply a match against the header in the CSV file itself. That is important for the flexibility that CSV Schema currently offers.

If you wanted to change it so that headers in the CSV Schema had to match headers in the CSV file, then you would need to do some more significant work first. After such a change, if it were acceptable to all parties, you could consider introducing an @optionalColumn directive. However, with the current spec, and with how parsing and validation are carried out against the CSV Schema, @optionalColumn is impossible without introducing ambiguity.

> I would not be requesting this feature if I had any control over the incoming CSVs.

You don't need control over the incoming CSVs. You only need control over generating the CSV Schemas, which I imagine must be within your remit ;-)

lightswitch05 commented 7 years ago

So the reference implementation is the blocker here and not the spec itself?

lightswitch05 commented 7 years ago

Dynamic CSV Schema generation would be very problematic, as the schemas themselves would be submitted by the users.

DavidUnderdown commented 7 years ago

@adamretter as of commit b6dd4a7b0e23d3c9515864e2a2375b6bcd42c9f2 we do check header fields against the rule names in CSV Validator (though, agreed, it's not explicit in the schema definition at the moment). Not doing so was causing us problems downstream, and seems counter-intuitive.

lightswitch05 commented 7 years ago

I just stumbled on digital-preservation/csv-validator/issues/134 - it looks like our data is a little too wild for this schema definition, and we'll have to find an alternative solution. Thank you for your time discussing the merits of optional columns.

adamretter commented 7 years ago

@DavidUnderdown Yes, I did see that in the reference implementation. However, the spec does not say that MUST be the case, and for good reason (see above) ;-)

DavidUnderdown commented 2 years ago

Closed as per the above discussion.