digital-preservation / csv-schema

CSV Schema
http://digital-preservation.github.io/csv-schema
Mozilla Public License 2.0

ISO8601 is more than just YYYY-MM-DD #25

Open afranke opened 5 years ago

afranke commented 5 years ago

YYYY and YYYY-MM should also be accepted as valid for xDate fields.

DavidUnderdown commented 5 years ago

Though they are valid under ISO8601, by my reading they are not valid under XML schema definition of dateTime, which is why we chose the name xDate, to make it clear we were following that definition (and because our use case required a full date). So while I take your point that a wider range of date format options may well be appropriate, I think they need to be new options rather than changing the definition of xDate.

From our point of view as an archive, the features of the Extended Date/Time Format, which are apparently expected to be included in the ISO8601 revision this year, are likely to be even more useful, as very often our dating is necessarily imprecise.

afranke commented 5 years ago

> Though they are valid under ISO8601, by my reading they are not valid under XML schema definition of dateTime, which is why we chose the name xDate

Ok, that’s fair. Although the spec says “as a valid XML Schema dateTime data type”, it also adds “(see [XMLSCHEMA-2] and [ISO8601])”, and that made me assume any ISO8601 date would work, so I didn’t look up the XML Schema format. Maybe ISO8601 shouldn’t be referred to here?

> while I take your point that a wider range of date format options may well be appropriate, I think they need to be new options rather than changing the definition of xDate.

Sure, I’m fine with any option. ☺

For what it’s worth, in my current use case I need a way to specify that the column can be any of YYYY, YYYY-MM, or YYYY-MM-DD, each row potentially following any of these. I ended up specifying it as regex("^\d{4}(-\d\d(-\d\d)?)?$") which does the job but is not great.
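As an illustration outside CSV Schema itself, here is a minimal Python sketch of what that pattern accepts. The pattern is copied unchanged from the comment above; the helper name `is_loose_date` and the sample values are hypothetical. It only checks the shape of the value, so out-of-range months and days still pass, which is probably part of why it is "not great".

```python
import re

# The pattern from the comment above, unchanged.
LOOSE_DATE = re.compile(r"^\d{4}(-\d\d(-\d\d)?)?$")


def is_loose_date(value: str) -> bool:
    """Return True if value is shaped like YYYY, YYYY-MM or YYYY-MM-DD."""
    return LOOSE_DATE.match(value) is not None


# Hypothetical sample values, chosen to show both accepted and rejected shapes.
samples = ["2019", "2019-04", "2019-04-30", "2019-4", "2019-13", "2019-04-31"]
results = {s: is_loose_date(s) for s in samples}

for value, accepted in results.items():
    print(f"{value!r}: {'accepted' if accepted else 'rejected'}")
```

Only "2019-4" is rejected here; "2019-13" and "2019-04-31" are accepted because the pattern never looks at the numeric ranges.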

DavidUnderdown commented 5 years ago

It's there mostly because XML Schema 2 also refers to it, and people are broadly familiar with ISO8601, at least as meaning year first, but there might be a slightly better way of expressing that.

I think I'd test with regex("^\d{4}(-(0[1-9]|1[0-2]))?$") or xDate. At least then, if it is a fully specified date, you'll know it really is a valid date, and you'd only get months in the correct range too. If there's a sensible range for the year, specify the permitted digits for that a bit more closely too.
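The same "year or year-month by pattern, otherwise a real calendar date" idea can be sketched in Python. This is only an approximation under stated assumptions: the function name `is_partial_or_full_date` is hypothetical, and the full-date branch uses Python's `datetime.date` as a stand-in for xDate, so it ignores details of the XML Schema type such as time zone suffixes.

```python
import re
from datetime import date

# Year alone, or year plus a month in the range 01-12.
YEAR_OR_YEAR_MONTH = re.compile(r"[0-9]{4}(-(0[1-9]|1[0-2]))?")
# Shape of a full date; the calendar check happens separately below.
FULL_DATE = re.compile(r"([0-9]{4})-([0-9]{2})-([0-9]{2})")


def is_partial_or_full_date(value: str) -> bool:
    """Accept YYYY, YYYY-MM (valid month) or YYYY-MM-DD (real calendar date)."""
    if YEAR_OR_YEAR_MONTH.fullmatch(value):
        return True
    match = FULL_DATE.fullmatch(value)
    if match is None:
        return False
    year, month, day = (int(group) for group in match.groups())
    try:
        # Assumption: datetime.date stands in for the xDate check.
        date(year, month, day)
    except ValueError:
        return False
    return True


for value in ["2019", "2019-12", "2019-13", "2019-02-29", "2020-02-29"]:
    print(value, is_partial_or_full_date(value))
```

With this split, "2019-13" and "2019-02-29" are rejected while "2020-02-29" passes, which the shape-only pattern cannot distinguish.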

Depending on where you're getting your data from, we've often found it better to specify separate columns for the constituent parts of a date where there may be a mixture of formats. That also makes it easier to handle invalid dates, such as 31 April or 29 February in a non-leap year. Clerks of the past wrote down impossible dates surprisingly often, which is a pain when you're handling transcribed data.
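A rough Python sketch of the separate-columns approach, assuming three text columns for year, month and day where an empty string means "not recorded". The function name `check_date_parts` and the returned labels are hypothetical. The point is that an impossible date can be kept as transcribed and flagged, rather than failing as a single malformed field.

```python
import calendar


def _is_ascii_number(text: str) -> bool:
    """True for a non-empty string of ASCII digits only."""
    return text.isascii() and text.isdigit()


def check_date_parts(year: str, month: str = "", day: str = "") -> str:
    """Classify a date held in three columns; '' means the part is not recorded."""
    if not (_is_ascii_number(year) and len(year) == 4):
        return "invalid year"
    if month == "":
        return "ok" if day == "" else "day without month"
    if not _is_ascii_number(month) or not 1 <= int(month) <= 12:
        return "invalid month"
    if day == "":
        return "ok"
    if not _is_ascii_number(day):
        return "invalid day"
    # monthrange returns (weekday of the 1st, number of days in the month).
    days_in_month = calendar.monthrange(int(year), int(month))[1]
    if not 1 <= int(day) <= days_in_month:
        return "impossible date"
    return "ok"


print(check_date_parts("1850", "04", "31"))  # 31 April
print(check_date_parts("1851", "02", "29"))  # 29 February in a non-leap year
print(check_date_parts("1852", "02", "29"))  # 29 February in a leap year
```

The first two calls report an impossible date and the third is fine, so rows like these can be reviewed without losing what the clerk actually wrote.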

DavidUnderdown commented 5 years ago

Actually, even better might be e.g. range(1990, 2019) or regex("^\d{4}-(0[1-9]|1[0-2])$") or xDate