EDIorg / ECC

ECC = EML Congruence Checker
5 stars 0 forks source link

may we add truncated date representations to dateTimeFormatString_list.csv #32

Closed srearl closed 2 years ago

srearl commented 2 years ago

@clnsmth - I have run into a situation where an investigator has submitted a dataset that includes a month field (as MM). This is not a supported dataTimeFormat in PASTA+ but does seem to be included in the ISO 8601 standard (scroll to "Truncated representations" at https://en.wikipedia.org/wiki/ISO_8601). In my case, the format for month would be --MM. Do you think that we could add these truncated representations to https://github.com/EDIorg/ECC/blob/master/practices/dateTimeFormatString/dateTimeFormatString_list.csv ?

cc @amoeba

srearl commented 2 years ago

quoting Matt Jones from a conversation in Slack about this issue...

I think the challenge there is that a column with a 2 digit month in it is not actually a date value in the calendar sense without knowing the year it is attached to. In this case, is the year fixed, or in another column? And is the “month” value an integer (e.g., 3) or a string (e.g., ‘03’)? We’ve discussed the difficulties of this situation a lot, and in the past the consensus was to type the column appropriately for the type of data it contains, and acknowledge that creating a dat-time value from it is not straightforward (it depends on properly casting the value to string, zero padding it, and appending it to the year value from another place). That is much more than a format code is meant to indicate. So, in short, MM is not really a date.

clnsmth commented 2 years ago

Thanks for this question @srearl. I don't have an answer, but do have some thoughts, and definitely want hear what others think.

What does the EML schema say?

_The dateTime field is used for defining the characteristics of the attribute if it contains date and time values. DateTime is used when the values fall on the Gregorian calendar system. More ..._

I don't see anything in the definition that strictly precludes MM as a dateTime. However, we run into ambiguity issues when interpreting the MM dateTime as interval or ordinal values across years of the Gregorian Calendar.

What does ISO say?

_ISO 8601:2000 allowed truncation (by agreement), where leading components of a date or time are omitted. More ..._

However, it goes on ...

This provision was removed in ISO 8601:2004.

So ... is it no longer accepted?

What datetime parsers handle MM?

I'm not aware of any languages that parse MM into a meaningful datetime representation without making assumptions about a reference year and day.

clnsmth commented 2 years ago

@srearl Two advantages of classifying MM as a dateTime in ECC would be to constrain it's type as a "character" string and to verify the zero padding.

srearl commented 2 years ago

Thank you, @clnsmth. Those would be benefits for sure. I definitely see some pluses to this, but, based on input from Matt and Gastil on Slack, I am rethinking this and worry now that including MM or some variation as an acceptable format might lend the impression that month alone is in fact a date. It is frustrating as investigators so often parse year and month.

mobb commented 2 years ago

@srearl can we have an example of the dataset in question? a docid is fine. thx

srearl commented 2 years ago

@mobb - Here is a probably a great case for using metabase as I could probably pull that up in no time. Lacking that, included here are data and corresponding metadata that I am working on currently that prompted the issue (but I encounter month as a parsed date field provided by investigators quite often).

|
  neighborhood:
    attributeName: neighborhood
    attributeDefinition: 'neighborhood name'
    columnClasses: factor
  site_id:
    attributeName: site_id
    attributeDefinition: 'unique parcel identifier'
    columnClasses: character
    definition: ''
  stories:
    attributeName: stories
    attributeDefinition: 'number of stories in the house'
    columnClasses: factor
  house_sqft:
    attributeName: house_sqft
    attributeDefinition: 'house interior square footage'
    unit: 'footSquared'
    numberType: natural
    minimum: 594.0
    maximum: 5646.0
    columnClasses: numeric
  year_built:
    attributeName: year_built
    attributeDefinition: 'construction year'
    columnClasses: Date
    formatString: YYYY
  last_sale_price:
    attributeName: last_sale_price
    attributeDefinition: 'price of last sale (if home has sold)'
    unit: 'number'
    numberType: natural
    minimum: 20000.0
    maximum: 1250000.0
    columnClasses: numeric
  last_sale_month:
    attributeName: last_sale_month
    attributeDefinition: 'month of last sale (if home has sold)'
    columnClasses: Date
    formatString: MM
  last_sale_year:
    attributeName: last_sale_year
    attributeDefinition: 'year of last sale (if home has sold)'
    columnClasses: Date
    formatString: YYYY
  front_yard_area_m2:
    attributeName: front_yard_area_m2
    attributeDefinition: 'estimated area of the front yard'
    unit: 'meterSquared'
    numberType: natural
    minimum: 50.0
    maximum: 1165.0
    columnClasses: numeric

692_parcel_data.csv

mobb commented 2 years ago

Hi @srearl - thanks for the example. It looks like you have already resolved this to some extent, and agree that MM alone is not an acceptable date. The recommendation is to types these fields as strings. An in this case, you could add to the definition some text to indicate that "MM within the year in column 'last_sale_year'"

Another option in this case is to add another column with YYYY and MM concatenated; both YYYYMM and YYYY-MM are in the list of PASTA's preferred date formats. That way you would have both what the scientist submitted (2 columns, YYYY and MM) and a column holding the value of the year-and-month in which the house sold, that is actually typed as a date.