Closed srearl closed 2 years ago
quoting Matt Jones from a conversation in Slack about this issue...
I think the challenge there is that a column with a 2 digit month in it is not actually a date value in the calendar sense without knowing the year it is attached to. In this case, is the year fixed, or in another column? And is the “month” value an integer (e.g., 3) or a string (e.g., ‘03’)? We’ve discussed the difficulties of this situation a lot, and in the past the consensus was to type the column appropriately for the type of data it contains, and acknowledge that creating a dat-time value from it is not straightforward (it depends on properly casting the value to string, zero padding it, and appending it to the year value from another place). That is much more than a format code is meant to indicate. So, in short, MM is not really a date.
Thanks for this question @srearl. I don't have an answer, but do have some thoughts, and definitely want hear what others think.
What does the EML schema say?
_The dateTime field is used for defining the characteristics of the attribute if it contains date and time values. DateTime is used when the values fall on the Gregorian calendar system. More ..._
I don't see anything in the definition that strictly precludes MM
as a dateTime. However, we run into ambiguity issues when interpreting the MM
dateTime as interval or ordinal values across years of the Gregorian Calendar.
What does ISO say?
_ISO 8601:2000 allowed truncation (by agreement), where leading components of a date or time are omitted. More ..._
However, it goes on ...
This provision was removed in ISO 8601:2004.
So ... is it no longer accepted?
What datetime parsers handle MM?
I'm not aware of any languages that parse MM
into a meaningful datetime representation without making assumptions about a reference year and day.
@srearl Two advantages of classifying MM
as a dateTime in ECC would be to constrain it's type as a "character" string and to verify the zero padding.
Thank you, @clnsmth. Those would be benefits for sure. I definitely see some pluses to this, but, based on input from Matt and Gastil on Slack, I am rethinking this and worry now that including MM
or some variation as an acceptable format might lend the impression that month alone is in fact a date. It is frustrating as investigators so often parse year and month.
@srearl can we have an example of the dataset in question? a docid is fine. thx
@mobb - Here is a probably a great case for using metabase as I could probably pull that up in no time. Lacking that, included here are data and corresponding metadata that I am working on currently that prompted the issue (but I encounter month as a parsed date field provided by investigators quite often).
|
neighborhood:
attributeName: neighborhood
attributeDefinition: 'neighborhood name'
columnClasses: factor
site_id:
attributeName: site_id
attributeDefinition: 'unique parcel identifier'
columnClasses: character
definition: ''
stories:
attributeName: stories
attributeDefinition: 'number of stories in the house'
columnClasses: factor
house_sqft:
attributeName: house_sqft
attributeDefinition: 'house interior square footage'
unit: 'footSquared'
numberType: natural
minimum: 594.0
maximum: 5646.0
columnClasses: numeric
year_built:
attributeName: year_built
attributeDefinition: 'construction year'
columnClasses: Date
formatString: YYYY
last_sale_price:
attributeName: last_sale_price
attributeDefinition: 'price of last sale (if home has sold)'
unit: 'number'
numberType: natural
minimum: 20000.0
maximum: 1250000.0
columnClasses: numeric
last_sale_month:
attributeName: last_sale_month
attributeDefinition: 'month of last sale (if home has sold)'
columnClasses: Date
formatString: MM
last_sale_year:
attributeName: last_sale_year
attributeDefinition: 'year of last sale (if home has sold)'
columnClasses: Date
formatString: YYYY
front_yard_area_m2:
attributeName: front_yard_area_m2
attributeDefinition: 'estimated area of the front yard'
unit: 'meterSquared'
numberType: natural
minimum: 50.0
maximum: 1165.0
columnClasses: numeric
Hi @srearl - thanks for the example. It looks like you have already resolved this to some extent, and agree that MM
alone is not an acceptable date. The recommendation is to types these fields as strings. An in this case, you could add to the definition some text to indicate that "MM within the year in column 'last_sale_year'"
Another option in this case is to add another column with YYYY and MM concatenated; both YYYYMM
and YYYY-MM
are in the list of PASTA's preferred date formats. That way you would have both what the scientist submitted (2 columns, YYYY and MM) and a column holding the value of the year-and-month in which the house sold, that is actually typed as a date.
@clnsmth - I have run into a situation where an investigator has submitted a dataset that includes a month field (as MM). This is not a supported dataTimeFormat in PASTA+ but does seem to be included in the ISO 8601 standard (scroll to "Truncated representations" at https://en.wikipedia.org/wiki/ISO_8601). In my case, the format for month would be
--MM
. Do you think that we could add these truncated representations to https://github.com/EDIorg/ECC/blob/master/practices/dateTimeFormatString/dateTimeFormatString_list.csv ?cc @amoeba