Open clnsmth opened 1 month ago
If we expand the list of supported formats, we could consider changing the handling of unsupported formats from warnings to errors. The rationale is that the expanded list covers a wide range of commonly used and unambiguous date-time formats, so any remaining unsupported formats are invalid and should be rejected from publication.
A key issue addressed by PEP-4 is the publication of date-time values in the repository that aren't checked due to unsupported formats. We've previously considered expanding the list of preferred formats to address this.
However, we could address this more directly by having two lists. One that defines the preferred formats, allowing us to maintain our focus on ISO 8601 as the preferred standard, and a second expanded list that is used when checking format-value congruence.
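As a rough illustration of the two-list idea, the sketch below keeps a preferred list and a larger expanded list, and uses only the expanded list when checking format-value congruence. The list contents, the token table, and the function names (`format_to_regex`, `check_congruence`) are assumptions made for this example, not the ECC's actual implementation.

```python
import re

# Hypothetical lists for illustration; the actual contents are still undecided.
PREFERRED_FORMATS = {"YYYY-MM-DD", "YYYY-MM-DDThh:mm:ss"}
EXPANDED_FORMATS = PREFERRED_FORMATS | {"YYYY/MM/DD", "MM/DD/YYYY"}

# Regex fragment for each format token (zero-padded here for simplicity).
TOKEN_PATTERNS = {
    "YYYY": r"\d{4}", "MM": r"\d{2}", "DD": r"\d{2}",
    "hh": r"\d{2}", "mm": r"\d{2}", "ss": r"\d{2}",
}
TOKEN_RE = re.compile("|".join(TOKEN_PATTERNS))


def format_to_regex(fmt):
    """Translate a format string into a regex, escaping literal separators."""
    parts, pos = [], 0
    for match in TOKEN_RE.finditer(fmt):
        parts.append(re.escape(fmt[pos:match.start()]))
        parts.append(TOKEN_PATTERNS[match.group()])
        pos = match.end()
    parts.append(re.escape(fmt[pos:]))
    return re.compile("".join(parts))


def check_congruence(fmt, values):
    """Return 'unsupported', 'mismatch', 'ok-not-preferred', or 'ok'."""
    if fmt not in EXPANDED_FORMATS:
        return "unsupported"  # values cannot be checked at all
    pattern = format_to_regex(fmt)
    # Shape check only: this does not reject impossible dates such as month 13.
    if not all(pattern.fullmatch(value) for value in values):
        return "mismatch"
    return "ok" if fmt in PREFERRED_FORMATS else "ok-not-preferred"
```

With this split, a format such as `YYYY/MM/DD` still gets its values checked even though it is not recommended, and whether `unsupported` maps to a warning or an error remains a separate policy decision.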
Below are preliminary decisions on PEP-4, with associated action items.
Use uppercase letters for date components (e.g., YYYY, MM, DD for year, month, day) and lowercase letters for time components (e.g., hh:mm:ss for hours, minutes, seconds), for the sake of consistency.
Support "/" as an additional date component separator alongside the existing "-" separator. This accommodates formatting that is commonly submitted to the repository and used within the research community.
Represent individual date and time components (e.g., year, hour) as numeric EML `AttributeType / measurementScale` rather than using the `dateTime` type.
Remove `YYYY` from the list of supported dates and times.
Continue to recommend ISO 8601 in data packaging best practices.
Develop a date and time checker library to:
To facilitate programmatic data reads, and conversion between formats, we will provide mappings between EML format strings and common representations in languages like R and Python. This mapping could be included in the date and time checker library, made accessible as a web service, or provided as a resource in a PASTAplus GitHub repository.
Example:
EML format string | strftime/strptime format codes |
---|---|
YYYY-MM-DD hh:mm:ss | %Y-%m-%d %H:%M:%S |
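A mapping of this kind could be sketched in Python roughly as follows. The token table and the `eml_to_strptime` helper are hypothetical names for this example. They cover only the numeric year, month, day, hour, minute, and second tokens, and assume the uppercase-date, lowercase-time convention described above.

```python
import re
from datetime import datetime

# Assumed token-to-code table; only a handful of common tokens are covered.
EML_TO_STRPTIME = {
    "YYYY": "%Y", "MM": "%m", "DD": "%d",
    "hh": "%H", "mm": "%M", "ss": "%S",
}
_TOKEN_RE = re.compile("|".join(EML_TO_STRPTIME))


def eml_to_strptime(eml_format):
    """Replace each known EML token with its strftime/strptime code."""
    return _TOKEN_RE.sub(lambda m: EML_TO_STRPTIME[m.group()], eml_format)


# Usage: convert the declared format, then parse a value with it.
codes = eml_to_strptime("YYYY-MM-DD hh:mm:ss")
parsed = datetime.strptime("2024-01-05 13:45:00", codes)
```

Separators and literals such as "T" pass through unchanged, so the same helper handles `YYYY-MM-DDThh:mm:ss`. An equivalent table for R could be maintained alongside it.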
Zero-padding will not be required, as most programming languages can interpret these formats accurately without it. However, this may affect regex-based congruence checks, so we will verify this does not impact the PostgreSQL database used by the ECC.
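The difference between parser behaviour and regex-based checks can be seen in a small Python sketch. The two patterns are illustrative stand-ins, not the expressions used by the ECC, and the behaviour shown is specific to Python's `strptime`. Other languages and PostgreSQL would need to be verified separately.

```python
import re
from datetime import datetime

value = "2024-1-5"  # month and day written without leading zeros

# Python's strptime accepts unpadded month and day for %m and %d.
parsed = datetime.strptime(value, "%Y-%m-%d")

# A regex that requires two digits rejects the same value...
strict = re.compile(r"\d{4}-\d{2}-\d{2}")
# ...while one that allows one or two digits accepts it.
lenient = re.compile(r"\d{4}-\d{1,2}-\d{1,2}")

strict_ok = strict.fullmatch(value) is not None
lenient_ok = lenient.fullmatch(value) is not None
```

Here the value parses correctly but fails the strict pattern, which is the kind of mismatch a regex-based congruence check would need to account for if zero-padding becomes optional.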
Formats like `dd-mon-yyyy` (e.g., using abbreviated months) will not be included in the preferred list due to challenges related to supporting multiple languages.
Formats outside the newly expanded list of preferred formats will continue to trigger a warning rather than an error. This allows valid formats that were left off the expanded list by oversight to still enter the repository.
We will seek community feedback to review and help finalize these recommendations.
During our recent meeting, we discussed PEP-4 and raised several important points that are summarized here for further consideration:
Item 1: Representation of Date and Time Components
Issue: Date and time components should not be described as `dateTime` types in EML.
Proposal: Individual components of date and/or time (e.g., year, hour) should be represented using numeric EML `AttributeType / measurementScale` rather than a `dateTime` type. This approach allows for the assignment of a unit from the EML standard unit dictionary to describe the date or time component (e.g., `nominalYear`, `nominalHour`).
Rationale: A year value, for example, is neither a complete date nor a time. Using `dateTime` for these components goes against the schema definition for that type. Assigning them to a numeric type allows for correct unit specification.
Action: The supported list of formats used by the ECC and ezEML congruence checkers needs to be reviewed and updated accordingly.
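One way a checker might flag such cases is sketched below. The lookup table and the `suggest_attribute_type` function are hypothetical. Only the two units named in the proposal are listed, and the returned dictionaries are an illustrative shape rather than EML itself.

```python
# Hypothetical lookup from single-component format strings to the unit
# that could describe them; only the units named in the proposal appear.
COMPONENT_UNITS = {"YYYY": "nominalYear", "hh": "nominalHour"}


def suggest_attribute_type(format_string):
    """Suggest a numeric type for lone components, dateTime otherwise."""
    unit = COMPONENT_UNITS.get(format_string.strip())
    if unit is not None:
        return {"type": "numeric", "unit": unit}
    return {"type": "dateTime", "formatString": format_string}
```

A column declared as `YYYY` would then be steered toward a numeric attribute with unit `nominalYear`, while `YYYY-MM-DD` would remain a `dateTime`.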
Item 2: Promoting Automation in Data Reading
Goal: We aim to promote automation for reading data to streamline usage by applications and researchers.
Challenge: This requires converting the format string declared in the EML into one that is compatible with the target application. This conversion can be complex, as shown by our experience with ECC and DEX applications.
Recommendation: ISO 8601 is widely recognized across many applications and remains a strong candidate for a date and time standard. However, we need to survey other commonly used formats to evaluate whether they should also be supported, while balancing automation needs with format flexibility.
Tentative Agreement: We should aim for a balance between enabling automation and extending support to widely used formats that may not be fully automatable.
Item 3: Zero-Padded Dates and Times
Current State: Zero padding for date and time values is required by the current list of supported formats. While some programs tend to drop leading zeros when writing to file, anecdotal evidence suggests this doesn’t impact the data’s readability.
Action: This behavior should be verified, as it may have implications for format checking.
Item 4: Case Sensitivity in Date and Time Formats
Observation: Data packages in production at the repository show significant variance in case usage for date and time formats (e.g., `yyyy-MM-dd` vs. `YYYY-MM-DD`).
Standard: ISO 8601 specifies that date components should use uppercase letters and time components should use lowercase letters for consistency across human and machine readers.
Discussion: We might not need to enforce strict case sensitivity, as context (e.g., `MM` in a date) can differentiate between date and time components without confusion.
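The context-based reading could be sketched as a small normalizer that uppercases date components and lowercases time components. The heuristic (a segment containing `h` or `s` is treated as a time, anything else as a date) and the `normalize_case` name are assumptions for this example. Ambiguous inputs such as a lone `mm` default to the date interpretation.

```python
import re


def normalize_case(format_string):
    """Uppercase date tokens and lowercase time tokens, using context."""
    # Split once at the date/time boundary ("T" or a space), keeping it.
    parts = re.split(r"([T ])", format_string, maxsplit=1)
    normalized = []
    for part in parts:
        if re.search(r"[HhSs]", part):
            # Time segment: lowercase hour, minute, and second tokens only,
            # leaving other characters (e.g. a trailing "Z") untouched.
            part = re.sub(r"[HhMmSs]+", lambda m: m.group().lower(), part)
        else:
            # Date segment (or separator): uppercase year, month, day tokens.
            part = re.sub(r"[YyMmDd]+", lambda m: m.group().upper(), part)
        normalized.append(part)
    return "".join(normalized)
```

Under these assumptions `yyyy-MM-dd HH:mm:ss` becomes `YYYY-MM-DD hh:mm:ss`, so mixed-case formats already in production could be mapped onto one canonical spelling before congruence checking.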