APrioriInvestments / typed_python

An llvm-based framework for generating and calling into high-performance native code from Python.
Apache License 2.0

Add timestamp library and tests #403

Open launeh opened 2 years ago

launeh commented 2 years ago

Motivation and Context

This PR adds a Timestamp class that wraps a UNIX timestamp and provides datetime parsing and formatting. Secondary changes add supporting library/toolkit functionality for processing datetimes. Notably, all the changes are in typed_python and Entrypointable.

Approach

The Timestamp class wraps a UNIX timestamp. This UNIX timestamp can be provided, parsed from a string representing a datetime, or constructed from a set of values representing a datetime.

For example, you can create a Timestamp with any of the following statements. The first two take a UNIX timestamp directly; the third builds one from date components.

ts1 = Timestamp.make(1654615145)
ts2 = Timestamp(ts=1654615145)
ts3 = Timestamp.from_date(year=2022, month=11, day=20)

The module provides three ways to create Timestamps from string representations of dates. 1: You can tell the parser the format of the provided date string. This is the most efficient option and is equivalent to datetime.strptime(). E.g.

ts1 = Timestamp.parse_with_format(date_str="2022-01-05", format="%Y-%m-%d")

2: If the string is any variant of an ISO 8601 formatted string, you can use the .parse_iso_str method. This method is slightly more permissive than the ISO 8601 standard in that it allows a space as the date/time separator (in addition to 'T') and allows timezone abbreviations. E.g.

  ts1 = Timestamp.parse_iso_str("2022-01-05T10:11:12")
  ts2 = Timestamp.parse_iso_str("2022-01-05T10:11:12")
  ts3 = Timestamp.parse_iso_str("2022-01-05 10:11:12-0500")
  ts4 = Timestamp.parse_iso_str("2022-01-05 10:11:12ET")
  ts5 = Timestamp.parse_iso_str("2022-01-05 10:11:12NYC")

3: You can parse a range of non-ISO date formats with .parse_non_iso_str. E.g.

  ts1 = Timestamp.parse_non_iso_str("January 1, 1997")
  ts2 = Timestamp.parse_non_iso_str("Jan-1-1997")
  ts3 = Timestamp.parse_non_iso_str("1-Jan-1997")
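To make the datetime.strptime() comparison for the first option concrete, here is a minimal standard-library sketch of the same idea. It does not use typed_python: the helper name `unix_from_format` is hypothetical, and treating the parsed value as UTC is an assumption based on Timestamps being pegged to UTC.

```python
from datetime import datetime, timezone


def unix_from_format(date_str: str, fmt: str) -> float:
    """Parse date_str with a strptime-style format and return UNIX seconds.

    Assumption: a string without timezone information is interpreted as UTC.
    """
    naive = datetime.strptime(date_str, fmt)
    return naive.replace(tzinfo=timezone.utc).timestamp()


print(unix_from_format("2022-01-05", "%Y-%m-%d"))  # 1641340800.0
```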

For convenience, there's a multi-use .parse() entry point that will parse a date string with a format if one is provided. If no format string is provided, .parse will attempt to parse date_str as an ISO 8601 string. Failing that, it attempts to parse using the supported non-ISO formats.

  ts1 = Timestamp.parse("January 01, 1997", "%B %d, %Y")
  ts2 = Timestamp.parse("1997-01-01")
  ts3 = Timestamp.parse("1-Jan-1997")
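The fallback order described for .parse() can be sketched with the standard library alone. This is only an illustration of the dispatch (explicit format first, then ISO, then non-ISO), not the PR's implementation: `parse_sketch`, `_to_unix` and the `NON_ISO_FORMATS` list are hypothetical names, and the list of non-ISO formats is an assumption chosen to cover the examples above.

```python
from datetime import datetime, timezone
from typing import Optional

# Hypothetical list of non-ISO layouts; the formats the PR supports may differ.
NON_ISO_FORMATS = ("%B %d, %Y", "%b %d, %Y", "%b-%d-%Y", "%d-%b-%Y", "%d-%B-%Y")


def _to_unix(dt: datetime) -> float:
    # Assumption: strings without an offset are interpreted as UTC.
    if dt.tzinfo is None:
        dt = dt.replace(tzinfo=timezone.utc)
    return dt.timestamp()


def parse_sketch(date_str: str, fmt: Optional[str] = None) -> float:
    """Return UNIX seconds for date_str, trying the parsers in order."""
    # 1. An explicit format wins and is the cheapest path.
    if fmt is not None:
        return _to_unix(datetime.strptime(date_str, fmt))
    # 2. Otherwise try ISO 8601.
    try:
        return _to_unix(datetime.fromisoformat(date_str))
    except ValueError:
        pass
    # 3. Finally fall back to the known non-ISO layouts.
    for candidate in NON_ISO_FORMATS:
        try:
            return _to_unix(datetime.strptime(date_str, candidate))
        except ValueError:
            continue
    raise ValueError(f"unsupported date string: {date_str!r}")


print(parse_sketch("January 01, 1997", "%B %d, %Y"))  # 852076800.0
print(parse_sketch("1997-01-01"))                     # 852076800.0
print(parse_sketch("1-Jan-1997"))                     # 852076800.0
```

Trying the cheapest and strictest parser first keeps the common case fast and means the permissive fallback only runs when it is needed.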

You can convert Timestamps to strings using standard python time format directives. E.g:

ts = Timestamp.make(1654615145)
print(ts.format(utc_offset=144000))  # 2022-06-09T07:19:05
print(ts.format(format="%Y-%m-%d"))  # 2022-06-07
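For reference, the effect of utc_offset can be reproduced with the standard library. This is a sketch under assumptions, not the PR's code: `format_sketch` is a hypothetical helper, and it assumes the offset is simply added to the UTC time before formatting. A plain timedelta is used because datetime.timezone only accepts offsets strictly under 24 hours, while 144000 seconds is 40 hours.

```python
from datetime import datetime, timedelta, timezone


def format_sketch(ts: float, utc_offset: int = 0, fmt: str = "%Y-%m-%dT%H:%M:%S") -> str:
    """Format UNIX seconds as a string, shifted by utc_offset seconds from UTC."""
    shifted = datetime.fromtimestamp(ts, tz=timezone.utc) + timedelta(seconds=utc_offset)
    return shifted.strftime(fmt)


print(format_sketch(1654615145))                     # 2022-06-07T15:19:05
print(format_sketch(1654615145, utc_offset=144000))  # 2022-06-09T07:19:05
```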

The functionality for parsing datestrings is implemented in the reusable DateParser component. Specifically, the component exposes DateParser.parse which in turn proxies to DateParser.parse_iso_format and DateParser.parse_non_iso_format. These methods convert a string representation of a datetime to a UNIX timestamp. E.g.

  time = DateParser.parse("2022-01-05T10:11:12+00:15")
  time = DateParser.parse("2022-01-05T10:11:12NYC")

DateParser additionally depends on Timezone. Timestamps are pegged to UTC and do not store timezone information, so the parser needs to adjust the timestamp by the appropriate offset from UTC. Timezone provides support for converting a timezone abbreviation to a utc_offset. It also supports relative zones: if the zone is "ET" (Eastern Time) or "NYC", it returns the offset for either EST (Eastern Standard Time) or EDT (Eastern Daylight Time), as appropriate for the date.
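As an illustration of what resolving a relative zone involves, here is a small sketch that picks the EST or EDT offset for a given instant. It is not the Timezone component: the function names are hypothetical, it only knows the US rules in force since 2007 (DST from the second Sunday in March to the first Sunday in November), and it takes a UTC instant, whereas a parser would have to resolve the offset from a local wall-clock time.

```python
from datetime import datetime, timedelta, timezone

EST_OFFSET = -5 * 3600  # seconds east of UTC
EDT_OFFSET = -4 * 3600


def _nth_sunday(year: int, month: int, n: int) -> datetime:
    """Return midnight (naive) of the n-th Sunday of the given month."""
    first = datetime(year, month, 1)
    days_to_sunday = (6 - first.weekday()) % 7  # weekday(): Monday=0 ... Sunday=6
    return first + timedelta(days=days_to_sunday + 7 * (n - 1))


def eastern_utc_offset(ts: float) -> int:
    """UTC offset in seconds for US Eastern Time at UNIX time ts (post-2007 rules)."""
    year = datetime.fromtimestamp(ts, tz=timezone.utc).year
    # DST starts at 02:00 EST (07:00 UTC) and ends at 02:00 EDT (06:00 UTC).
    start = _nth_sunday(year, 3, 2).replace(hour=7, tzinfo=timezone.utc).timestamp()
    end = _nth_sunday(year, 11, 1).replace(hour=6, tzinfo=timezone.utc).timestamp()
    return EDT_OFFSET if start <= ts < end else EST_OFFSET


winter = datetime(2022, 1, 5, 10, 11, 12, tzinfo=timezone.utc).timestamp()
summer = datetime(2022, 7, 1, tzinfo=timezone.utc).timestamp()
print(eastern_utc_offset(winter))  # -18000 (EST)
print(eastern_utc_offset(summer))  # -14400 (EDT)
```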

Note: the date parsing logic handles a useful range of non-ISO date formats. For example, it will correctly parse dates like "Jan 2, 1997" or "Jan-1-1997" or "1-January-1997". However, parsing of ambiguous dates is NOT supported. For example, attempting to parse a date with a two-digit year causes the parser to throw an error.

The supporting functionality for formatting Timestamps as strings is implemented in the reusable DateFormatter component. E.g.

print(DateFormatter.format(ts=1654615145, utc_offset=144000))  # 2022-06-09T07:19:05
print(DateFormatter.format(ts=1654615145, format="%Y-%m-%d"))  # 2022-06-07

By default DateFormatter.format outputs an ISO 8601 formatted string (YYYY-MM-DDTHH:MM:SS). However, it also accepts a format string (E.g. "%Y-%m-%d") using standard python format directives.

By default DateFormatter.format returns a date string in UTC. However, it also accepts a utc_offset (in seconds) as input.

DateParser and DateFormatter both depend on some low-level datetime processing/validation algorithms. For example, these algorithms let you convert a timestamp to (day, month, year, day of week, weekday, hour, etc.) values and vice versa. These algorithms are implemented in the Chrono component.
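To give a flavour of this kind of low-level arithmetic, the sketch below converts between a day count since 1970-01-01 and a (year, month, day) triple using the days-from-civil algorithm published by Howard Hinnant. Whether Chrono uses this exact algorithm is not stated here, so treat it as an assumption; all function names are hypothetical.

```python
def days_from_civil(y: int, m: int, d: int) -> int:
    """Days since 1970-01-01 for a proleptic Gregorian date."""
    y = y - 1 if m <= 2 else y                   # treat Jan/Feb as end of previous year
    era = y // 400                               # 400-year cycle index
    yoe = y - era * 400                          # year of era, [0, 399]
    doy = (153 * (m + (-3 if m > 2 else 9)) + 2) // 5 + d - 1  # day of year from Mar 1
    doe = yoe * 365 + yoe // 4 - yoe // 100 + doy              # day of era
    return era * 146097 + doe - 719468


def civil_from_days(z: int) -> tuple:
    """Inverse of days_from_civil: (year, month, day) for a day count."""
    z += 719468
    era = z // 146097
    doe = z - era * 146097
    yoe = (doe - doe // 1460 + doe // 36524 - doe // 146096) // 365
    doy = doe - (365 * yoe + yoe // 4 - yoe // 100)
    mp = (5 * doy + 2) // 153                    # month index starting at March
    d = doy - (153 * mp + 2) // 5 + 1
    m = mp + 3 if mp < 10 else mp - 9
    y = yoe + era * 400
    return (y + 1 if m <= 2 else y, m, d)


def split_timestamp(ts: int) -> tuple:
    """(year, month, day, hour, minute, second, weekday) with weekday 0 = Sunday."""
    days, rem = divmod(ts, 86400)
    hour, rem = divmod(rem, 3600)
    minute, second = divmod(rem, 60)
    weekday = (days + 4) % 7                     # 1970-01-01 was a Thursday
    return civil_from_days(days) + (hour, minute, second, weekday)


print(days_from_civil(2022, 6, 7))   # 19150
print(split_timestamp(1654615145))   # (2022, 6, 7, 15, 19, 5, 2)
```

Pure integer arithmetic like this avoids lookup tables and branches on leap years, which is one reason it suits compiled code.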

How Has This Been Tested?

This PR adds tests for the individual components (DateParser, DateFormatter, Timezone, Chrono). It also adds extensive unit tests for the main Timestamp component.

The unit tests compare against standard Python objects/builtins where relevant. This means, for example, that the Timestamp.format functionality is checked for correctness against Python's datetime.strftime, and the Timestamp.parse* methods are checked for correctness against datetime.strptime.

For exhaustiveness (and possibly overkill), the tests run over long time ranges (e.g. 'all the days over a span of two years' or 'all the seconds over a span of 3 months').
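The shape of such a comparison test is roughly as follows. This is a sketch, not the PR's test code: `check_against_strftime` is a hypothetical harness, and `reference_format` is a stand-in for the implementation under test (in the real tests that role would be played by Timestamp.format).

```python
from datetime import datetime, timedelta, timezone


def check_against_strftime(format_fn, start: datetime, days: int, fmt: str = "%Y-%m-%d") -> list:
    """Return the timestamps for which format_fn disagrees with datetime.strftime."""
    mismatches = []
    for i in range(days):
        moment = start + timedelta(days=i)
        ts = int(moment.timestamp())
        if format_fn(ts, fmt) != moment.strftime(fmt):
            mismatches.append(ts)
    return mismatches


def reference_format(ts: int, fmt: str) -> str:
    # Stand-in for the formatter under test (hypothetical).
    return datetime.fromtimestamp(ts, tz=timezone.utc).strftime(fmt)


# All the days of 2020 and 2021 (366 + 365 = 731 days).
start = datetime(2020, 1, 1, tzinfo=timezone.utc)
print(check_against_strftime(reference_format, start, days=731))  # []
```

Collecting the mismatching timestamps rather than failing on the first one makes it easier to spot patterns such as leap days or month boundaries.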

Types of changes

Checklist: