eprbell / dali-rp2

DaLI (Data Loader Interface) is a data loader and input generator for RP2 (https://pypi.org/project/rp2), the privacy-focused, free, open-source cryptocurrency tax calculator: DaLI removes the need to manually prepare RP2 input files. Just like RP2, DaLI is also free, open-source and it prioritizes user privacy.
https://pypi.org/project/dali-rp2/
Apache License 2.0

Attempt faster datetime parsing with backports.datetime_fromisoformat #151

Closed qwhelan closed 1 year ago

qwhelan commented 1 year ago

As part of Python 3.11, datetime.fromisoformat() greatly expanded its supported formats, which makes it more broadly useful; the backports.datetime_fromisoformat package extends support for this functionality back to Python 3.7.
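A minimal sketch of how the patch could be gated on the interpreter version, so that it is applied only where the standard library lacks the expanded parser. The helper name `enable_full_fromisoformat` is hypothetical and not part of DaLI; the `MonkeyPatch.patch_fromisoformat()` call is the one used in the session below, and the backport is treated as an optional dependency.

```python
import sys
from datetime import datetime


def enable_full_fromisoformat() -> bool:
    """Return True if datetime.fromisoformat() accepts the expanded formats.

    Hypothetical helper: on Python 3.11+ nothing needs to be done; on older
    interpreters the backport is applied if it happens to be installed.
    """
    if sys.version_info >= (3, 11):
        return True
    try:
        # Third-party package, assumed optional here.
        from backports.datetime_fromisoformat import MonkeyPatch
    except ImportError:
        return False
    MonkeyPatch.patch_fromisoformat()
    return True


if __name__ == "__main__":
    if enable_full_fromisoformat():
        # The trailing "Z" is one of the formats added in Python 3.11.
        print(datetime.fromisoformat("2022-03-31 00:00:00.083000Z"))
    else:
        print("Expanded fromisoformat() support is not available")
```

When the helper returns False, callers would still need a slower general-purpose parser for timestamps such as the one above.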

Notably, datetime.fromisoformat() is about 400x faster than dateutil.parser.parse():

In [1]: from backports.datetime_fromisoformat import MonkeyPatch

In [2]: MonkeyPatch.patch_fromisoformat()

In [3]: from datetime import datetime

In [4]: test_str = "2022-03-31 00:00:00.083000Z"

In [5]: %timeit datetime.fromisoformat(test_str)
141 ns ± 1.24 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

In [6]: from dateutil.parser import parse

In [7]: %timeit parse(test_str)
56.8 µs ± 686 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
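For anyone who wants to reproduce the comparison outside IPython, here is a rough sketch using the standard `timeit` module. The function names are hypothetical, the test string uses a `+00:00` offset so that it parses on Python 3.7+ without the backport, and `dateutil` is treated as optional. The lambda wrapper adds a small constant overhead to both measurements, so the absolute numbers will not match `%timeit` exactly.

```python
import timeit
from datetime import datetime

# Explicit +00:00 offset: accepted by datetime.fromisoformat() on Python 3.7+.
TEST_STR = "2022-03-31 00:00:00.083000+00:00"


def seconds_per_call(func, arg, number=10_000):
    """Return the mean wall-clock time of func(arg) over `number` calls."""
    return timeit.timeit(lambda: func(arg), number=number) / number


def compare_parsers(number=10_000):
    """Time datetime.fromisoformat() and, if installed, dateutil's parse()."""
    results = {
        "datetime.fromisoformat": seconds_per_call(datetime.fromisoformat, TEST_STR, number)
    }
    try:
        from dateutil.parser import parse  # third-party; skipped if missing
    except ImportError:
        return results
    results["dateutil.parser.parse"] = seconds_per_call(parse, TEST_STR, number)
    return results


if __name__ == "__main__":
    for name, secs in compare_parsers().items():
        print(f"{name}: {secs * 1e9:.0f} ns per call")
```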

dateutil.parser.parse()'s runtime is the vast majority of the time taken for AbstractTransaction.__init__(), where the timestamp validation line accounts for 89.1% of the total:

In [3]: %lprun -f AbstractTransaction.__init__ InTransaction('test', 'a', 'b', '2023-03-30 00:00:00Z', 'c', 'd', 'e', 'BUY', '1', '1')
Timer unit: 1e-09 s

Total time: 0.000329842 s
File: /home/chris/anaconda3/envs/coinbase/lib/python3.10/site-packages/dali/abstract_transaction.py
Function: __init__ at line 95

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    95                                               def __init__(
    96                                                   self,
    97                                                   plugin: str,
    98                                                   unique_id: str,
    99                                                   raw_data: str,
   100                                                   timestamp: str,
   101                                                   asset: str,
   102                                                   notes: Optional[str] = None,
   103                                                   is_spot_price_from_web: Optional[bool] = None,
   104                                                   fiat_ticker: Optional[str] = None,
   105                                               ) -> None:
   106         1      13720.0  13720.0      4.2          self.__plugin: str = self._validate_string_field(Keyword.PLUGIN.value, plugin, raw_data, disallow_empty=True, disallow_unknown=True)
   107         1       3301.0   3301.0      1.0          self.__unique_id: str = self._validate_string_field(Keyword.UNIQUE_ID.value, unique_id, raw_data, disallow_empty=True, disallow_unknown=False)
   108         1        565.0    565.0      0.2          if unique_id.startswith("0x"):
   109                                                       self.__unique_id = unique_id[len("0x") :]
   110         1       3846.0   3846.0      1.2          self.__raw_data: str = self._validate_string_field(Keyword.RAW_DATA.value, raw_data, raw_data, disallow_empty=True, disallow_unknown=True)
   111         1        173.0    173.0      0.1          self.__timestamp: str
   112         1        121.0    121.0      0.0          self.__timestamp_value: datetime
   113         1     294028.0 294028.0     89.1          (self.__timestamp, self.__timestamp_value) = self._validate_timestamp_field(Keyword.TIMESTAMP.value, timestamp, raw_data)
   114         1       8703.0   8703.0      2.6          self.__asset: str = self._validate_string_field(Keyword.ASSET.value, asset, raw_data, disallow_empty=True, disallow_unknown=True)
   115         1       3209.0   3209.0      1.0          self.__notes: Optional[str] = self._validate_optional_string_field(Keyword.NOTES.value, notes, raw_data, disallow_empty=False, disallow_unknown=True)
   116         1        245.0    245.0      0.1          if is_spot_price_from_web and not isinstance(is_spot_price_from_web, bool):
   117                                                       raise RP2RuntimeError(f"Internal error: {Keyword.IS_SPOT_PRICE_FROM_WEB.value} is not boolean: {is_spot_price_from_web}")
   118         1        553.0    553.0      0.2          self.__is_spot_price_from_web: bool = is_spot_price_from_web if is_spot_price_from_web else False
   119         1       1197.0   1197.0      0.4          self.__fiat_ticker: Optional[str] = self._validate_optional_string_field(
   120         1        181.0    181.0      0.1              "fiat_ticker", fiat_ticker, raw_data, disallow_empty=True, disallow_unknown=True
   121                                                   )

This contributes to the slow transaction creation time:

%timeit InTransaction('test', 'a', 'b', '2023-03-30 00:00:00Z', 'c', 'd', 'e', 'BUY', '1', '1')
172 µs ± 1.78 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

Using backports.datetime_fromisoformat cuts the creation time of a transaction by roughly 43% (172 µs to 97.2 µs) when the timestamp is parseable by datetime.fromisoformat():

%timeit InTransaction('test', 'a', 'b', '2023-03-30 00:00:00Z', 'c', 'd', 'e', 'BUY', '1', '1')
97.2 µs ± 4.3 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

qwhelan commented 1 year ago

The backstory on this PR is that I noticed dali was spending a surprisingly large amount of time reading my .csv files before even getting to the stage of hitting the network for price information. So it's impactful in my case, but that doesn't necessarily make it worth it for everyone else.

Given that datetime.fromisoformat() has existed since Python 3.7, just with a very limited set of supported formats, we could simply drop the backport module and keep the fallback behavior. That would be less beneficial, because 3.7-3.10 users will fall back to dateutil at a much higher rate, but it also means we don't need to explicitly drop support for those versions.
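The fallback behavior described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in the PR: `parse_timestamp` and `_strptime_fallback` are hypothetical names, and a small `strptime`-based function stands in for `dateutil.parser.parse()` so that the example runs with the standard library alone.

```python
from datetime import datetime, timezone
from typing import Callable


def _strptime_fallback(value: str) -> datetime:
    """Stand-in for the slower general-purpose parser (dateutil in DaLI)."""
    for fmt in ("%Y-%m-%d %H:%M:%S.%f%z", "%Y-%m-%d %H:%M:%S%z"):
        try:
            return datetime.strptime(value, fmt)
        except ValueError:
            continue
    raise ValueError(f"Unrecognized timestamp: {value!r}")


def parse_timestamp(
    value: str,
    fallback: Callable[[str], datetime] = _strptime_fallback,
) -> datetime:
    """Try the fast built-in parser first, then the slower fallback."""
    try:
        # Fast path: succeeds for far more inputs on Python 3.11+ than on 3.7-3.10.
        return datetime.fromisoformat(value)
    except ValueError:
        # Slow path: taken whenever the built-in parser rejects the format.
        return fallback(value)


if __name__ == "__main__":
    print(parse_timestamp("2023-03-30 00:00:00Z"))
```

In a real integration the `fallback` argument would presumably be `dateutil.parser.parse`, e.g. `parse_timestamp(ts, fallback=parse)`. On Python 3.7-3.10 a trailing `Z` is rejected by the built-in parser, so timestamps like the one above take the slow path there, which is the reduced benefit mentioned in the comment.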