fox-it / flow.record

Recordization library
GNU Affero General Public License v3.0

Speedup parsing of datetime fieldtypes initialization by string #87

Status: Closed (closed by yunzheng 10 months ago)

yunzheng commented 11 months ago

This change mainly removes the use of expensive regexes and exception handling when a datetime fieldtype is initialized from a string, which speeds up parsing significantly.
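
To illustrate the kind of approach described, here is a minimal sketch (not the code from this PR) that normalizes a datetime string with plain string operations before handing it to `datetime.fromisoformat`, so the common path needs no regex and no try/except. The function name `parse_iso_fast` is hypothetical, and the handling is an assumption limited to the formats used in the benchmark: a trailing `Z`, a `+HH:MM` or `-HH:MM` offset, and fractional seconds longer than six digits, which are truncated to microseconds.

```python
from datetime import datetime


def parse_iso_fast(value: str) -> datetime:
    """Parse an ISO 8601 / RFC 3339 style string using only string operations.

    Hypothetical helper for illustration; not flow.record's implementation.
    """
    # Rewrite a trailing "Z" as an explicit UTC offset.
    if value.endswith("Z"):
        value = value[:-1] + "+00:00"

    # Split off a UTC offset such as "+03:00" or "-02:00" if present.
    # The date part also contains "-", so the ":" check guards against
    # mistaking "2023-09-04" for a string that ends in an offset.
    tz = ""
    if len(value) > 6 and value[-6] in "+-" and value[-3] == ":":
        value, tz = value[:-6], value[-6:]

    # Pad or truncate fractional seconds to exactly six digits, the form
    # that fromisoformat() accepts on older Python versions.
    if "." in value:
        head, frac = value.split(".", 1)
        value = head + "." + frac[:6].ljust(6, "0")

    return datetime.fromisoformat(value + tz)
```

Truncating rather than rounding keeps the step a simple slice; whether flow.record makes the same choice is not stated in this thread.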

codecov[bot] commented 11 months ago

Codecov Report

Merging #87 (9b268b2) into main (f0a2608) will increase coverage by 0.04%. The diff coverage is 100.00%.

@@            Coverage Diff             @@
##             main      #87      +/-   ##
==========================================
+ Coverage   79.22%   79.27%   +0.04%     
==========================================
  Files          32       32              
  Lines        2932     2939       +7     
==========================================
+ Hits         2323     2330       +7     
  Misses        609      609              
Flag        Coverage Δ
unittests   79.27% <100.00%> (+0.04%) ↑

Flags with carried forward coverage won't be shown.

Files                                 Coverage Δ
flow/record/fieldtypes/__init__.py    91.53% <100.00%> (+0.13%) ↑

yunzheng commented 11 months ago

Benchmark script

Benchmarked using the following script.

#!/bin/sh -x

# default datetime str without any TZ data
python3 -m timeit -n 100000  "from flow.record.fieldtypes import datetime" 'datetime("2023-09-04T13:33:37")'

# default isoformat(), this is how it's serialized in msgpack for tz aware datetime objects
python3 -m timeit -n 100000  "from flow.record.fieldtypes import datetime" 'datetime("2023-09-04T13:33:37+03:00")'

# RFC3339Nano, 2006-01-02T15:04:05.999999999, used by Docker
python3 -m timeit -n 100000  "from flow.record.fieldtypes import datetime" 'datetime("2023-09-04T13:33:37.123456789999999")'
python3 -m timeit -n 100000  "from flow.record.fieldtypes import datetime" 'datetime("2023-09-04T13:33:37.123456789999999-02:00")'
python3 -m timeit -n 100000  "from flow.record.fieldtypes import datetime" 'datetime("2023-09-04T13:33:37.123456789999999Z")'

# other variants, but less common
python3 -m timeit -n 100000  "from flow.record.fieldtypes import datetime" 'datetime("2023-09-04T13:33:37Z")'
python3 -m timeit -n 100000  "from flow.record.fieldtypes import datetime" 'datetime("2023-09-04T13:33:37.123456-02:00")'
python3 -m timeit -n 100000  "from flow.record.fieldtypes import datetime" 'datetime("2023-09-04T13:33:37.123456789+03:00")'
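
The same kind of measurement can also be driven from Python with the standard `timeit` module. The sketch below is an assumption-laden stand-in: it times the standard library's `datetime.fromisoformat` instead of `flow.record.fieldtypes.datetime` so that it runs without flow.record installed, and the names `CASES` and `bench` are hypothetical. The inputs are limited to formats that `fromisoformat` accepts on older Python versions as well.

```python
import timeit
from datetime import datetime

# Subset of the benchmark inputs that the stdlib parser accepts everywhere.
CASES = [
    "2023-09-04T13:33:37",
    "2023-09-04T13:33:37+03:00",
    "2023-09-04T13:33:37.123456-02:00",
]


def bench(parse, cases, number=100_000):
    """Return the average time per call in microseconds for each input."""
    results = {}
    for case in cases:
        total = timeit.timeit(lambda: parse(case), number=number)
        results[case] = total / number * 1e6
    return results


if __name__ == "__main__":
    for case, usec in bench(datetime.fromisoformat, CASES).items():
        print(f"{usec:8.2f} usec per loop  {case}")
```

To benchmark flow.record itself, pass its `datetime` fieldtype as `parse` and add the remaining inputs from the shell script to `CASES`.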

Benchmark results

Rows 0-7 appear to follow the order of the eight timeit commands in the script.

Python 3.8

#  old (usec per loop)  new (usec per loop)  old - new (usec)
0  3.48                 3.5                  -0.02
1  15.9                 2.51                 13.39
2  8.79                 4.35                 4.44
3  31.5                 3.5                  28.0
4  27.6                 3.2                  24.4
5  13.4                 2.57                 10.83
6  18.3                 3.47                 14.83
7  32.8                 3.29                 29.51
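
The last column above is the absolute time saved (old minus new). A relative speedup (old divided by new) can be easier to compare across rows. The short sketch below derives both from the Python 3.8 timings in the table; the names `PY38` and `summarize` are hypothetical.

```python
# (old, new) timings in usec per loop, copied from the Python 3.8 table.
PY38 = [
    (3.48, 3.5), (15.9, 2.51), (8.79, 4.35), (31.5, 3.5),
    (27.6, 3.2), (13.4, 2.57), (18.3, 3.47), (32.8, 3.29),
]


def summarize(rows):
    """Return (row index, usec saved, speedup factor) for each row."""
    out = []
    for index, (old, new) in enumerate(rows):
        out.append((index, round(old - new, 2), round(old / new, 2)))
    return out


for index, saved, speedup in summarize(PY38):
    print(f"{index}  saved {saved:6.2f} usec  speedup {speedup:5.2f}x")
```

For row 3, for example, this gives 28.0 usec saved and a 9.0x speedup (31.5 / 3.5).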

Python 3.9

#  old (usec per loop)  new (usec per loop)  old - new (usec)
0  3.6                  3.53                 0.07
1  16.6                 2.55                 14.05
2  9.07                 4.33                 4.74
3  32.1                 3.46                 28.64
4  28.2                 3.36                 24.84
5  13.7                 2.63                 11.07
6  18.4                 3.48                 14.92
7  32.0                 3.18                 28.82

Python 3.10

#  old (usec per loop)  new (usec per loop)  old - new (usec)
0  3.11                 3.14                 -0.03
1  13.2                 2.29                 10.91
2  7.14                 4.05                 3.09
3  24.6                 3.61                 20.99
4  21.7                 3.05                 18.65
5  9.99                 2.55                 7.44
6  13.8                 3.29                 10.51
7  24.5                 2.99                 21.51

Python 3.11

Note: Python 3.11 already had a fast path before this change, so the difference there is small.

#  old (usec per loop)  new (usec per loop)  old - new (usec)
0  2.12                 2.12                 0.0
1  1.54                 1.33                 0.21
2  2.24                 2.12                 0.12
3  1.45                 1.43                 0.02
4  1.37                 1.34                 0.03
5  1.41                 1.3                  0.11
6  1.45                 1.38                 0.07
7  1.46                 1.41                 0.05
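
One plausible reading of the Python 3.11 note is that the library checks the interpreter version and, on 3.11 and newer, hands the string straight to the standard library, whose `datetime.fromisoformat` accepts a wider range of ISO 8601 input from that version on. The sketch below shows such a version gate under that assumption. The names `NATIVE_ISO` and `parse_gated` are hypothetical, and the fallback branch handles only a trailing `Z` to keep the example short.

```python
import sys
from datetime import datetime

# Assumption: on Python 3.11+ the stdlib parser needs no preprocessing.
NATIVE_ISO = sys.version_info >= (3, 11)


def parse_gated(value: str) -> datetime:
    """Parse via the stdlib directly where possible, else normalize first."""
    if NATIVE_ISO:
        # Fast path: delegate directly to the standard library.
        return datetime.fromisoformat(value)
    # Older interpreters reject a trailing "Z", so rewrite it first.
    if value.endswith("Z"):
        value = value[:-1] + "+00:00"
    return datetime.fromisoformat(value)


print(parse_gated("2023-09-04T13:33:37Z").isoformat())
```

Evaluating the version check once at import time keeps it out of the per-call path.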