freelawproject / courts-db

A database of courts, tests and other experiments
BSD 2-Clause "Simplified" License

Lazy load data structures in courts-db #16

Closed · jcushman closed 3 years ago

jcushman commented 3 years ago

I noticed that eyecite's from courts_db import courts takes about 600 milliseconds, all of which is spent in the line regexes = gather_regexes(courts), compiling a bunch of regexes that eyecite doesn't actually use.

(I noticed this because somewhere in our Django app there's an import eyecite line that makes ./manage.py shell take an extra second, most of which turns out to be due to this courts-db regex thing. It's not important in the context of actual eyecite performance, but it's noticeable if you happen to do an import eyecite in some script.)
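(For anyone who wants to chase down a slowdown like this themselves, CPython 3.7+ can break import time down per module with the -X importtime flag, which is how a delay like this one can be traced to courts_db:

python -X importtime -c "import eyecite"

It prints each imported module's self and cumulative import time, in microseconds, to stderr.)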

So I thought a really neat and fun way to solve that would be with the new Python 3.7+ feature (PEP 562) that lets modules define a module-level __getattr__ function, so you can lazy-load a module's variables only when they're needed. I wanted some excuse to try that out, so here's a PR attempting to do that in a way that's readable and not too verbose. A major downside would be raising the minimum supported version to Py 3.7, and I won't be offended if you're like "nah."
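For anyone who hasn't seen the feature, the pattern looks roughly like this. This is a minimal sketch, not the PR's exact code, with load_courts_db and gather_regexes standing in for the real loaders in this package (their internals here are invented for illustration):

# courts_db/__init__.py (sketch)
import json
import re
from pathlib import Path

def load_courts_db():
    # Stand-in: read the bundled JSON data from disk.
    return json.loads(Path(__file__).with_name("data.json").read_text())

def gather_regexes(courts):
    # Stand-in for the expensive step: compile every court's regexes.
    return [re.compile(r) for court in courts for r in court.get("regex", [])]

def __getattr__(name):
    # PEP 562: this runs only when `name` isn't already a module global.
    if name == "courts":
        value = load_courts_db()
    elif name == "regexes":
        value = gather_regexes(load_courts_db())
    else:
        raise AttributeError(f"module {__name__!r} has no attribute {name!r}")
    globals()[name] = value  # cache it so later lookups never call __getattr__
    return value

With this, import courts_db is nearly free, and the cost of gather_regexes is only paid by callers that actually touch regexes.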

The other alternative I considered was adding a dependency on a lazy-object wrapper like this, so courts-db could just do something like courts = lazy_object_proxy.Proxy(load_courts_db). That could be the better call; it just requires picking a good lazy-object library and adding the dependency (and it wouldn't have let me mess around with module-level __getattr__).
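For comparison, the proxy version would be roughly as follows (lazy_object_proxy's Proxy takes a factory callable and defers calling it until the object is first used; the import path for load_courts_db is hypothetical):

import lazy_object_proxy

from .utils import load_courts_db  # hypothetical import path

# Behaves like the loaded data, but load_courts_db() only runs on first use.
courts = lazy_object_proxy.Proxy(load_courts_db)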

mlissner commented 3 years ago

This is neat. I don't really like that loaders module, though; it feels like a lot of cruft to accomplish lazy loading. The 3.6 breakage doesn't bother me; that seems fine.

jcushman commented 3 years ago

Ah yeah, that's not really helping much. Here's what it looks like without the separate loaders module.

mlissner commented 3 years ago

Yeah, that's definitely better. I assume calls like from . import courts don't create 600ms delays every time they're run?

mlissner commented 3 years ago

I'm surprised this doesn't affect reporters DB even more. Do I predict a sister PR in a few minutes?

jcushman commented 3 years ago

> I assume calls like from . import courts don't create 600ms delays every time they're run?

Right, the caching all works as you would hope because the results are stored in globals(), so __getattr__ only gets called the first time a variable is used:

In [1]: %time import courts_db
CPU times: user 3.19 ms, sys: 3.43 ms, total: 6.63 ms
Wall time: 8.94 ms

In [2]: %time from courts_db import courts
CPU times: user 10.4 ms, sys: 6.08 ms, total: 16.4 ms
Wall time: 18.7 ms

In [3]: %time from courts_db import court_dict
CPU times: user 157 µs, sys: 5 µs, total: 162 µs
Wall time: 168 µs

In [4]: %time from courts_db import regexes
CPU times: user 628 ms, sys: 4.87 ms, total: 633 ms
Wall time: 640 ms

In [5]: %time from courts_db import regexes
CPU times: user 5 µs, sys: 0 ns, total: 5 µs
Wall time: 6.91 µs

You can see regexes is really the only particularly slow one, but it seemed cleaner to treat all the global data structures the same.
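Concretely, treating them all the same can be one dispatch table. Again a sketch extending the one above, with make_court_dictionary as a hypothetical stand-in for whatever builds court_dict:

_LOADERS = {
    "courts": load_courts_db,
    "court_dict": lambda: make_court_dictionary(load_courts_db()),
    "regexes": lambda: gather_regexes(load_courts_db()),
}

def __getattr__(name):
    if name not in _LOADERS:
        raise AttributeError(f"module {__name__!r} has no attribute {name!r}")
    value = _LOADERS[name]()
    globals()[name] = value  # each global is built once, on first access
    return value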

> I'm surprised this doesn't affect reporters DB even more. Do I predict a sister PR in a few minutes?

Computers are so fast, though; import reporters_db runs in about 40ms despite everything it's doing. Here's the time to load each variable in reporters-db, in seconds:

REPORTERS                0.02842
STATE_ABBREVIATIONS      0.00028
CASE_NAME_ABBREVIATIONS  0.00024
LAWS                     0.00247
JOURNALS                 0.00360
RAW_REGEX_VARIABLES      0.00017
REGEX_VARIABLES          0.00030
VARIATIONS_ONLY          0.00116
EDITIONS                 0.00056
NAMES_TO_EDITIONS        0.00307
SPECIAL_FORMATS          0.00046
TOTAL                    0.04104
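Numbers like these can be gathered with a small harness that wraps each loader call; the loader names in the comments below are hypothetical, standing in for whatever reporters_db's __init__.py actually runs:

import time

def timed(name, loader):
    # Run one loader and report its wall-clock cost in seconds.
    start = time.time()
    value = loader()
    print(name, time.time() - start)
    return value

# e.g., inside reporters_db/__init__.py (hypothetical loader names):
# REPORTERS = timed("REPORTERS", load_reporters)
# LAWS = timed("LAWS", load_laws)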

If we like this lazy-loading approach, I think it would be totally reasonable to do it over there as well, just as a best practice for libraries that expose big data structures loaded from disk: if you're a Python library that loads big blobs of stuff, load them lazily. But it's less compelling performance-wise.

mlissner commented 3 years ago

Thanks. That's interesting. I wonder why gather_regexes is so much slower here, but we can get into that another day. Thanks for fixing this.

I'm not sure the microseconds saved for reporters are worth the clutter, but if you want it for consistency, that's fine by me.