Instagram / MonkeyType

A Python library that generates static type annotations by collecting runtime types

Improve database structure #315

Closed chrhansk closed 8 months ago

chrhansk commented 8 months ago

I noticed a problem with the database created by MonkeyType during the tracing process. Namely, the database is very flat, only consisting of a single table:

        CREATE TABLE IF NOT EXISTS monkeytype_call_traces (
          created_at  TEXT,
          module      TEXT,
          qualname    TEXT,
          arg_types   TEXT,
          return_type TEXT,
          yield_type  TEXT);

First of all, the created_at column is redundant and never used. I suspect it was added to the table to ensure uniqueness of rows (which is ensured anyway by rowids).

Most problematically, no deduplication is done: new records are simply appended to the database one after another. As a consequence, the database becomes incredibly large even when no new information is added. For instance, when I trace the call my_method(value=1) one million times, the database contains one million records where a single one would do. This is particularly bothersome when MonkeyType is used with pytest, where parameterized tests quickly run into the thousands or millions of calls. I was trying to add types to a larger package and ended up with a database file of >= 100GB.
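The growth described above can be reproduced in miniature with an in-memory SQLite database. This is only a sketch: the table definition is the one quoted above, but the module name, the encoded type strings, and the call count are made-up placeholders, not MonkeyType's actual serialization.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE IF NOT EXISTS monkeytype_call_traces (
      created_at  TEXT,
      module      TEXT,
      qualname    TEXT,
      arg_types   TEXT,
      return_type TEXT,
      yield_type  TEXT)
    """
)

# Hypothetical encoded trace for my_method(value=1); the payload format is an assumption.
trace = ("my_module", "my_method", '{"value": "int"}', '"NoneType"', None)

# Record the same call 10,000 times, as a parameterized test suite might.
conn.executemany(
    "INSERT INTO monkeytype_call_traces VALUES (datetime('now'), ?, ?, ?, ?, ?)",
    [trace] * 10_000,
)

total_rows = conn.execute(
    "SELECT COUNT(*) FROM monkeytype_call_traces"
).fetchone()[0]

# Count how many rows differ in anything other than the timestamp.
distinct_signatures = conn.execute(
    """
    SELECT COUNT(*) FROM (
      SELECT DISTINCT module, qualname, arg_types, return_type, yield_type
      FROM monkeytype_call_traces)
    """
).fetchone()[0]

print(total_rows, distinct_signatures)
```

Every row apart from the timestamp is identical, so the table grows linearly with the number of calls while the number of distinct signatures stays constant.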

The new structure splits the database into multiple tables for signatures, modules, and functions, where signatures are deduplicated. In my example, this reduced the database size to below 1MB.
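A normalized layout along these lines might look like the sketch below. The table and column names, the `record` helper, and the use of `INSERT OR IGNORE` are illustrative assumptions; the actual schema in the PR is not shown in this thread.

```python
import sqlite3

# Hypothetical normalized schema: one row per module, function, and distinct signature.
SCHEMA = """
CREATE TABLE modules (
  id   INTEGER PRIMARY KEY,
  name TEXT NOT NULL UNIQUE);
CREATE TABLE functions (
  id        INTEGER PRIMARY KEY,
  module_id INTEGER NOT NULL REFERENCES modules(id),
  qualname  TEXT NOT NULL,
  UNIQUE (module_id, qualname));
CREATE TABLE signatures (
  id          INTEGER PRIMARY KEY,
  function_id INTEGER NOT NULL REFERENCES functions(id),
  arg_types   TEXT NOT NULL,
  return_type TEXT NOT NULL DEFAULT '',
  yield_type  TEXT NOT NULL DEFAULT '',
  UNIQUE (function_id, arg_types, return_type, yield_type));
"""


def record(conn, module, qualname, arg_types, return_type=None, yield_type=None):
    """Store a trace, inserting each module, function, and signature at most once."""
    conn.execute("INSERT OR IGNORE INTO modules (name) VALUES (?)", (module,))
    module_id = conn.execute(
        "SELECT id FROM modules WHERE name = ?", (module,)
    ).fetchone()[0]
    conn.execute(
        "INSERT OR IGNORE INTO functions (module_id, qualname) VALUES (?, ?)",
        (module_id, qualname),
    )
    function_id = conn.execute(
        "SELECT id FROM functions WHERE module_id = ? AND qualname = ?",
        (module_id, qualname),
    ).fetchone()[0]
    # Missing types are stored as '' because SQLite treats NULLs as distinct
    # in UNIQUE constraints, which would defeat the deduplication.
    conn.execute(
        "INSERT OR IGNORE INTO signatures "
        "(function_id, arg_types, return_type, yield_type) VALUES (?, ?, ?, ?)",
        (function_id, arg_types, return_type or "", yield_type or ""),
    )


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
for _ in range(10_000):
    record(conn, "my_module", "my_method", '{"value": "int"}', '"NoneType"')

signature_count = conn.execute("SELECT COUNT(*) FROM signatures").fetchone()[0]
function_count = conn.execute("SELECT COUNT(*) FROM functions").fetchone()[0]
print(signature_count, function_count)
```

The uniqueness constraints do the deduplication, so repeated calls with an identical signature leave the table sizes unchanged.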

carljm commented 8 months ago

Hi! Appreciate the PR. I think I don't want to make this change, though.

The current database design is duplicative, but it's intended to represent a historical record of calls observed. Both the relative frequency of different observed signatures and their recency are potentially relevant data that your normalization loses. Both are used implicitly, because the query limits the number of rows fetched and orders by created_at (so that field is used). The intention is that if you are running MonkeyType in production with a changing code base, "old" calls will age out as new ones are observed.
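The aging-out behaviour can be illustrated with a small sketch. It assumes a query that orders by created_at descending and applies a row limit, as described above; the exact SQL MonkeyType runs may differ, and the `fetch_recent` helper, the timestamps, and the encoded types are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE monkeytype_call_traces (
      created_at  TEXT,
      module      TEXT,
      qualname    TEXT,
      arg_types   TEXT,
      return_type TEXT,
      yield_type  TEXT)
    """
)

# Five old calls taking a str, then five newer calls taking an int.
old = [("2023-01-01 00:00:00", "m", "f", '{"value": "str"}', None, None)] * 5
new = [("2024-01-01 00:00:00", "m", "f", '{"value": "int"}', None, None)] * 5
conn.executemany(
    "INSERT INTO monkeytype_call_traces VALUES (?, ?, ?, ?, ?, ?)", old + new
)


def fetch_recent(conn, module, limit):
    """Return arg_types of the most recent `limit` traces for a module."""
    cursor = conn.execute(
        "SELECT arg_types FROM monkeytype_call_traces "
        "WHERE module = ? ORDER BY created_at DESC LIMIT ?",
        (module, limit),
    )
    return [row[0] for row in cursor]


recent = fetch_recent(conn, "m", 5)
everything = fetch_recent(conn, "m", 10)
print(set(recent), set(everything))
```

With a limit of 5, only the newer int signature is fetched. A deduplicated table without timestamps or counts could not make this distinction.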

More basically, I don't want to change this because it is backwards-incompatible for current MonkeyType users who may have trace dbs sitting around; those shouldn't become unusable on a MonkeyType update.

The CallTraceStore interface, and the ability to configure a custom CallTraceStore, exist so that if the simple default design doesn't suit your needs, you are free to create your own store and use it. If the version in this PR suits your needs better, I recommend you take that route.
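One possible shape for such a custom store is a thin wrapper that drops repeated traces before they reach the underlying store. The sketch below is deliberately independent of MonkeyType: `DedupingStore`, `ListStore`, and the `key` function are hypothetical names, and the only assumption about the store interface is a method `add(traces)` that accepts an iterable. Check the CallTraceStore documentation for the real method signatures before adapting it.

```python
from typing import Any, Callable, Hashable, Iterable, List, Set


class ListStore:
    """Stand-in for a real trace store; it just keeps traces in a list."""

    def __init__(self) -> None:
        self.rows: List[Any] = []

    def add(self, traces: Iterable[Any]) -> None:
        self.rows.extend(traces)


class DedupingStore:
    """Hypothetical wrapper that forwards each distinct trace to the inner store once."""

    def __init__(self, inner: Any, key: Callable[[Any], Hashable]) -> None:
        self._inner = inner
        self._key = key  # maps a trace to a hashable identity
        self._seen: Set[Hashable] = set()

    def add(self, traces: Iterable[Any]) -> None:
        fresh = []
        for trace in traces:
            identity = self._key(trace)
            if identity not in self._seen:
                self._seen.add(identity)
                fresh.append(trace)
        if fresh:
            self._inner.add(fresh)


inner = ListStore()
store = DedupingStore(inner, key=lambda trace: trace)

# Traces are modelled as plain tuples here purely for illustration.
store.add([("my_module", "my_method", '{"value": "int"}')] * 1000)
store.add([("my_module", "my_method", '{"value": "str"}')] * 1000)
print(len(inner.rows))
```

The wrapper only remembers traces seen within one process, and it discards the frequency and recency information that the default design keeps, so it trades the aging-out behaviour for a smaller database.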