**bmschmidt** opened this issue 7 years ago
Tagging @organisciak to make sure that this is compatible with the current state of the hathi build.
Preliminary version pushed to the `memory_tables` branch.

Still to do:

- `bookworm_reload_memory` should only materialize certain memory tables (`fastcat` is especially important).
I'm dissatisfied with several features of the way memory tables are handled.
## Current situation
The memory tables speed things up, or did when I benchmarked things five years ago.
There are currently up to four tables for each variable. Let me take the language field from `hathipd` as an example.

### On-disk tables

The first is the raw data imported from the user. This is keyed to some id variable (usually, but not necessarily, `bookid`). The field itself is text.

A SQL query then generates a table of ids and counts for each language. The ids are useful in queries because they make grouping and filtering significantly faster. The ids are assigned in decreasing order of prevalence; this isn't strictly necessary, but it enables a very useful query constraint I use a lot that restricts to only the most common elements: `{"languages__id":{"lte":10}}`. In order to get this property, the `__id` field is actually created by this query. It might be possible to create it in Python instead.

The `__count` field is not, AFAIK, used anywhere and could safely be deleted.

### In-memory tables
The on-disk tables persist through restarts, but they are not used directly in queries. Instead, two additional in-memory tables are.
The first is the `Lookup` table, a hash-indexed map between keys and ids. If you search for `"languages":["English"]`, this is the table that will be checked during the search for each book. (I assume, though I'm not sure, that after the first hit MySQL will actually cache the proper `languages__id` and this table isn't actually consulted 10 million times.) This table is just about the same as `languages__id`, except that it lives in memory, it keeps only the `languages -> languages__id` index (because extra indexes sometimes confuse the query optimizer), and it drops the `count` field.

In order to hit that lookup table, there needs to be another memory table that maps from `bookid` (or another id) to the language id: this is the `languagesheap` table. This memory table is created by merging two of the tables above.
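To make the four-table layout concrete, here is a minimal sketch using Python's `sqlite3` in place of MySQL (SQLite has no MEMORY engine, so the "in-memory" tables are ordinary tables here). Table and column names follow this issue; the exact DDL is an assumption, not the project's actual schema:

```python
import sqlite3

# In-memory SQLite stands in for MySQL; in bookworm the Lookup and heap
# tables would use the MySQL MEMORY engine.
db = sqlite3.connect(":memory:")

# 1. Raw on-disk data imported from the user: bookid -> text field.
db.execute("CREATE TABLE catalog (bookid INTEGER, languages TEXT)")
db.executemany("INSERT INTO catalog VALUES (?, ?)",
               [(1, "English"), (2, "English"), (3, "French"),
                (4, "German"), (5, "English"), (6, "French")])

# 2. On-disk __id table: one row per distinct value, ids assigned in
#    decreasing order of prevalence, so {"languages__id": {"lte": N}}
#    restricts a query to the N most common values.
db.execute("""
    CREATE TABLE languages__id AS
    SELECT ROW_NUMBER() OVER (ORDER BY COUNT(*) DESC) AS languages__id,
           languages,
           COUNT(*) AS languages__count
    FROM catalog GROUP BY languages
""")

# 3. "Lookup" table: key -> id, dropping the count field.
db.execute("""
    CREATE TABLE languagesLookup AS
    SELECT languages, languages__id FROM languages__id
""")

# 4. "heap" table: bookid -> id, built by merging the raw table
#    with the Lookup table.
db.execute("""
    CREATE TABLE languagesheap AS
    SELECT c.bookid, l.languages__id
    FROM catalog c JOIN languagesLookup l USING (languages)
""")

# English is the most common value, so it gets id 1.
row = db.execute(
    "SELECT languages__id FROM languagesLookup WHERE languages = 'English'"
).fetchone()
print(row[0])  # -> 1
```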
## Solution
There's potentially a single solution for all three of the problems above:

1. If the `Lookup` memory table is empty, use the on-disk `__id` table instead (with index hints to avoid paralyzing wrong-way lookups).
2. Keep an on-disk copy of the `bookid -> languages__id` map (like `languagesheap`). This will make creation of the `languagesheap` memory table faster.
3. Turning on memory tables then becomes a specific performance optimization you can enable for specific bookworms, without requiring a huge memory overhead for every bookworm.
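The fallback in the first step could be as simple as checking the memory table's row count before choosing which table to query. A sketch under the same sqlite3 stand-in; the helper function and its name are hypothetical, not part of bookworm:

```python
import sqlite3

def lookup_table(db: sqlite3.Connection, field: str) -> str:
    """Return the name of the table that resolves `field` keys to ids.

    Prefer the in-memory Lookup table; if it is empty or missing (e.g.
    memory tables are turned off for this bookworm), fall back to the
    on-disk __id table. Hypothetical helper, for illustration only.
    """
    memory, disk = f"{field}Lookup", f"{field}__id"
    try:
        (n,) = db.execute(f"SELECT COUNT(*) FROM {memory}").fetchone()
    except sqlite3.OperationalError:  # table doesn't exist
        n = 0
    return memory if n else disk

# Demo: no Lookup table exists yet, so the on-disk table is chosen.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE languages__id (languages TEXT, languages__id INTEGER)")
print(lookup_table(db, "languages"))  # -> languages__id
```

In MySQL the same check could be done once at server connection time rather than per query, since a MEMORY table only empties on restart.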
## Namespace issues
It would be nice to do a full rename of all these tables, but that might break some existing bookworms. So instead I think the path of least resistance is to add two new disk tables, called `languagesheap_` and `languages__id_` (i.e., with trailing underscores). If the memory table is empty or nonexistent, they'll be used instead; and the memory table will be created from this disk table, rather than from the other on-disk tables. (The old on-disk tables could be deleted, I guess.)

## Base tables
This process can apply to the base tables as well. The disk -> memory pattern also exists in the table pairs:

- `catalog` -> `fastcat`
- `words` -> `wordsheap`

The creation of these tables (especially `fastcat`) is going to be a big pain on the new `hathi` bookworm; so the existence of a `fastcat_` table will help there a lot.
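Under the trailing-underscore scheme, reloading a memory table becomes a straight copy from the persisted disk table rather than an expensive rebuild from `catalog`. A sketch of that reload step, again with sqlite3 standing in for MySQL's MEMORY engine; the function is hypothetical and only illustrates the copy:

```python
import sqlite3

def reload_memory_table(db: sqlite3.Connection, name: str) -> int:
    """Rebuild the memory table `name` from its on-disk `name_` copy.

    In MySQL this would be CREATE TABLE ... ENGINE=MEMORY followed by
    INSERT ... SELECT from the trailing-underscore disk table; sqlite3
    has no MEMORY engine, so this only sketches the copy. Hypothetical.
    """
    db.execute(f"DROP TABLE IF EXISTS {name}")
    db.execute(f"CREATE TABLE {name} AS SELECT * FROM {name}_")
    return db.execute(f"SELECT COUNT(*) FROM {name}").fetchone()[0]

db = sqlite3.connect(":memory:")
# Persisted disk copy of the fast catalog (survives a MySQL restart).
db.execute("CREATE TABLE fastcat_ (bookid INTEGER, languages__id INTEGER)")
db.executemany("INSERT INTO fastcat_ VALUES (?, ?)", [(1, 1), (2, 1), (3, 2)])

print(reload_memory_table(db, "fastcat"))  # -> 3
```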