**bmschmidt** opened this issue 7 years ago
Tagging @organisciak to make sure that this is compatible with the current state of the hathi build.
Preliminary version pushed to the `memory_tables` branch.

Still to do:

- `bookworm_reload_memory` should only materialize certain memory tables (`fastcat` is especially important).
I'm dissatisfied with several features of the way memory tables are handled.
## Current situation
The memory tables speed things up, or did when I benchmarked things five years ago.
There are currently up to four tables for each variable. Let me take the language field from `hathipd` as an example.

### On-disk tables

The first is the raw data imported from the user. This is keyed to some id variable (usually, but not necessarily, `bookid`). The field itself is text.

A SQL query then generates a table of ids and counts for each language. The ids are useful in queries because they make grouping and filtering significantly faster. The ids are assigned in decreasing order of prevalence; this isn't strictly necessary, but it enables a very useful query constraint I use a lot that restricts to only the most common elements: `{"languages__id":{"lte":10}}`. In order to get this property, the `__id` field is actually created by this query. It might be possible to create it in Python instead.

The `__count` field is not, AFAIK, used anywhere and could safely be deleted.

### In-memory tables
The on-disk tables persist through restarts, but they are not used directly in queries. Instead, two additional in-memory tables are.
The first is the `Lookup` table, a hash-indexed map between keys and ids. If you search for `"languages":["English"]`, this is the table that will be checked during the search for each book. (I assume, though I'm not sure, that after the first hit MySQL will actually cache the proper `languages__id` and this table isn't actually consulted 10 million times.) This table is just about the same as `languages__id`, except that it lives in memory, it keeps only the `languages -> languages__id` index (because extra indexes sometimes confuse the query optimizer), and it drops the `count` field.

In order to hit that lookup table, there needs to be another memory table that maps from `bookid` (or another id) to the language id: this is the `languagesheap` table. This memory table is created by merging two of the tables above.
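To make the four-table layout concrete, here is a minimal sketch using Python's `sqlite3` in place of MySQL (SQLite has no MEMORY engine, so the "in-memory" tables are ordinary tables here). Table and column names follow this issue; the exact DDL is an assumption, not the project's actual schema:

```python
import sqlite3

# In-memory SQLite stands in for MySQL; in bookworm the Lookup and heap
# tables would use the MySQL MEMORY engine.
db = sqlite3.connect(":memory:")

# 1. Raw on-disk data imported from the user: bookid -> text field.
db.execute("CREATE TABLE catalog (bookid INTEGER, languages TEXT)")
db.executemany("INSERT INTO catalog VALUES (?, ?)",
               [(1, "English"), (2, "English"), (3, "French"),
                (4, "German"), (5, "English"), (6, "French")])

# 2. On-disk __id table: one row per distinct value, ids assigned in
#    decreasing order of prevalence, so {"languages__id": {"lte": N}}
#    restricts a query to the N most common values.
db.execute("""
    CREATE TABLE languages__id AS
    SELECT ROW_NUMBER() OVER (ORDER BY COUNT(*) DESC) AS languages__id,
           languages,
           COUNT(*) AS languages__count
    FROM catalog GROUP BY languages
""")

# 3. "Lookup" table: key -> id, dropping the count field.
db.execute("""
    CREATE TABLE languagesLookup AS
    SELECT languages, languages__id FROM languages__id
""")

# 4. "heap" table: bookid -> id, built by merging the raw table
#    with the Lookup table.
db.execute("""
    CREATE TABLE languagesheap AS
    SELECT c.bookid, l.languages__id
    FROM catalog c JOIN languagesLookup l USING (languages)
""")

# English is the most common value, so it gets id 1.
row = db.execute(
    "SELECT languages__id FROM languagesLookup WHERE languages = 'English'"
).fetchone()
print(row[0])  # -> 1
```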
## Solution
There's potentially a single solution for all three of the problems above:

1. If the `Lookup` memory table is empty, use the on-disk `__id` table instead (with index hints to avoid paralyzing wrong-way lookups).
2. Keep an on-disk copy of the `bookid -> languages__id` map (like `languagesheap`). This will make creation of the `languagesheap` memory table faster.
3. Turning on memory tables then becomes a specific performance optimization you can enable for specific bookworms, without requiring a huge memory overhead for every bookworm.
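The fallback in the first step could be as simple as checking the memory table's row count before choosing which table to query. A sketch under the same sqlite3 stand-in; the helper function and its name are hypothetical, not part of bookworm:

```python
import sqlite3

def lookup_table(db: sqlite3.Connection, field: str) -> str:
    """Return the name of the table that resolves `field` keys to ids.

    Prefer the in-memory Lookup table; if it is empty or missing (e.g.
    memory tables are turned off for this bookworm), fall back to the
    on-disk __id table. Hypothetical helper, for illustration only.
    """
    memory, disk = f"{field}Lookup", f"{field}__id"
    try:
        (n,) = db.execute(f"SELECT COUNT(*) FROM {memory}").fetchone()
    except sqlite3.OperationalError:  # table doesn't exist
        n = 0
    return memory if n else disk

# Demo: no Lookup table exists yet, so the on-disk table is chosen.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE languages__id (languages TEXT, languages__id INTEGER)")
print(lookup_table(db, "languages"))  # -> languages__id
```

In MySQL the same check could be done once at server connection time rather than per query, since a MEMORY table only empties on restart.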
## Namespace issues
It would be nice to do a full rename of all these tables, but that might break some existing bookworms. So instead I think the path of least resistance is to add two new disk tables, called `languagesheap_` and `languages__id_` (i.e., with trailing underscores). If the memory table is empty or nonexistent, they'll be used instead; and the memory table will be created from this disk table, rather than from the other on-disk tables. (The old on-disk tables could be deleted, I guess.)

## Base tables
This process can apply to the base tables as well. The disk -> memory pattern also exists in the table pairs:

- `catalog` -> `fastcat`
- `words` -> `wordsheap`

The creation of these tables (especially `fastcat`) is going to be a big pain on the new `hathi` bookworm; so the existence of a `fastcat_` table will help there a lot.
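Under the trailing-underscore scheme, reloading a memory table becomes a straight copy from the persisted disk table rather than an expensive rebuild from `catalog`. A sketch of that reload step, again with sqlite3 standing in for MySQL's MEMORY engine; the function is hypothetical and only illustrates the copy:

```python
import sqlite3

def reload_memory_table(db: sqlite3.Connection, name: str) -> int:
    """Rebuild the memory table `name` from its on-disk `name_` copy.

    In MySQL this would be CREATE TABLE ... ENGINE=MEMORY followed by
    INSERT ... SELECT from the trailing-underscore disk table; sqlite3
    has no MEMORY engine, so this only sketches the copy. Hypothetical.
    """
    db.execute(f"DROP TABLE IF EXISTS {name}")
    db.execute(f"CREATE TABLE {name} AS SELECT * FROM {name}_")
    return db.execute(f"SELECT COUNT(*) FROM {name}").fetchone()[0]

db = sqlite3.connect(":memory:")
# Persisted disk copy of the fast catalog (survives a MySQL restart).
db.execute("CREATE TABLE fastcat_ (bookid INTEGER, languages__id INTEGER)")
db.executemany("INSERT INTO fastcat_ VALUES (?, ?)", [(1, 1), (2, 1), (3, 2)])

print(reload_memory_table(db, "fastcat"))  # -> 3
```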