Bookworm-project / BookwormDB

Tools for text tokenization and encoding
MIT License

Memory table creation bogs down with too many metadata fields on the main catalog #71

Open bmschmidt opened 9 years ago

bmschmidt commented 9 years ago

I've just encountered a new limiting case in the creation of memory tables.

When many large character vectors are being used across a whole bunch of categorical joins, you can end up with a query like the following that gets stuck in the "statistics" thread state.

INSERT INTO tmp
SELECT bookid, nwords, recordid__id, othername__id, canon__id, imprint__id,
       impaud__id, nationality__id, docid__id, author__id, pubname__id,
       inferreddate__id, recept__id, enumcron__id, OCLC__id, birth__id,
       judge__id, yrrev, gender__id, notes__id, firstpub, actualdate,
       title__id, pubrev
FROM catalog
  JOIN recordid__id USING (recordid)
  JOIN othername__id USING (othername)
  JOIN canon__id USING (canon)
  JOIN imprint__id USING (imprint)
  JOIN impaud__id USING (impaud)
  JOIN nationality__id USING (nationality)
  JOIN docid__id USING (docid)
  JOIN author__id USING (author)
  JOIN pubname__id USING (pubname)
  JOIN inferreddate__id USING (inferreddate)
  JOIN recept__id USING (recept)
  JOIN enumcron__id USING (enumcron)
  JOIN OCLC__id USING (OCLC)
  JOIN birth__id USING (birth)
  JOIN judge__id USING (judge)
  JOIN gender__id USING (gender)
  JOIN notes__id USING (notes)
  JOIN title__id USING (title)

Apparently MySQL/MariaDB spends a lot of time descending through all of these tables trying to find an optimal join order; the exhaustive search grows roughly factorially in the number of joined tables (and there are nearly twenty here), rather than the optimizer just taking a decent join order and running with it.

The normal way to address this is to lower optimizer_search_depth (as described here). A first stab at setting it as a session variable in OneClick.py failed; there may be better ways to do that through the Python MySQLdb driver.
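For reference, a minimal sketch of what I mean, assuming a plain MySQLdb connection (credentials and the query variable are placeholders, not the actual OneClick.py code). The variable is per-session, so the likely pitfall is that the SET SESSION statement has to run on the same connection that later executes the big INSERT ... SELECT.

```python
# Sketch only, not OneClick.py itself: cap the optimizer's join-order search
# for one session before populating the memory table.
import MySQLdb

# Placeholder connection parameters.
db = MySQLdb.connect(host="localhost", user="bookworm",
                     passwd="", db="bookworm")
cursor = db.cursor()

# 0 lets the server pick a search depth automatically; a small positive value
# bounds the exhaustive search directly (the default, 62, is effectively
# unbounded for a join this wide).
cursor.execute("SET SESSION optimizer_search_depth = 4")

# The setting only applies to this connection, so the INSERT INTO tmp
# SELECT ... above must be executed on the same connection afterwards.
memory_table_insert = "INSERT INTO tmp SELECT ... FROM catalog JOIN ..."  # the full query above
cursor.execute(memory_table_insert)
db.commit()
```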

Another option is to have fewer tables, or to mark them as not being "unique".