ericleasemorgan / reader

Distant Reader, a tool for using & understanding a corpus
GNU General Public License v2.0

Add indexes to CORD sqlite schema #129

dbrower closed 4 years ago

dbrower commented 4 years ago

A few CORD processing scripts use these to index into the documents table. The indexes keep sqlite from doing a full table scan on these lookups.
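As a rough illustration of the effect, the sketch below builds a throwaway database, adds an index, and asks sqlite how it would run a lookup. The real CORD schema is not shown in this thread, so the `sha` and `title` columns, the index name `idx_documents_sha`, and the temporary database file are all assumptions made for the example. It also assumes the `sqlite3` command-line tool is installed.

```shell
# Sketch only: column names and index name are hypothetical,
# not taken from the actual CORD schema.
DB=$(mktemp)

sqlite3 "$DB" <<'SQL'
CREATE TABLE documents (document_id INTEGER PRIMARY KEY, sha TEXT, title TEXT);
INSERT INTO documents (sha, title) VALUES ('abc123', 'Example title');
CREATE INDEX IF NOT EXISTS idx_documents_sha ON documents (sha);
SQL

# Ask sqlite for its query plan for a lookup by sha.
plan=$(sqlite3 "$DB" "EXPLAIN QUERY PLAN SELECT title FROM documents WHERE sha = 'abc123';")
printf '%s\n' "$plan"

# Clean up the temporary database.
rm -f "$DB"
```

With the index in place, the printed plan should mention the index rather than a scan of the whole table. The exact wording of the plan varies between sqlite versions.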

Also change a few SQL strings to use bash interpolation instead of shelling out to sed.

ericleasemorgan commented 4 years ago

Adding indexes. Very intelligent. I do not think most people would have seen that. dbrower++ Let's hope the time spent creating the indexes is repaid by speedier retrieval. As for the removal of the template, well, I can live with that, but in most cases I prefer not to hard-code strings like that in the application.

In short, good work. Few people have looked at the code at the same level as you.

dbrower commented 4 years ago

Yeah, I've been trying to parallelize the CORD dataset processing, and indexes are the smallest part of that. The json2corpus.sh tasks, while executed in parallel, all use sqlite3 on the same database, and sqlite uses a file lock to serialize database access. Speeding up the sqlite commands should make everything faster. Not sure by how much, though.

I understand what you are thinking about the strings. My take is that 1) the string literal is still in the code, just later in the file, and 2) now the script doesn't start a subprocess to do the interpolation. But I'll try to minimize doing it that way in the future.
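To make the two approaches concrete, here is a minimal sketch. It is not the code from this pull request: the `__SHA__` placeholder, the `TEMPLATE` and `SHA` variables, and the query itself are hypothetical. It contrasts filling in a template through a `sed` subprocess with letting the shell interpolate the variable directly.

```shell
# Hypothetical value to substitute into the query.
SHA='abc123'

# Template approach: the SQL lives in a template string and sed,
# running as a separate process, fills in the placeholder.
TEMPLATE="SELECT title FROM documents WHERE sha = '__SHA__';"
sql_sed=$(printf '%s\n' "$TEMPLATE" | sed "s/__SHA__/$SHA/")

# Interpolation approach: the shell expands the variable in place,
# so no subprocess is started.
sql_interp="SELECT title FROM documents WHERE sha = '${SHA}';"

printf '%s\n' "$sql_sed"
printf '%s\n' "$sql_interp"
```

Both variables end up holding the same SQL text. Either way the value is pasted into the statement verbatim, so this pattern is only reasonable for values the script controls.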