Bookworm-project / BookwormDB

Tools for text tokenization and encoding
MIT License

Re-introduce parallelization #11

Closed bmschmidt closed 11 years ago

bmschmidt commented 12 years ago

Background: The file "master.py" assumes that there are multiple files in texts/textids, and it creates a thread for each of those files. (Typically, we've aimed for 8 at once.) That means each of the operations in that file (encode, create unigrams, etc.) gets done in parallel on multicore machines.
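A minimal sketch of that thread-per-file pattern, assuming the texts/textids layout described above and a hypothetical per-file worker named `encode_file` (neither is taken from master.py itself):

```python
import os
import threading

TEXTIDS_DIR = "texts/textids"  # assumed layout; one thread per file in this directory

def encode_file(path):
    # Hypothetical stand-in for the real per-file work
    # (encoding, building unigram counts, etc.).
    pass

threads = []
for name in os.listdir(TEXTIDS_DIR):
    t = threading.Thread(target=encode_file,
                         args=(os.path.join(TEXTIDS_DIR, name),))
    t.start()
    threads.append(t)

for t in threads:
    t.join()  # wait for every per-file worker to finish
```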

Currently, there's just one file in that folder (called 'new'; it gets created by the write_metadata function).

To have threading turned back on, there needs to be a line at the end of that script that takes that file of textids and splits it up into lots of different files that can be read in parallel (probably just by using the unix program 'split').

There can't be too many files, because each thread gets its own copy of the dictionary with all the words in it--typically about 3 million entries.
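One way this could be wired up, as a sketch: shell out to GNU split at the end of write_metadata to shard the 'new' file into a small fixed number of chunks. The paths and the chunk count here are illustrative assumptions, not the project's actual settings:

```python
import os
import subprocess

# Keep this small: each thread holds its own copy of the ~3M-entry word dictionary.
NUM_CHUNKS = 8

# GNU split's "-n l/N" mode shards the file into N pieces without breaking lines.
subprocess.check_call([
    "split",
    "-n", "l/%d" % NUM_CHUNKS,
    "texts/textids/new",      # the single file produced by write_metadata
    "texts/textids/chunk_",   # prefix for the output chunks
])

os.remove("texts/textids/new")  # drop the original so it isn't also picked up as a chunk
```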

bmschmidt commented 11 years ago

This is mostly complete, but should be tested further (particularly in the 'clean' stages).

bmschmidt commented 11 years ago

Let's call this complete now.