Background: The file "master.py" assumes that there are multiple files in texts/textids, and it creates a thread for each of those. (Typically we've aimed for 8 at once.) That means each of the operations in that file (encode, create unigrams, etc.) gets done in parallel on multicore machines.
Currently there's just one file in that folder, called 'new'; it gets created by the write_metadata function.
To turn threading back on, there needs to be a line at the end of that script that takes that file of textids and splits it up into several files that can be read in parallel. (Probably just by using the unix program 'split'.)
There can't be too many chunks, though, because each thread gets its own copy of the dictionary with all the words in it, which is typically about 3 million entries long.