dragnet-org / dragnet

Just the facts -- web page content extraction

Use more parallelization in training, other speedups #79

Open trifle opened 5 years ago

trifle commented 5 years ago

Hi Matt, Dan,

thanks for this wonderful library.

While training some augmented models, I noticed several steps in the process that could benefit a lot from parallelization. There are also a few spots where slightly expanding the interface would streamline certain use cases.
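
To give one concrete example of what I mean: per-document feature extraction is embarrassingly parallel. A minimal sketch using joblib, where `extract_features` is a hypothetical stand-in for the real per-document feature computation, not dragnet's actual API:

```python
# Sketch only: fan per-document work out across cores with joblib.
# `extract_features` is a hypothetical placeholder, not dragnet's API.
from joblib import Parallel, delayed

def extract_features(html):
    # Stand-in for the real per-document feature computation.
    return [len(html), html.count("<p")]

def extract_all(documents, n_jobs=-1):
    # Each document is independent, so this scales with core count.
    return Parallel(n_jobs=n_jobs)(
        delayed(extract_features)(doc) for doc in documents
    )
```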

I could submit one or several PRs, but wanted to ask first whether you'd be willing to take them.

I have a fork that I'm collecting changes in.

So far I'd propose to:

Thanks, best, Pascal

matt-peters commented 5 years ago

Hi Pascal,

All of those changes sound great to me! Let's split them up into several PRs; they will be much easier to review and merge that way. I'd suggest four PRs, one per bullet point.

For n_jobs=-1: it's fine to train with this, but I found a large performance slowdown when using it for prediction, since sklearn was creating and destroying a thread pool for each test instance. This was a few versions of sklearn ago, so I don't know whether it has since been fixed. So I had a hack to always run with n_jobs=1 for prediction unless explicitly overridden.
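
Roughly, the hack looks like this. A sketch assuming a scikit-learn ensemble; `ExtraTreesClassifier` here is illustrative, not necessarily the exact model class dragnet uses:

```python
# Sketch of the workaround: train with all cores, predict single-threaded
# by default so sklearn doesn't spin a thread pool up and down per call.
# ExtraTreesClassifier is illustrative; dragnet's actual model may differ.
from sklearn.ensemble import ExtraTreesClassifier

def train(X, y, n_jobs=-1):
    model = ExtraTreesClassifier(n_estimators=100, n_jobs=n_jobs)
    model.fit(X, y)
    return model

def predict(model, X, n_jobs=1):
    # Force n_jobs=1 for prediction unless the caller overrides it.
    saved = model.n_jobs
    model.set_params(n_jobs=n_jobs)
    try:
        return model.predict(X)
    finally:
        model.set_params(n_jobs=saved)
```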

trifle commented 5 years ago

Hi @matt-peters, sounds like a good plan.

It's a good idea to benchmark each of the changes separately. I should find the time to split up the PRs and document them the week after next.
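
For the benchmarks, something as simple as timing repeated predict calls under each setting should do. A sketch, where `time_predict` is a hypothetical helper and `model` is assumed to be an already-fitted sklearn estimator with an n_jobs parameter:

```python
# Rough benchmark sketch: average wall-clock time of predict() under a
# given n_jobs setting. Assumes `model` is an already-fitted sklearn
# estimator that accepts an n_jobs parameter.
import time

def time_predict(model, X, n_jobs, repeats=5):
    model.set_params(n_jobs=n_jobs)
    start = time.perf_counter()
    for _ in range(repeats):
        model.predict(X)
    return (time.perf_counter() - start) / repeats
```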

Best, Pascal