Open trifle opened 5 years ago
Hi Pascal,
All of those changes sound great to me! Let's split them up into several PRs; they will be much easier to review and merge that way. I'd suggest breaking it up into 4 PRs following your bullet points.
For the n_jobs=-1: it's fine to train with this, but I found a large performance slowdown when using it for prediction, as sklearn was creating and destroying a thread pool for each test instance. This was a few versions of sklearn ago, and I don't know if it has since been fixed. So I had a hack to always run with n_jobs=1 for prediction unless explicitly overridden.
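For illustration, the kind of workaround described above could look roughly like the sketch below. This is not dragnet's actual code: `predict_with_n_jobs` is a hypothetical helper, and it assumes the estimator reads its `n_jobs` attribute at prediction time (as scikit-learn's forest ensembles do) and exposes a `predict` method.

```python
def predict_with_n_jobs(estimator, X, n_jobs=1):
    """Call estimator.predict(X) with n_jobs temporarily overridden.

    Hypothetical helper: defaults to a single job for prediction, but the
    caller can pass another value explicitly. The estimator's original
    n_jobs setting is restored afterwards, even if predict() raises.
    """
    # Estimators without an n_jobs attribute are predicted as-is.
    if not hasattr(estimator, "n_jobs"):
        return estimator.predict(X)

    original = estimator.n_jobs
    estimator.n_jobs = n_jobs
    try:
        return estimator.predict(X)
    finally:
        # Put the training-time setting (e.g. -1) back.
        estimator.n_jobs = original
```

With this, a model trained with n_jobs=-1 keeps that setting for any later refit, while `predict_with_n_jobs(model, X)` runs prediction with a single job unless `n_jobs` is passed explicitly.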
Hi @matt-peters, sounds like a good plan.
It's a good idea to actually benchmark each of the changes separately. I guess I'll find the time to split the PRs and document them the week after next.
Best, Pascal
Hi Matt, Dan,
thanks for this wonderful library.
While training some augmented models, I noticed that there are some steps in the process which could benefit a lot from parallelization. There are also small corners where expanding the interface a bit would streamline processing in some cases.
I could submit one or several PRs, but want to ask whether you would be willing to have them.
I have a fork that I'm collecting changes in.
So far I'd propose to:
data_processing.py:prepare_all_data
(This gives a near-linear speedup, saving a couple of minutes on my 4-core).
Parsing 1000 entries with pre-existing trees, three run average: 28.9 seconds (user time)
Parsing 1000 entries from string, three run average: 37.7 seconds (user time)
(Note that this includes the first parsing pass and some trivial overhead for loading the data. Since the time includes Python startup and model loading, the 30% saving is a lower-bound estimate).
str_block_list_cast (in blocks.pyx: https://github.com/dragnet-org/dragnet/blob/master/dragnet/blocks.pyx#L860) presents a non-trivial overhead. There may be some potential here: I tried simply skipping the entire casting step, only decoding the text to unicode for the regex to work. The extraction seemed to still work fine. So I'm not quite sure in which cases the blocks would be in an unknown (bytes, str) state.
Thanks, best, Pascal