dragnet-org / dragnet

Just the facts -- web page content extraction

Use more parallelization in training, other speedups #79

Open trifle opened 5 years ago

trifle commented 5 years ago

Hi Matt, Dan,

thanks for this wonderful library.

While training some augmented models, I noticed several steps in the process that could benefit a lot from parallelization. There are also a few spots where slightly expanding the interface would streamline certain use cases.
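
To give one concrete example of what I mean: per-document feature extraction is embarrassingly parallel. A minimal sketch using joblib, where `extract_features` is a hypothetical stand-in for the real per-document feature computation, not dragnet's actual API:

```python
# Sketch only: fan per-document work out across cores with joblib.
# `extract_features` is a hypothetical placeholder, not dragnet's API.
from joblib import Parallel, delayed

def extract_features(html):
    # Stand-in for the real per-document feature computation.
    return [len(html), html.count("<p")]

def extract_all(documents, n_jobs=-1):
    # Each document is independent, so this scales with core count.
    return Parallel(n_jobs=n_jobs)(
        delayed(extract_features)(doc) for doc in documents
    )
```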

I could submit one or several PRs, but wanted to ask first whether you'd be willing to take them.

I have a fork that I'm collecting changes in.

So far I'd propose to:

Thanks, best, Pascal

matt-peters commented 5 years ago

Hi Pascal,

All of those changes sound great to me! Let's split them up into several PRs; they will be much easier to review and merge that way. I'd suggest four PRs, one per bullet point.

For n_jobs=-1: it's fine to train with this, but I found a large performance slowdown when using it for prediction, since sklearn was creating and destroying a thread pool for each test instance. This was a few versions of sklearn ago, so I don't know whether it has since been fixed. So I had a hack to always run with n_jobs=1 for prediction unless explicitly overridden.
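
Roughly, the hack looks like this. A sketch assuming a scikit-learn ensemble; `ExtraTreesClassifier` here is illustrative, not necessarily the exact model class dragnet uses:

```python
# Sketch of the workaround: train with all cores, predict single-threaded
# by default so sklearn doesn't spin a thread pool up and down per call.
# ExtraTreesClassifier is illustrative; dragnet's actual model may differ.
from sklearn.ensemble import ExtraTreesClassifier

def train(X, y, n_jobs=-1):
    model = ExtraTreesClassifier(n_estimators=100, n_jobs=n_jobs)
    model.fit(X, y)
    return model

def predict(model, X, n_jobs=1):
    # Force n_jobs=1 for prediction unless the caller overrides it.
    saved = model.n_jobs
    model.set_params(n_jobs=n_jobs)
    try:
        return model.predict(X)
    finally:
        model.set_params(n_jobs=saved)
```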

trifle commented 5 years ago

Hi @matt-peters, sounds like a good plan.

It's a good idea to benchmark each of the changes separately. I should find the time to split up the PRs and document them the week after next.
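
For the benchmarks, something as simple as timing repeated predict calls under each setting should do. A sketch, where `time_predict` is a hypothetical helper and `model` is assumed to be an already-fitted sklearn estimator with an n_jobs parameter:

```python
# Rough benchmark sketch: average wall-clock time of predict() under a
# given n_jobs setting. Assumes `model` is an already-fitted sklearn
# estimator that accepts an n_jobs parameter.
import time

def time_predict(model, X, n_jobs, repeats=5):
    model.set_params(n_jobs=n_jobs)
    start = time.perf_counter()
    for _ in range(repeats):
        model.predict(X)
    return (time.perf_counter() - start) / repeats
```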

Best, Pascal