Toilal / rebulk

Define simple search patterns in bulk to perform advanced matching on any string
MIT License
55 stars, 9 forks

Performance analysis and improvements #8

Closed by ratoaq2 7 years ago

ratoaq2 commented 7 years ago

I've been profiling rebulk and guessit and I have some preliminary results that I could share.

My scenario is:

I measured the time spent (before / after) and I also used cProfile and line_profiler to help me understand where most of the time was spent.
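For readers who want to reproduce this kind of measurement, here is a minimal sketch of a cProfile run over a batch of release names. The `parse_release_name` function and the sample release name are hypothetical stand-ins, not the code that was actually profiled. In a real run the guessit call would go in their place.

```python
import cProfile
import io
import pstats


def parse_release_name(name):
    # Hypothetical stand-in for the real parsing call on one release name.
    return [part.lower() for part in name.replace(".", " ").split()]


def run_workload(names):
    # Process the whole batch, mirroring the "1000 release names" scenario.
    return [parse_release_name(name) for name in names]


names = ["Some.Show.S01E02.720p.HDTV.x264-GROUP"] * 1000

profiler = cProfile.Profile()
profiler.enable()
results = run_workload(names)
profiler.disable()

# Sort by cumulative time and report the ten most expensive entries.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(10)
report = stream.getvalue()
print(report)
```

Sorting by cumulative time tends to surface the high-level call that dominates the run. A line-level tool such as line_profiler can then be pointed at that function.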

On my machine, for the current guessit version and current rebulk version:

1000 release names take 37 seconds to be processed: 37 ms per release name

For the modified rebulk version:

1000 release names take 22.5 seconds to be processed: 22.5 ms per release name (+39% faster)

The first hotspot was the use of call to instantiate Match objects. Almost a million Match objects are created, and every call execution introspected the target to work out which kwargs were valid. Passing kwargs directly instead of going through call reduced the total time from 37 seconds to 28.5 seconds (+23% faster).
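The cost described above can be illustrated with a small sketch. The `Match` class and the `introspecting_call` helper below are simplified, hypothetical stand-ins written for this example, not rebulk's actual code. The sketch compares inspecting the signature on every instantiation against filtering the keyword arguments once and reusing them.

```python
import inspect
import timeit


class Match:
    # Hypothetical, simplified stand-in for rebulk's Match class.
    def __init__(self, start, end, value=None, name=None):
        self.start = start
        self.end = end
        self.value = value
        self.name = name


def introspecting_call(func, *args, **kwargs):
    # Illustrative helper: inspects the signature on every invocation
    # and drops keyword arguments the callable does not accept.
    accepted = inspect.signature(func).parameters
    filtered = {key: val for key, val in kwargs.items() if key in accepted}
    return func(*args, **filtered)


options = {"value": "720p", "name": "screen_size", "unused": True}

# Filter the keyword arguments once, up front, then reuse the result.
accepted_once = inspect.signature(Match).parameters
match_kwargs = {key: val for key, val in options.items() if key in accepted_once}

slow = timeit.timeit(lambda: introspecting_call(Match, 0, 4, **options), number=20000)
fast = timeit.timeit(lambda: Match(0, 4, **match_kwargs), number=20000)
print(f"introspection on every call: {slow:.3f}s")
print(f"pre-filtered kwargs:         {fast:.3f}s")
```

The absolute timings depend on the machine and Python version. The point is only that signature introspection repeated close to a million times adds up.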

The second hotspot is related to Matches instantiation (which happens several times for child matches). The BaseMatch contains a collection of dictionaries for fast access to matches by index, name, tag, etc. All these dictionaries are instantiated as soon as a Matches object is created, and all of them are populated as soon as matches are added to it. I tried some experimental code that instantiates and populates these dictionaries only when they are first needed, since not all rules access all dictionaries from all matches. That change reduced the total time from 28.5 seconds to 22.5 seconds (+21% faster).
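A minimal sketch of the lazy-index idea follows, assuming a single name index for brevity. `LazyMatches` and the namedtuple `Match` are hypothetical names invented for this illustration and do not reflect rebulk's real classes or the exact change that was made.

```python
from collections import defaultdict, namedtuple

# Hypothetical minimal match record; rebulk's real Match carries more fields.
Match = namedtuple("Match", ["start", "end", "name"])


class LazyMatches:
    """Illustrative container that builds its name index only on first use."""

    def __init__(self):
        self._matches = []
        self._name_index = None  # not built until named() is first called

    def append(self, match):
        self._matches.append(match)
        # Keep the index in sync only if it has already been built.
        if self._name_index is not None:
            self._name_index[match.name].append(match)

    def named(self, name):
        if self._name_index is None:
            index = defaultdict(list)
            for match in self._matches:
                index[match.name].append(match)
            self._name_index = index
        # .get avoids creating empty entries for unknown names.
        return list(self._name_index.get(name, []))


matches = LazyMatches()
matches.append(Match(0, 4, "title"))
matches.append(Match(5, 9, "year"))
index_built_before_lookup = matches._name_index is not None
titles = matches.named("title")
matches.append(Match(10, 14, "title"))
print(index_built_before_lookup, titles, len(matches.named("title")))
```

Containers that are created and discarded without ever being queried by name never pay for the index. The first lookup costs one pass over the stored matches.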

I ran the guessit test suite with these modifications, as well as another test suite that I have. Both remain green.

Hope this can be useful.

Toilal commented 7 years ago

Great!

Only one question before merging: is the "Handle unused kwargs" commit really useful?

ratoaq2 commented 7 years ago

pylint was failing. I had no time to check how to properly fix that.

Toilal commented 7 years ago

OK, I see! I'll merge, but I'll remove this commit and add a pylint ignore comment instead. Thank you, great job!

Toilal commented 7 years ago

On my computer, the guessit unit tests run twice as fast with this version on Python 3.5.1, going from 25s to 12.5s. With Python 2.7.11 the gain is smaller but still there, from 15s to 12.5s.

Toilal commented 7 years ago

Released in rebulk 0.8.2