Interesting that most of the time is spent there - last time I profiled the training time it seemed that numpy.core._dotblas.dot was the bottleneck, and I concluded that further optimization would be difficult - but obviously that's not true now.
Since fast scoring is important, now is a good time to look into it. _forward and _build_lattice are intimately linked, so I guess they will have to be optimized together. I think they are stable enough - although the alternative state machines haven't been used much (they are tested, though).
An alternative to trying to optimize these methods (which I think will be quite tricky) is to add an alternative scoring method. Instead of the sum-product algorithm we can try Viterbi decoding - that might make it possible to implement better lattice search strategies/heuristics, at the cost of possibly worse accuracy (although McCallum found Viterbi worked better in some cases).
But first I'll profile a bit and see what's possible. (From your profile it seems that about 10% of the time is spent in the BLAS call; if that part is irreducible, we can't really expect more than a 10x speedup, at the very most, without changing the algorithm.)
I'm still developing my understanding of the code, but it looks like there are no transitions in the lattice between state 0 and state 1 (in the two-state case). If we have N classes, could we calculate the predicted probability for only N-1 classes in _forward? In the two-class case, that could halve the time.
Interesting idea - it kind of fundamentally changes the model, though. It would be perfect if we could keep the old behaviour as well - I'll think a bit about how to do it.
Nearly all the calls to .dot can be avoided by precomputing the dot products. The individual products are typically pretty small, so the opportunity for speedup is much more than 10x.
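Roughly, the idea is something like this (a sketch - the shapes and names are assumptions, not pyhacrf's exact API): one matrix multiply scores every needed lattice position against every state, replacing thousands of tiny .dot() calls inside the lattice loop.

import numpy as np

def precompute_scores(x, parameters, positions):
    # x: (I, J, K) feature array, one K-dim vector per lattice position.
    # parameters: (S, K), one weight vector per state.
    # positions: the (i, j) pairs that actually occur in the lattice.
    I, J = zip(*positions)
    # One (n, K) x (K, S) multiply instead of n * S tiny .dot() calls.
    scores = np.dot(x[I, J, :], parameters.T)
    return {pos: row for pos, row in zip(positions, scores)}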
https://github.com/dirko/pyhacrf/commit/02ddc35bbb52fb9275dccadb1591755609c4fa24
ncalls tottime percall cumtime percall filename:lineno(function)
4088 0.256 0.000 31.032 0.008 pyhacrf.py:124(predict_proba)
4088 0.093 0.000 17.646 0.004 pyhacrf.py:326(__init__)
4088 13.001 0.003 17.553 0.004 pyhacrf.py:435(_build_lattice)
4087 0.015 0.000 13.065 0.003 pyhacrf.py:355(predict)
4087 0.097 0.000 13.017 0.003 pyhacrf.py:360(_forward_probabilities)
4087 11.392 0.003 12.917 0.003 pyhacrf.py:374(_forward)
That's neat! Makes a lot of sense.
Btw on line 403 (https://github.com/dirko/pyhacrf/commit/02ddc35bbb52fb9275dccadb1591755609c4fa24#diff-a463d967e7529d28ffb5703500397490R403) the two states s0 and s1 will differ if more complicated state machines than the default are used.
Good to know that about the states!
So, with the dot products precomputed, kernprof shows that 30% of the time in _forward is spent on logaddexp, and the rest on Python-overhead stuff. That's the kind of thing that Cython can usually help with a lot.
Timer unit: 1e-06 s
File: /home/fgregg/public/pyhacrf/pyhacrf/pyhacrf.py
Function: _forward at line 374
Total time: 18.0721 s
Line # Hits Time Per Hit % Time Line Contents
==============================================================
374 @profile
375 def _forward(self, parameters):
376 """ Helper to calculate the forward weights. """
377 1798 8020 4.5 0.0 alpha = defaultdict(lambda: -np.inf)
378 1798 4369 2.4 0.0 dot = np.dot
379 1798 3395 1.9 0.0 logaddexp = np.logaddexp
380 1798 3299 1.8 0.0 x = self.x
381 1798 852929 474.4 4.7 k = sorted(set((node[0], node[1]) for node in self._lattice))
382 1798 4095 2.3 0.0 if not k :
383 1 1 1.0 0.0 return alpha
384
385 1797 32064 17.8 0.2 I, J = zip(*k)
386 1797 3612 2.0 0.0 all_scores = {}
387
388 #import pdb
389 #pdb.set_trace()
390
391 1797 3681 2.0 0.0 all_scores = {node : score for
392 node, score in
393 1797 235808 131.2 1.3 zip(k, dot(x[I, J, :], parameters.T))}
394
395 1111007 2081324 1.9 11.5 for node in self._lattice:
396 1109211 2154515 1.9 11.9 if len(node) == 3:
397 309778 564027 1.8 3.1 i, j, s = node
398 309778 550805 1.8 3.0 if i == 0 and j == 0:
399 3594 11270 3.1 0.1 alpha[node] = all_scores[(i,j)][s]
400 else:
401 306184 965576 3.2 5.3 alpha[node] += all_scores[(i,j)][s]
402 else:
403 799433 1538643 1.9 8.5 i0, j0, s0, i1, j1, s1, edge_parameter_index = node
404 # Actually an edge in this case
405 # Use the features at the destination of the edge.
406 799433 1760960 2.2 9.7 edge_potential = (all_scores[(i1, j1)][edge_parameter_index]
407 799433 2030140 2.5 11.2 + alpha[(i0, j0, s0)])
408 #alpha[node] = edge_potential
409 799432 5260476 6.6 29.1 alpha[(i1, j1, s1)] = logaddexp(alpha[(i1, j1, s1)], edge_potential)
410 1796 3083 1.7 0.0 return alpha
Agreed.
I think it might also help a lot to store the lattice in some other data structure - it might fit into a numpy array. So instead of a list of tuples there could be a 2-dimensional array containing indices into another array.
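A rough sketch of that packed representation (names are hypothetical): split the current mixed list of 3-tuples (nodes) and 7-tuples (edges) into two contiguous integer arrays that a tight loop - or later Cython - can walk without unpacking Python tuples.

import numpy as np

def pack_lattice(lattice):
    # Nodes are (i, j, s); edges are (i0, j0, s0, i1, j1, s1, parameter_index).
    nodes = np.array([t for t in lattice if len(t) == 3], dtype=np.int64)
    edges = np.array([t for t in lattice if len(t) == 7], dtype=np.int64)
    return nodes, edges  # shapes (n_nodes, 3) and (n_edges, 7)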
Why is line 408 above commented out?
Sounds good.
For line 408, I couldn't find any place that used the alphas for the edge nodes.
Ah, you're right - the edge alphas are not used during scoring, only training.
I'll split _forward up into a training _forward and a scoring _forward a bit later. For now, because that line only accounts for 8% of the run time, I'll keep it as one method to simplify the refactoring.
Makes sense.
With my latest PR, we are really spending all our time in _build_lattice:
2042 0.017 0.000 21.931 0.011 __init__.py:34(__call__)
2042 0.317 0.000 18.937 0.009 pyhacrf.py:125(predict_proba)
2042 0.148 0.000 17.553 0.009 pyhacrf.py:217(__init__)
2042 12.632 0.006 17.404 0.009 pyhacrf.py:287(_build_lattice)
What ideas do you have for speeding up that method?
It seems like some things might be possible if we knew that the transitions were always local: (1, 1), (0, 1), and (1, 0). Also, is it ever the case that State 2 is different from State 1?
Here's an idea for _build_lattice.
Transitions can be separated into those that depend on the particular pair of strings and those that are independent of it.
For the independent transitions we can precalculate a large lattice, say equivalent to two strings of 50 characters each. For a particular pair of real strings, if both are shorter than 50 characters, we just subset the precalculated lattice. If one of them is longer, we use the precalculated lattice as the basis.
When we have transitions that depend on the pair of strings, we add those edges, as necessary, to the precalculated lattice.
So, in _build_lattice, we would just subset and patch that precalculated template.
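A sketch of the caching step, under the assumption that edges are packed as integer rows with the destination (i1, j1) in columns 3 and 4, and that full_lattice stands in for the current slow construction (both names are hypothetical):

import numpy as np

_TEMPLATE = None  # edges of a 50 x 50 lattice, built once and reused

def build_lattice_cached(n, m, template_size=50):
    global _TEMPLATE
    if _TEMPLATE is None:
        _TEMPLATE = full_lattice(template_size, template_size)  # slow path, run once
    if n <= template_size and m <= template_size:
        # For monotone local transitions, an edge's source is inside the
        # n x m sub-lattice whenever its destination is, so masking on the
        # destination alone is enough.
        edges = _TEMPLATE
        mask = (edges[:, 3] < n) & (edges[:, 4] < m)
        return edges[mask]
    return full_lattice(n, m)  # fall back for longer strings

String-dependent edges would then be appended to the returned subset, as in the last step above.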
This is a cool idea!
I'd just like to retain the original very general way to build lattices as well though, maybe in its own class.
For another way of looking at the problem, please see this notebook. I've taken your examples and tokenised them - then extracted features for each token instead of each character. As you'd expect, this speeds things up tremendously.
For your problem I think extensive feature engineering will have to happen in any case - I don't think we can expect that just adding character features will solve the problem even after showing it many examples.
Do you maybe have more examples that we can try features on, and more ideas for features? One way is to just list the reasons a human would know the correct label - for example, if the two words in a lattice position are synonyms, that would give evidence of a match.
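Listing those human-visible reasons as token-pair features might look something like this (an illustrative hand-rolled sketch, not pyhacrf's feature API - the synonym list is made up):

SYNONYMS = {('st', 'street'), ('ave', 'avenue'), ('univ', 'university')}

def token_features(a, b):
    # Features for one lattice position comparing token a with token b.
    a, b = a.lower(), b.lower()
    return [
        1.0,                                    # bias
        1.0 if a == b else 0.0,                 # exact match
        1.0 if a[:3] == b[:3] else 0.0,         # shared 3-char prefix
        1.0 if (a, b) in SYNONYMS or (b, a) in SYNONYMS else 0.0,
        1.0 if a.isdigit() and b.isdigit() else 0.0,  # both numeric
    ]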
That sounds excellent. I have a lot of training data I can pull together.
@dirko, here are some more training examples along the lines of highered https://gist.github.com/anonymous/83ebecc63ddf2f981894
Cool that's perfect - let's see what works.
Do you mind if I write a blog post about the process using some of the data?
Sounds really cool! I'm going to promote this awesome project, once it's ready to integrate into dedupe.
Hey @dirko,
So I think we are nearing the end of the line for what can be done in Python land.
We are now spending over half our time in the C code:
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.239 0.239 64.424 64.424 speed.py:1(<module>)
100000 0.491 0.000 59.043 0.001 __init__.py:62(__call__)
100000 1.255 0.000 48.388 0.000 pyhacrf.py:134(predict_proba)
100000 2.290 0.000 39.460 0.000 pyhacrf.py:266(predict)
100000 31.271 0.000 31.271 0.000 {pyhacrf.algorithms.forward_predic
About 25 of those 31 seconds are spent in logaddexp.
I'm interested in using sse_math.h, which has very optimized versions of log and exp. Unfortunately, this is a little bit beyond me right now (I never really learned C).
In case you are interested, here are some resources:
It looks like there's maybe a 25%-35% opportunity here. Any interest in doing this?
I've recently come across SSE, and it looked like something worth knowing, so I have some interest in learning about it.
It seems, however, that the largest speedups would be for vectorised operations, and the inner logaddexp in our case isn't vectorised. There might be a way to vectorise over whole sets of examples, but I think it would be tricky.
If you are mainly interested in speeding up scoring (i.e. not training), then the max-sum (Viterbi) algorithm might give results that are about as good as the current sum-product. I'll add a predict_max method and then we can see how well that works.
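The difference between the two scoring rules, at a single lattice node (toy numbers, just to illustrate):

import numpy as np

incoming = np.array([-1.2, -0.7, -3.1])  # log-scores of the paths into one node

# Sum-product: total probability over all paths, i.e. log(sum(exp(...)))
sum_product = np.logaddexp.reduce(incoming)
# Max-sum (Viterbi): only the single best path survives
viterbi = incoming.max()

Viterbi replaces the expensive logaddexp with a plain comparison, which is why it is a candidate for faster scoring.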
Cool! I was wondering whether that algorithm would be applicable - I wasn't sure, since it seemed like all paths contribute to the final score.
Looks like exchanging logaddexp for max isn't really faster, and the accuracy is worse:
logaddexp (sum-product):
[[2643   51]
 [  98 2562]]
2.783% error
1 loops, best of 3: 1min 4s per loop

max (Viterbi):
[[2665   29]
 [ 425 2235]]
8.48% error
1 loops, best of 3: 1min 4s per loop
[I'll push the changes so you can play with them if you want to.]
At this stage I think I'll probably only revisit SSE if it comes up somewhere else - it seems that things like tokenising and generating features per token will be quicker wins here (also, I suspect you'll get better performance on the address problem that way).
Thanks for looking into that, @dirko. Why don't you push your changes as a branch? What you say about SSE makes a lot of sense - I wasn't hip to the fact that you only see benefits if you vectorize.
Wasn't there also an idea of using Levenshtein distance as a speedup?
Hi @dirko, I'm really surprised that you are seeing such a small difference between logaddexp and max. For me, it's the difference between 14 seconds and 2 seconds:
import highered

ed = highered.CRFEditDistance()
for _ in range(100000):
    ed('vn;ad89dp808vasdfdas', '23qrnqwjnczx98qbwjedadfa')
This is about 14 seconds with logaddexp and 2 seconds with max.
I agree with your analysis that sse logsumexp probably won't help us here because it would be very tricky to vectorize the code to take advantage of that.
I wonder if we might be able to take advantage of the cephes library here. https://github.com/jeremybarnes/cephes/blob/60f27df395b8322c2da22c83751a2366b82d50d1/cmath/unity.c#L56-L67
The current logaddexp is just:
cdef double c_logaddexp(double x, double y) nogil:
    cdef double tmp
    if x == y:
        return x + log1p(1)  # log1p(1) == log(2), the exact value for equal arguments
    else:
        tmp = x - y
        if tmp > 0:
            return x + log1p(exp(-tmp))
        elif tmp <= 0:
            return y + log1p(exp(tmp))
        else:
            return tmp  # tmp is NaN here; propagate it
So if we can get a faster log1p and a faster exp, we will be in great shape.
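One cheap trick that avoids SSE entirely: skip the transcendental calls when one argument dominates. For doubles, once |x - y| is larger than about 37, exp(-|x - y|) drops below machine epsilon relative to the larger argument, so logaddexp(x, y) equals max(x, y) to full precision. A plain-Python sketch of the idea (the exact threshold is my assumption - tune it against your tolerance):

from math import exp, log1p

def logaddexp_fast(x, y):
    if x < y:
        x, y = y, x      # ensure x >= y
    d = x - y
    if d > 37.0:         # exp(-37) ~ 8.5e-17: the correction term rounds away
        return x
    return x + log1p(exp(-d))

Since the big terms in the forward pass often dwarf the rest, this fast path could fire on a large fraction of calls.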
Hi @fgregg, that's quite a speedup - I'll try to check it out. Unfortunately I'm a bit pressed for time at the moment with other projects, so I don't think I'll be able to put too much time into these types of optimisations.
Btw have you maybe tried tokenising and generating features per token?
Got it! I'll try the cephes library.
I can't think of any more speedups. I'll open new issues if any come up.
With a trained model, pyhacrf is spending all its time in the _forward and _build_lattice functions. Are these parts of the code stable enough for speed optimizations?