Interesting that most of the time is spent there - last time I profiled the training time it seemed that numpy.core._dotblas.dot was the bottleneck, and I concluded that further optimization would be difficult - but obviously that's not true now.
Since fast scoring is important, now is a good time to look into it. _forward and _build_lattice are intimately linked, so I guess they will have to be optimized together. I think they are stable enough - although the alternative state machines haven't been used much (they are tested, though).
An alternative to trying to optimize these methods (which I think will be quite tricky) is to add an alternative scoring method. Instead of the sum-product algorithm we can try Viterbi decoding - that might make it possible to implement better lattice search strategies/heuristics, at the cost of possibly worse accuracy (although McCallum found Viterbi worked better in some cases).
But first I'll profile a bit and see what's possible. (From your profile it seems that about 10% of the time is spent in the BLAS call; if that part is irreducible, we can't really expect more than a 10x speedup, at the very most, without changing the algorithm.)
I'm still developing my understanding of the code, but it looks like there are no transitions in the lattice between state 0 and state 1 (in the two-state case). If we have N classes, could we calculate the predicted probability for only N-1 classes in _forward? In the two-class case, that could halve the time.
Interesting idea - it kind of fundamentally changes the model, though. It would be perfect if we could keep the old behaviour as well - I'll think a bit about how to do it.
Nearly all the calls to .dot can be avoided by precomputing the dot products. The individual products are typically pretty small, so the opportunity for speedup is much more than 10x.
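Roughly, the idea is something like this (a sketch - the shapes and names are assumptions, not pyhacrf's exact API): one matrix multiply scores every needed lattice position against every state, replacing thousands of tiny .dot() calls inside the lattice loop.

import numpy as np

def precompute_scores(x, parameters, positions):
    # x: (I, J, K) feature array, one K-dim vector per lattice position.
    # parameters: (S, K), one weight vector per state.
    # positions: the (i, j) pairs that actually occur in the lattice.
    I, J = zip(*positions)
    # One (n, K) x (K, S) multiply instead of n * S tiny .dot() calls.
    scores = np.dot(x[I, J, :], parameters.T)
    return {pos: row for pos, row in zip(positions, scores)}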
https://github.com/dirko/pyhacrf/commit/02ddc35bbb52fb9275dccadb1591755609c4fa24
ncalls tottime percall cumtime percall filename:lineno(function)
4088 0.256 0.000 31.032 0.008 pyhacrf.py:124(predict_proba)
4088 0.093 0.000 17.646 0.004 pyhacrf.py:326(__init__)
4088 13.001 0.003 17.553 0.004 pyhacrf.py:435(_build_lattice)
4087 0.015 0.000 13.065 0.003 pyhacrf.py:355(predict)
4087 0.097 0.000 13.017 0.003 pyhacrf.py:360(_forward_probabilities)
4087 11.392 0.003 12.917 0.003 pyhacrf.py:374(_forward)
That's neat! Makes a lot of sense.
Btw on line 403 (https://github.com/dirko/pyhacrf/commit/02ddc35bbb52fb9275dccadb1591755609c4fa24#diff-a463d967e7529d28ffb5703500397490R403) the two states s0 and s1 will differ if more complicated state machines than the default are used.
Good to know that about the states!
So, with the dot products precomputed, kernprof shows that 30% of the time in _forward is spent on logaddexp, and the rest on Python-overhead stuff. That's the kind of thing that Cython can usually help with a lot.
Timer unit: 1e-06 s
File: /home/fgregg/public/pyhacrf/pyhacrf/pyhacrf.py
Function: _forward at line 374
Total time: 18.0721 s
Line # Hits Time Per Hit % Time Line Contents
==============================================================
374 @profile
375 def _forward(self, parameters):
376 """ Helper to calculate the forward weights. """
377 1798 8020 4.5 0.0 alpha = defaultdict(lambda: -np.inf)
378 1798 4369 2.4 0.0 dot = np.dot
379 1798 3395 1.9 0.0 logaddexp = np.logaddexp
380 1798 3299 1.8 0.0 x = self.x
381 1798 852929 474.4 4.7 k = sorted(set((node[0], node[1]) for node in self._lattice))
382 1798 4095 2.3 0.0 if not k :
383 1 1 1.0 0.0 return alpha
384
385 1797 32064 17.8 0.2 I, J = zip(*k)
386 1797 3612 2.0 0.0 all_scores = {}
387
388 #import pdb
389 #pdb.set_trace()
390
391 1797 3681 2.0 0.0 all_scores = {node : score for
392 node, score in
393 1797 235808 131.2 1.3 zip(k, dot(x[I, J, :], parameters.T))}
394
395 1111007 2081324 1.9 11.5 for node in self._lattice:
396 1109211 2154515 1.9 11.9 if len(node) == 3:
397 309778 564027 1.8 3.1 i, j, s = node
398 309778 550805 1.8 3.0 if i == 0 and j == 0:
399 3594 11270 3.1 0.1 alpha[node] = all_scores[(i,j)][s]
400 else:
401 306184 965576 3.2 5.3 alpha[node] += all_scores[(i,j)][s]
402 else:
403 799433 1538643 1.9 8.5 i0, j0, s0, i1, j1, s1, edge_parameter_index = node
404 # Actually an edge in this case
405 # Use the features at the destination of the edge.
406 799433 1760960 2.2 9.7 edge_potential = (all_scores[(i1, j1)][edge_parameter_index]
407 799433 2030140 2.5 11.2 + alpha[(i0, j0, s0)])
408 #alpha[node] = edge_potential
409 799432 5260476 6.6 29.1 alpha[(i1, j1, s1)] = logaddexp(alpha[(i1, j1, s1)], edge_potential)
410 1796 3083 1.7 0.0 return alpha
Agreed.
I think it might also help a lot to store the lattice in some other data structure - it might fit into a numpy array. So instead of a list of tuples there could be a 2-dimensional array containing indices into another array.
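A rough sketch of that packed representation (names are hypothetical): split the current mixed list of 3-tuples (nodes) and 7-tuples (edges) into two contiguous integer arrays that a tight loop - or later Cython - can walk without unpacking Python tuples.

import numpy as np

def pack_lattice(lattice):
    # Nodes are (i, j, s); edges are (i0, j0, s0, i1, j1, s1, parameter_index).
    nodes = np.array([t for t in lattice if len(t) == 3], dtype=np.int64)
    edges = np.array([t for t in lattice if len(t) == 7], dtype=np.int64)
    return nodes, edges  # shapes (n_nodes, 3) and (n_edges, 7)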
Why is line 408 above commented out?
Sounds good.
For line 408, I couldn't find any place that used the alphas for the edge nodes.
Ah, you're right - the edge alphas are not used during scoring, only training.
I'll split _forward up into a training _forward and a scoring _forward a bit later. For now, because that line only accounts for 8% of the run time, I'll keep it as one method to simplify the refactoring.
Makes sense.
With my latest PR, we are really spending all our time in _build_lattice:
2042 0.017 0.000 21.931 0.011 __init__.py:34(__call__)
2042 0.317 0.000 18.937 0.009 pyhacrf.py:125(predict_proba)
2042 0.148 0.000 17.553 0.009 pyhacrf.py:217(__init__)
2042 12.632 0.006 17.404 0.009 pyhacrf.py:287(_build_lattice)
What ideas do you have for speeding up that method?
It seems like some things might be possible if we knew that the transitions were always local: (1, 1), (0, 1), and (1, 0). Also, is it ever the case that State 2 is different from State 1?
Here's an idea for _build_lattice.
Transitions can be separated into those that depend on the particular pair of strings and those that are independent of it.
For the independent transitions we can precalculate a large lattice, say equivalent to two strings of 50 characters each. For a particular pair of real strings, if both are shorter than 50 characters, we just subset the precalculated lattice. If one of them is longer, we use the precalculated lattice as the basis.
When we have transitions that depend on the pair of strings, we add those edges, as necessary, to the precalculated lattice.
So, in _build_lattice, we would just subset and patch that precalculated template.
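A sketch of the caching step, under the assumption that edges are packed as integer rows with the destination (i1, j1) in columns 3 and 4, and that full_lattice stands in for the current slow construction (both names are hypothetical):

import numpy as np

_TEMPLATE = None  # edges of a 50 x 50 lattice, built once and reused

def build_lattice_cached(n, m, template_size=50):
    global _TEMPLATE
    if _TEMPLATE is None:
        _TEMPLATE = full_lattice(template_size, template_size)  # slow path, run once
    if n <= template_size and m <= template_size:
        # For monotone local transitions, an edge's source is inside the
        # n x m sub-lattice whenever its destination is, so masking on the
        # destination alone is enough.
        edges = _TEMPLATE
        mask = (edges[:, 3] < n) & (edges[:, 4] < m)
        return edges[mask]
    return full_lattice(n, m)  # fall back for longer strings

String-dependent edges would then be appended to the returned subset, as in the last step above.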
This is a cool idea!
I'd just like to retain the original very general way to build lattices as well though, maybe in its own class.
For another way of looking at the problem, please see this notebook. I've taken your examples and tokenised them - then extracted features for each token instead of each character. As you'd expect, this speeds things up tremendously.
For your problem I think extensive feature engineering will have to happen in any case - I don't think we can expect that just adding character features will solve the problem even after showing it many examples.
Do you maybe have more examples that we can try features on, and more ideas for features? One way is to just list the reasons a human would know the correct label - for example, if the two words in a lattice position are synonyms, that would give evidence of a match.
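Listing those human-visible reasons as token-pair features might look something like this (an illustrative hand-rolled sketch, not pyhacrf's feature API - the synonym list is made up):

SYNONYMS = {('st', 'street'), ('ave', 'avenue'), ('univ', 'university')}

def token_features(a, b):
    # Features for one lattice position comparing token a with token b.
    a, b = a.lower(), b.lower()
    return [
        1.0,                                    # bias
        1.0 if a == b else 0.0,                 # exact match
        1.0 if a[:3] == b[:3] else 0.0,         # shared 3-char prefix
        1.0 if (a, b) in SYNONYMS or (b, a) in SYNONYMS else 0.0,
        1.0 if a.isdigit() and b.isdigit() else 0.0,  # both numeric
    ]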
That sounds excellent. I have a lot of training data I can pull together.
@dirko, here are some more training examples along the lines of highered https://gist.github.com/anonymous/83ebecc63ddf2f981894
Cool that's perfect - let's see what works.
Do you mind if I write a blog post about the process using some of the data?
Sounds really cool! I'm going to promote this awesome project, once it's ready to integrate into dedupe.
Hey @dirko,
So I think we are nearing the end of the line for what can be done in Python land.
We are now spending over half our time in the C code:
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.239 0.239 64.424 64.424 speed.py:1(<module>)
100000 0.491 0.000 59.043 0.001 __init__.py:62(__call__)
100000 1.255 0.000 48.388 0.000 pyhacrf.py:134(predict_proba)
100000 2.290 0.000 39.460 0.000 pyhacrf.py:266(predict)
100000 31.271 0.000 31.271 0.000 {pyhacrf.algorithms.forward_predic
About 25 of those 31 seconds are spent in logaddexp.
I'm interested in using sse_math.h, which has very optimized versions of log and exp. Unfortunately, this is a little bit beyond me right now (I never really learned C).
In case you are interested, here are some resources:
It looks like there's maybe a 25%-35% opportunity here. Any interest in doing this?
I've recently come across SSE, and it looked like something worth knowing, so I have some interest in learning about it.
It seems, however, that the largest speedups would be for vectorised operations, and the inner logaddexp in our case isn't vectorised. There might be a way to vectorise over whole sets of examples, but I think it would be tricky.
If you are mainly interested in speeding up scoring (i.e. not training), then the max-sum (Viterbi) algorithm might give results that are about as good as the current sum-product. I'll add a predict_max method and then we can see how well that works.
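The difference between the two scoring rules, at a single lattice node (toy numbers, just to illustrate):

import numpy as np

incoming = np.array([-1.2, -0.7, -3.1])  # log-scores of the paths into one node

# Sum-product: total probability over all paths, i.e. log(sum(exp(...)))
sum_product = np.logaddexp.reduce(incoming)
# Max-sum (Viterbi): only the single best path survives
viterbi = incoming.max()

Viterbi replaces the expensive logaddexp with a plain comparison, which is why it is a candidate for faster scoring.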
Cool! I was wondering whether that algorithm would be applicable - I wasn't sure, since it seemed like all paths contribute to the final score.
Looks like exchanging logaddexp for max isn't really faster, and the accuracy is worse:
logaddexp (sum-product):
[[2643   51]
 [  98 2562]]
2.783% error
1 loops, best of 3: 1min 4s per loop

max (Viterbi):
[[2665   29]
 [ 425 2235]]
8.48% error
1 loops, best of 3: 1min 4s per loop
[I'll push the changes so you can play with them if you want to.]
At this stage I think I'll probably only revisit SSE if it comes up somewhere else - it seems that things like tokenising and generating features per token will be quicker wins here (also, I suspect you'll get better performance on the address problem that way).
Thanks for looking into that, @dirko. Why don't you push your changes as a branch? What you say about SSE makes a lot of sense - I wasn't hip to the fact that you only see benefits if you vectorize.
Wasn't there also an idea of using Levenshtein distance as a speedup?
Hi @dirko, I'm really surprised that you are seeing such a small difference between logaddexp and max. For me, it's the difference between 14 seconds and 2 seconds:
import highered

ed = highered.CRFEditDistance()
for _ in range(100000):
    ed('vn;ad89dp808vasdfdas', '23qrnqwjnczx98qbwjedadfa')
This is about 14 seconds with logaddexp and 2 seconds with max.
I agree with your analysis that sse logsumexp probably won't help us here because it would be very tricky to vectorize the code to take advantage of that.
I wonder if we might be able to take advantage of the cephes library here. https://github.com/jeremybarnes/cephes/blob/60f27df395b8322c2da22c83751a2366b82d50d1/cmath/unity.c#L56-L67
The current logaddexp is just:
cdef double c_logaddexp(double x, double y) nogil:
    cdef double tmp
    if x == y:
        return x + log1p(1)  # log1p(1) == log(2), the exact value for equal arguments
    else:
        tmp = x - y
        if tmp > 0:
            return x + log1p(exp(-tmp))
        elif tmp <= 0:
            return y + log1p(exp(tmp))
        else:
            return tmp  # tmp is NaN here; propagate it
So if we can get a faster log1p and a faster exp, we will be in great shape.
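One cheap trick that avoids SSE entirely: skip the transcendental calls when one argument dominates. For doubles, once |x - y| is larger than about 37, exp(-|x - y|) drops below machine epsilon relative to the larger argument, so logaddexp(x, y) equals max(x, y) to full precision. A plain-Python sketch of the idea (the exact threshold is my assumption - tune it against your tolerance):

from math import exp, log1p

def logaddexp_fast(x, y):
    if x < y:
        x, y = y, x      # ensure x >= y
    d = x - y
    if d > 37.0:         # exp(-37) ~ 8.5e-17: the correction term rounds away
        return x
    return x + log1p(exp(-d))

Since the big terms in the forward pass often dwarf the rest, this fast path could fire on a large fraction of calls.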
Hi @fgregg, that's quite a speedup - I'll try to check it out. Unfortunately I'm a bit pressed for time at the moment with other projects, so I don't think I'll be able to put too much time into these types of optimisations.
Btw have you maybe tried tokenising and generating features per token?
Got it! I'll try the cephes library.
I can't think of any more speedups. I'll open new issues if any come up.
With a trained model, pyhacrf is spending all its time in the _forward and _build_lattice functions. Are these parts of the code stable enough for speed optimizations?