chokkan / crfsuite

CRFsuite: a fast implementation of Conditional Random Fields (CRFs)
http://www.chokkan.org/software/crfsuite/
Other
641 stars 208 forks source link

store the argmax score in Viterbi inference in a stack variable #82

Closed albarrentine closed 7 years ago

albarrentine commented 7 years ago

Hey @chokkan,

Firstly, thanks for your many great contributions to open-source NLP!

In writing the CRF implementation for libpostal, which borrows heavily from CRFsuite's (ours required the ability to handle a data set larger than memory as well as joint state+transition features, so couldn't use CRFsuite verbatim), there was one simple optimization I found useful and wanted to contribute it back to this repo.

During the quadratic section of Viterbi inference, the backpointer array is used to store the best label index i for time t-1 that would make the transition to label index j for time t most likely. The current label j is a given during the inner loop, but whenever a candidate for the previous label exceeds the current max score, we have to make a trip to the heap to store the new backpointer. This results in O(N²) array accesses worst case, whereas we could formulate it slightly differently by defining the backpointer for index j as simply the argmax of the L scores computed. This way, we can store the argmax in a much cheaper-to-access stack variable and only touch the backpointer array once after the inner loop is complete, which guarantees only O(N) array accesses in total.

Normally worrying about such minutiae in C is not warranted, but since that is the quadratic section of the code and this is not the sort of thing the compiler can necessarily optimize out (unless maybe the pointer were declared with restrict), I thought it might be worthwhile to try. In my experiments with ~20 labels, this small optimization resulted in a ~2x performance increase in Viterbi inference. YMMV but larger gains should be possible with more labels. This speeds up both runtime tagging as well as averaged perceptron training, which does not need to compute derivatives and only requires running inference over the sequence. Using this method, I was able to train a CRF with the averaged perceptron on over 1 billion sequences (international street addresses) in time roughly equivalent to a greedy averaged perceptron.

Though most people are probably using LBFGS on smaller models, hopefully the runtime performance gains are useful to folks!

Cheers, Al

bratao commented 7 years ago

@thatdatabaseguy Thank you for you patch.

There is any chance that you also implement in crfsuite this state+transition features from libpostal ? It could be very useful to our case.

albarrentine commented 7 years ago

@bratao I don't want to be presumptuous :). State-transition features in crfsuite would be a little more involved, requiring either some sort of feature templating or introducing a separate "bucket" for the state-transition features similar to what we do in libpostal (i.e. an optional third column in the training file format).

chokkan commented 7 years ago

Thank you for the PR. Personally, I do not believe this makes a difference in performance with a modern compiler, but the submitted code is equivalent and cleaner.

Thank you for referring CRFsuite in libpostal.

albarrentine commented 7 years ago

Thanks Professor!

In my experiments, even with modern gcc/clang and -O3 this still resulted in around 2x faster Viterbi. I'm not an optimization expert by any means, but would suspect that while compilers can optimize the ordering of array assignments made in an inner loop (loop interchange in a matrix computation for example), they would be unable to optimize out the array assignment altogether (or do any vectorization because of the conditional statement if (max_score < score)).

It would also of course depend on the data and the labels i.e. in NER if the label at index 0 is the out label "O" (or generally the most common label in the data set), the back-pointer array should only be accessed N times in most cases anyway because the first score will usually be the max_score. This should help mostly for tasks like part-of-speech tagging or libpostal's address parsing where there are no background labels.

Thank you for merging in any case, and for making this amazing project available to practitioners!