chokkan / crfsuite

CRFsuite: a fast implementation of Conditional Random Fields (CRFs)
http://www.chokkan.org/software/crfsuite/

Forced decoding support for partially labelled sequences? #96

Open Pantamis opened 6 years ago

Pantamis commented 6 years ago

First, thank you for this wonderful library!

I think CRFsuite is one of the only libraries that can learn different kinds of features depending on the label during training (that is, it updates different weights depending on the label).

I am trying to use a CRF on unusual language data. In particular, some labels are so specific that I can get them simply with a regex. This means I have access to the true labels for parts of my sequences even at prediction time. Wapiti supports what it calls 'forced decoding': https://wapiti.limsi.fr/manual.html#forced The principle is to improve decoding by using the known true labels: Viterbi is run conditioned on both the inputs and the known labels.
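
The constraint can be written down explicitly. The notation here is an assumption introduced for illustration, not taken from the Wapiti or CRFsuite documentation: $s_t(y)$ is the score of label $y$ at position $t$, $a(y', y)$ is the transition score from label $y'$ to label $y$, and $K$ is the set of positions whose true label $\ell_t$ is known.

```latex
\hat{y} = \operatorname*{arg\,max}_{y \,:\, y_t = \ell_t \;\forall t \in K}
  \left[ s_1(y_1) + \sum_{t=2}^{T} \bigl( a(y_{t-1}, y_t) + s_t(y_t) \bigr) \right]
```

Plain Viterbi decoding is the special case $K = \emptyset$, where the maximum runs over all label sequences.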

I think it could be a really powerful combination for this library: together with the label-dependent feature selection during training that I described above, it would make it possible to include rule-based priors on the label sequence in the CRF model.

I wish I could contribute, but C is not my cup of tea. Could such a feature be considered for the library in the future?

Thank you again for this nice work!

usptact commented 6 years ago

You might be interested in https://github.com/Oneplus/partial-crfsuite

Pantamis commented 6 years ago

Thank you very much for your answer.

This library also looks very nice, but I think it is not what I was talking about (even though such a feature is very interesting!). It uses partially labeled sequences for training, that is, sequences for which you do not have all the labels. I would like to use the labels that may be known at test time to improve the prediction for a given sequence.

Forced decoding is implemented in Wapiti by removing some features before Viterbi decoding: https://github.com/Jekub/Wapiti/blob/569fbe5040583086f8d26667f6b793dc641536b0/src/decoder.c#L183

From what I understand of the CRFsuite code, the model should store the feature set somewhere (though I have not yet found out where or how). For forced decoding, temporarily removing the features associated with labels other than the observed one should not be too hard.
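
To make the idea concrete, here is a minimal sketch in plain Python. It is not CRFsuite or Wapiti code: `forced_viterbi`, `state_scores`, `trans_scores` and `known` are hypothetical names, and it assumes the per-position label scores and the transition scores have already been computed from the model. Instead of removing features, it gives every label that contradicts a known label a score of minus infinity, which has the same effect on the best path.

```python
import math

NEG_INF = -math.inf


def forced_viterbi(state_scores, trans_scores, known):
    """Viterbi decoding restricted to paths that agree with the known labels.

    state_scores[t][j]: score of label j at position t (assumed precomputed)
    trans_scores[i][j]: score of moving from label i to label j
    known[t]: index of the label known at position t, or None if unknown
    """
    T = len(state_scores)
    if T == 0:
        return []
    L = len(trans_scores)

    # Labels that contradict a known label get a score of -inf.
    masked = [
        [s if known[t] is None or known[t] == j else NEG_INF
         for j, s in enumerate(state_scores[t])]
        for t in range(T)
    ]

    delta = [masked[0][:]]  # best score of any path ending in each label
    back = []               # back-pointers for positions 1..T-1
    for t in range(1, T):
        row, ptr = [], []
        for j in range(L):
            best_i = max(range(L),
                         key=lambda i: delta[-1][i] + trans_scores[i][j])
            row.append(delta[-1][best_i] + trans_scores[best_i][j] + masked[t][j])
            ptr.append(best_i)
        delta.append(row)
        back.append(ptr)

    # Follow the back-pointers from the best final label.
    path = [max(range(L), key=lambda j: delta[-1][j])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    path.reverse()
    return path


# Toy example with two labels; transitions favour staying on the same label.
state_scores = [[2.0, 0.0], [0.0, 1.0], [0.5, 0.0]]
trans_scores = [[1.0, 0.0], [0.0, 1.0]]

free = forced_viterbi(state_scores, trans_scores, [None, None, None])
forced = forced_viterbi(state_scores, trans_scores, [None, 1, None])
print(free)    # [0, 0, 0]
print(forced)  # [0, 1, 1]
```

In this toy example, fixing the middle label to 1 also changes the prediction for the last position, which is the kind of improvement forced decoding is meant to bring to the unlabeled neighbours.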

It would be great to have this feature in CRFsuite!

Maybe I will do it myself one day.