chokkan / crfsuite

CRFsuite: a fast implementation of Conditional Random Fields (CRFs)
http://www.chokkan.org/software/crfsuite/

Question about active features #91

Open kite1988 opened 7 years ago

kite1988 commented 7 years ago

I used CRFsuite to train a model for a named entity recognition task. I set feature.minfreq to 0 (no feature cutoff), but the number of active features (17450) is much smaller than the total number of features (79075). Here is the relevant snippet of the log:

Number of active features: 17450 (79075)
Number of active attributes: 5310 (64323)
Number of active labels: 21 (21)

Does anyone know how the active features are selected? Another question: what is the difference between active features and active attributes? Thanks very much!

chokkan commented 7 years ago

CRFsuite removes features whose weight is zero once training has finished. In your case, 79075 features were used during training, but only 17450 of them were assigned non-zero weights by the training algorithm. For this reason, 79075 - 17450 = 61625 features are removed from the model.

Roughly speaking, state features are pairs of attributes and labels. When a feature is removed from a model, the attribute associated with it may no longer be referred to by any other feature, in which case the attribute can be pruned as well. In your case, 5310 attributes are associated with at least one feature with a non-zero weight, while the rest are associated only with zero-weight features. For this reason, CRFsuite removed 64323 - 5310 = 59013 attributes from the model.
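To make the relation between the two counts concrete, here is a minimal sketch in Python. It is not CRFsuite's code: the attribute names, labels, and weights are made up for illustration, and transition features (label-to-label pairs) are ignored. It only shows how dropping zero-weight (attribute, label) pairs also shrinks the set of attributes that are still referenced.

```python
# Hypothetical toy data: (attribute, label) -> weight after training.
weights = {
    ("w=London", "B-LOC"): 1.7,
    ("w=London", "O"): 0.0,
    ("w=the", "O"): 0.4,
    ("w=the", "B-LOC"): 0.0,
    ("suffix=xyz", "O"): 0.0,
    ("suffix=xyz", "B-LOC"): 0.0,
}


def prune(weights):
    """Keep non-zero features and the attributes they still refer to."""
    active = {key: w for key, w in weights.items() if w != 0.0}
    active_attrs = {attr for attr, _label in active}
    return active, active_attrs


active, active_attrs = prune(weights)
all_attrs = {attr for attr, _label in weights}

# Mimic the shape of the log lines: active count, then total in parentheses.
print(f"Number of active features: {len(active)} ({len(weights)})")
print(f"Number of active attributes: {len(active_attrs)} ({len(all_attrs)})")
```

In this toy case "suffix=xyz" disappears from the attribute set because both of its features have zero weight, while "w=London" and "w=the" each keep one feature and therefore stay active.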

I guess you used L1-regularization for training the model. It has a similar effect to setting a frequency cutoff.
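As a rough illustration of why L1 regularization produces exact zeros, the sketch below applies soft-thresholding, the textbook shrinkage step associated with an L1 penalty. This is an assumption made for illustration only: it is not the optimizer CRFsuite runs, and the weights and the threshold value are invented.

```python
def soft_threshold(w, c1):
    """Shrink w toward zero by c1; anything within [-c1, c1] becomes exactly 0."""
    if w > c1:
        return w - c1
    if w < -c1:
        return w + c1
    return 0.0


# Hypothetical weights; rare or uninformative features tend to have small ones.
weights = [0.9, 0.05, -0.3, -0.02]
shrunk = [soft_threshold(w, 0.1) for w in weights]

# Count how many weights were driven to exactly zero.
num_zero = sum(1 for w in shrunk if w == 0.0)
print(shrunk, num_zero)
```

Features whose weights stay small are set to exactly zero rather than merely made smaller, which is why the outcome resembles a frequency cutoff.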

arvinarvi commented 6 years ago

Is the tagging done using only the active features? How are the potentials computed for tokens in the evaluation set whose attributes do not appear in the model file? Thanks for your reply.

usptact commented 6 years ago

@arvinarvi Features that appear only at tagging time and are not in the model get a weight of zero.
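A minimal sketch of what that means for the per-token state score, assuming the active state features have been loaded into a plain dictionary. The dictionary contents, the labels, and the function name are hypothetical, and this is not CRFsuite's implementation: any (attribute, label) pair missing from the model simply contributes zero.

```python
# Hypothetical active state features: (attribute, label) -> weight.
active_state_features = {
    ("w=London", "B-LOC"): 1.7,
    ("w=the", "O"): 0.4,
}
labels = ["B-LOC", "O"]


def state_scores(token_attrs, labels, weights):
    """Sum value * weight per label; pairs absent from the model count as 0."""
    return {
        label: sum(
            value * weights.get((attr, label), 0.0)
            for attr, value in token_attrs.items()
        )
        for label in labels
    }


# "w=unseen" never occurs in the model, so it adds nothing to any label.
scores = state_scores({"w=London": 1.0, "w=unseen": 1.0}, labels, active_state_features)
print(scores)
```

These are scores in log space; exponentiating them gives the state potentials. A token whose attributes are all unknown ends up with a score of zero for every label, so its tag is decided by the transition scores alone.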

arvinarvi commented 6 years ago

@usptact Thank you for your reply.

arvinarvi commented 6 years ago

I am implementing a sequence labeling system that extracts the learned potentials from a CRFsuite model file and applies a different inference algorithm to them. I am finding it difficult to extract the state-feature potentials for a given token in the evaluation set (when the token is present in the model file) in a general way, because only the active features are stored. Can anyone help me figure this out? (If more information is needed, I can be more specific about my problem.) Thanks.

marctorsoc commented 6 years ago

I don't understand @chokkan's answer. If every feature is an attribute+label pair, then there should be at least as many features as attributes, and in theory many more, since an attribute might appear with different labels. Can someone explain, please?