chokkan / crfsuite

CRFsuite: a fast implementation of Conditional Random Fields (CRFs)
http://www.chokkan.org/software/crfsuite/

Exclude sentence with only O #102

Open ericcanadas opened 6 years ago

ericcanadas commented 6 years ago

Hi,

More a question than an issue: is it useful to leave sentences that contain only O labels in the training set? Example (here, the sentence "The dog is brown"):

EU B-ORG
rejects O
German B-MISC
call O
to O
boycott O
British B-MISC
lamb O
. O

The O
dog O
is O
brown O

Peter B-PER
Blackburn I-PER

usptact commented 6 years ago

@EricC91 Yes, it is beneficial to keep sentences that have only "O" labels. These are so-called negative examples. Having selected negatives in your training set makes your model much more robust to false positives (tagging where the model should not tag).
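To make "negative example" concrete, here is a minimal sketch that finds the all-"O" sentences in data laid out like the sample in the question. It assumes one `token label` pair per line and a blank line between sentences; the helper names `read_sentences` and `is_negative` are made up for illustration and are not part of CRFsuite.

```python
def read_sentences(text):
    """Split blank-line-separated 'token label' lines into sentences."""
    sentences, current = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            # A blank line closes the current sentence, if any.
            if current:
                sentences.append(current)
                current = []
            continue
        # The label is assumed to be the last whitespace-separated field.
        token, label = line.rsplit(None, 1)
        current.append((token, label))
    if current:
        sentences.append(current)
    return sentences


def is_negative(sentence):
    """True when every token in the sentence carries the O label."""
    return all(label == "O" for _, label in sentence)


sample = """EU B-ORG
rejects O
German B-MISC

The O
dog O
is O
brown O

Peter B-PER
Blackburn I-PER
"""

sentences = read_sentences(sample)
negatives = [s for s in sentences if is_negative(s)]
negative_tokens = [[token for token, _ in s] for s in negatives]
```

In this sample only "The dog is brown" is a negative example, so it is the one sentence that would be kept as a negative rather than filtered out.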

From my experience building many custom NER models, it is beneficial to add negative examples in small batches. The ones you add in each iteration should be the ones where the current model wrongly tags something. After a couple of such iterations on some random examples, your model will learn pretty quickly. The added examples must be diverse.
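One way to read the batch-wise procedure is the sketch below. It is only an illustration under assumptions: `predict` stands for whatever tagging function your trained model exposes (a toy stand-in is used here), `select_hard_negatives` is a hypothetical helper, and the pool is assumed to contain sentences already known to hold no entities. It does not enforce diversity; that still has to come from how the pool is sampled.

```python
def select_hard_negatives(predict, negative_pool, batch_size):
    """Return up to batch_size all-O sentences that the model wrongly tags."""
    hard = []
    for tokens in negative_pool:
        labels = predict(tokens)
        # Any non-O prediction on a known-negative sentence is a false positive.
        if any(label != "O" for label in labels):
            hard.append(tokens)
            if len(hard) == batch_size:
                break
    return hard


def toy_predict(tokens):
    """Toy stand-in for a trained tagger: tags every capitalized token."""
    return ["B-MISC" if tok[:1].isupper() else "O" for tok in tokens]


# Sentences assumed to contain no entities at all.
pool = [
    ["The", "dog", "is", "brown"],
    ["it", "rains", "today"],
    ["A", "cat", "sleeps"],
]

hard = select_hard_negatives(toy_predict, pool, batch_size=2)

# Label the selected sentences with O only, ready to append to the training set.
new_training = [[(tok, "O") for tok in sent] for sent in hard]
```

After appending `new_training` to the training data you would retrain and repeat with a fresh pool, stopping once the model no longer tags the negatives.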