explosion / spaCy

💫 Industrial-strength Natural Language Processing (NLP) in Python
https://spacy.io
MIT License

unexpected label produced by custom-trained dependency parser #5405

Closed: fresejoerg closed this issue 4 years ago

fresejoerg commented 4 years ago

I am using the CLI training interface to train a custom tagger and parser. The dependency labels are a custom set of semantic labels. The training data is converted from conll format. I am not using the --base-model argument, so I believe I'm starting from a blank model. Also, the output directory does not exist prior to training. After training, the model sometimes outputs an unexpected dependency tag ('dep') which is not part of my training data.

Info about spaCy

This issue links to this stackoverflow question.

import json

# load the converted training and dev data (spaCy v2 JSON format)
with open("/path/to/my/train_data.json", 'r') as j:
    contents_train = json.load(j)

with open("/path/to/my/dev_data.json", 'r') as j:
    contents_dev = json.load(j)

contents = contents_train + contents_dev

# count how often each dependency label occurs in the data
labels = {}
for c in contents:
    for p in c['paragraphs']:
        for s in p['sentences']:
            for t in s['tokens']:
                if t['dep'] in labels:
                    labels[t['dep']] += 1
                else:
                    labels[t['dep']] = 1

print(labels)
{'compound': 139, 'ROOT': 171, '-': 386, 'modification': 122, 'proximity': 65, 'quality': 36, 'feature': 77, 'containment': 65, 'cuisine': 10, 'availability': 10, 'timing': 10, 'pricing': 4, 'negation': 3, 'directional': 6, 'destination': 16, 'attachment': 1, 'origin': 7, 'access': 1, 'accessibility': 1, 'quantification': 2, 'tmode': 2}

# train the model
! python -m spacy train en full_model_trained_custSem \
/path/to/my/train_data.json \
/path/to/my/dev_data.json \
--pipeline 'tagger,parser' \
--gold-preproc

import spacy

# load trained model
nlp = spacy.load('full_model_trained_custSem/model-best')

# test trained model
q = "best deli in Seattle"
rel_list = []
for t in nlp(q):
    rel_list.append(t.text+' <-- '+t.dep_+' -- '+t.head.text)
print(rel_list)
['best <-- dep -- deli', 'deli <-- ROOT -- deli', 'in <-- - -- deli', 'Seattle <-- containment -- deli']
adrianeboyd commented 4 years ago

I'm not entirely sure, but I think this may be a bug related to using - as a label, because the parser (the same underlying model is used for both the parser and NER) has some special-case handling of - when processing labels like B-ORG for NER.

Do you have the same results if you replace - with a different dummy label like none?
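Something along these lines on the converted JSON data should do it (a rough, untested sketch; the output path is just an example):

import json

# Replace the '-' dummy dependency label with 'none' throughout the
# converted training data (same nested structure as in the snippet above).
with open("/path/to/my/train_data.json") as j:
    contents = json.load(j)

for doc in contents:
    for para in doc['paragraphs']:
        for sent in para['sentences']:
            for tok in sent['tokens']:
                if tok['dep'] == '-':
                    tok['dep'] = 'none'

# write the relabeled data to a new file (example path)
with open("/path/to/my/train_data_relabeled.json", "w") as j:
    json.dump(contents, j)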

fresejoerg commented 4 years ago

Thanks! I replaced - with other and haven't observed the issue again, so I think your intuition was correct.

adrianeboyd commented 4 years ago

Glad to hear it! I'll have to look into the details of how complicated it might be to fix the parser directly (since you really should be able to use any string label), but at least for now we should show an error when training data is loaded with - in a dependency label to avoid this problem.

A side note: unless you are only processing single sentences with your model when it's in use, I would recommend against using --gold-preproc. If you use this option, the parser won't learn to split sentences because it also splits the training data up into individual sentences while training and the parser never sees any sentence boundaries.

If you want to train with gold tokenization, then just remove the "raw" texts from your training data (if you have them) and it will learn from the gold tokens without splitting up documents. (Gold tokenization and single-sentence training ended up grouped together in this option for specific kinds of parser evaluations, when separate options would have been better.)
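In the JSON training format that just means dropping the "raw" field from each paragraph, roughly like this (untested sketch; the output path is just an example):

import json

# Remove the "raw" text from each paragraph so training uses the gold
# tokens without needing --gold-preproc.
with open("/path/to/my/train_data.json") as j:
    contents = json.load(j)

for doc in contents:
    for para in doc['paragraphs']:
        para.pop('raw', None)

with open("/path/to/my/train_data_no_raw.json", "w") as j:
    json.dump(contents, j)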

fresejoerg commented 4 years ago

Thanks for the pointer regarding --gold-preproc. I don't have the raw string in my data and the inputs are relatively short, non-grammatical fragments (search queries, in particular), so I'm not concerned about sentence splitting at the moment. But I'll do some experiments to see how omitting --gold-preproc affects LAS.

fresejoerg commented 4 years ago

The original issue just re-emerged for me. A freshly trained version of my model is predicting dep as a label. I verified that no - labels are in my training or dev dataset. So it appears that the root cause is unrelated to special handling for this character.

adrianeboyd commented 4 years ago

Hmm, what is nlp.get_pipe("parser").labels for your model? Are there any warnings/errors for your data with spacy debug-data?
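For reference, debug-data takes the same positional arguments as the train command, so something like this should work with the files from above:

! python -m spacy debug-data en \
/path/to/my/train_data.json \
/path/to/my/dev_data.json \
--pipeline 'tagger,parser'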

fresejoerg commented 4 years ago

This is what I get from nlp.get_pipe("parser").labels:

('ROOT',
 'compound',
 'containment',
 'dep',
 'feature',
 'modification',
 'other',
 'proximity',
 'quality')

which is a subset of all training labels (with the exception of 'dep', which is not in my training data).

Here's the output from debugging the training data:

============================= Dependency Parsing =============================
ℹ Found 260 sentences with an average length of 6.5 words.
⚠ The training data contains 1.06 sentences per document. When there
are very few documents containing more than one sentence, the parser will not
learn how to segment longer texts into sentences.
ℹ Found 3 nonprojective train sentences
ℹ 21 labels in train data
ℹ 27 labels in projectivized train data
'other' (567), 'ROOT' (256), 'compound' (209), 'modification' (171), 'feature'
(111), 'containment' (100), 'proximity' (97), 'quality' (56), 'destination'
(24), 'possession' (24), 'cuisine' (16), 'timing' (14), 'availability' (9),
'directional' (9), 'negation' (9), 'quantification' (7), 'origin' (7), 'pricing'
(6), 'tmode' (6), 'attachment' (2), 'distance' (1)
⚠ Low number of examples for label 'quantification' (7)
⚠ Low number of examples for label 'origin' (7)
⚠ Low number of examples for label 'availability' (9)
⚠ Low number of examples for label 'cuisine' (16)
⚠ Low number of examples for label 'pricing' (6)
⚠ Low number of examples for label 'timing' (14)
⚠ Low number of examples for label 'directional' (9)
⚠ Low number of examples for label 'negation' (9)
⚠ Low number of examples for label 'tmode' (6)
⚠ Low number of examples for label 'attachment' (2)
⚠ Low number of examples for label 'distance' (1)
⚠ Low number of examples for 6 labels in the projectivized dependency
trees used for training. You may want to projectivize labels such as punct
before training in order to improve parser performance.
⚠ Projectivized labels with low numbers of examples:
other||containment: 2 feature||containment: 1 containment||containment: 1
containment||compound: 1 other||other: 1 modification||other: 1
⚠ The following labels were found only in the train data:
feature||containment, timing, containment||containment, containment||compound,
modification||other, quantification, other||containment, other||other,
distance
To train a parser, your data should include at least 20 instances of each label.
⚠ Multiple root labels (ROOT, containment) found in training data.
spaCy's parser uses a single root label ROOT so this distinction will not be
available.
adrianeboyd commented 4 years ago

I noticed that one of our example scripts uses - as a label for a similar case without issues, so it must be something else.

I don't understand why debug-data would show 21 labels but you don't end up with all of them in the model labels. How many training docs do you have? There was a minor issue where the parser peeked at the first 1000 examples instead of examining all of them when adding labels. This peeking is still in v2.2.4, but will be removed in v2.3.0 (to be released soon, change in #5456).

fresejoerg commented 4 years ago

My sense is that the number of training examples is related to this issue, but probably not to the one you're referencing. The debug-data output above is for a training data set with 260 examples. I've since added some examples and have re-trained the model with 344 training examples, but still well below 1,000.

I noticed that there were previously two labels (destination and possession) which didn't make it into the model but also did not receive a "low number of examples" warning. These two labels are now in the model.

Unfortunately, adding more training data didn't prevent dep from showing up in the model. Looking at specific examples where the model actually predicts dep, I noticed that it occurs in cases where either the correct label would have been one of those with a "low number of examples" warning, or in edge cases where even a human expert can't confidently assign the correct label.

adrianeboyd commented 4 years ago

I think I figured out what's going on. There's a minimum label frequency parameter with a default value of 30, which explains why some labels are missing from the model. You can lower this by passing the parameter min_action_freq to Parser.begin_training. debug-data uses a cutoff of 20 instead of 30, which is confusing here.
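If you end up setting up training programmatically rather than through the CLI, a rough, untested sketch of what that could look like (the value 10, the example labels, and the empty data list are just placeholders):

import spacy

# spaCy v2.x sketch: pass min_action_freq to the parser's begin_training
# to lower the minimum label frequency (the default is 30).
nlp = spacy.blank('en')
parser = nlp.create_pipe('parser')
nlp.add_pipe(parser)

# add your dependency labels, e.g. from the counts computed earlier
parser.add_label('modification')
parser.add_label('proximity')

train_tuples = []  # your converted gold-standard training tuples go here
optimizer = parser.begin_training(lambda: train_tuples, min_action_freq=10)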

The dep label is coming from here as a backoff if there's no other good move:

https://github.com/explosion/spaCy/blob/925e93857034c29c46a8b582db4969df7ba50c06/spacy/syntax/arc_eager.pyx#L374-L376

spaCy v2 has a number of parameters and defaults that are spread across the code and hard to track down. The rewrite of thinc for spaCy v3 uses a much better config system where models can be saved with a complete config file, and there shouldn't be as many frustrating issues with hidden defaults.

fresejoerg commented 4 years ago

Thanks for sticking with this. I'm closing this issue as resolved.

github-actions[bot] commented 3 years ago

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.