explosion / spaCy

💫 Industrial-strength Natural Language Processing (NLP) in Python
https://spacy.io
MIT License

Some sentences don't have ROOT among token dependency relation markers while there is still a root in the sentence #10003

Closed khonzoda closed 2 years ago

khonzoda commented 2 years ago

How to reproduce the behaviour

With the new version of spaCy (3.2.1), some sentences have no token whose dependency label is ROOT. Nonetheless, sent.root still returns a root for these sentences. This seems to happen when a whitespace token is parsed as the root. Consider the following example:

import spacy

test_string = "Well...  pretty much most of what was in the old table, and the category links"
nlp = spacy.load('en_core_web_sm')
sents = nlp(test_string).sents
for sent in sents:
    print('Sentence: "{}"'.format(sent))
    print('Sentence root: "{}". Root token id: {}'.format(sent.root, sent.root.i))
    print('-'*30)
    for token in sent:
        # Print each token with its dependency label and the texts of its ancestors
        print('{}, {}, {}'.format(token.text, token.dep_, [p.text for p in token.ancestors]))

With spaCy 3.2.1, this code prints:

Sentence: "Well...  pretty much most of what was in the old table, and the category links"
Sentence root: " ". Root token id: 2
------------------------------
Well, intj, [' ']
..., punct, [' ']
 , dep, []
pretty, advmod, ['much', 'most', ' ']
much, advmod, ['most', ' ']
most, appos, [' ']
of, prep, ['most', ' ']
what, nsubj, ['was', 'of', 'most', ' ']
was, pcomp, ['of', 'most', ' ']
in, prep, ['was', 'of', 'most', ' ']
the, det, ['table', 'in', 'was', 'of', 'most', ' ']
old, amod, ['table', 'in', 'was', 'of', 'most', ' ']
table, pobj, ['in', 'was', 'of', 'most', ' ']
,, punct, [' ']
and, cc, [' ']
the, det, ['links', ' ']
category, compound, ['links', ' ']
links, conj, [' ']

With spaCy 3.1.4, on the other hand, I get:

Sentence: "Well...  pretty much most of what was in the old table, and the category links"
Sentence root: " ". Root token id: 2
------------------------------
Well, intj, [' ']
..., punct, [' ']
 , ROOT, []
pretty, advmod, ['much', 'most', ' ']
much, advmod, ['most', ' ']
most, npadvmod, [' ']
of, prep, ['most', ' ']
what, nsubj, ['was', 'of', 'most', ' ']
was, pcomp, ['of', 'most', ' ']
in, prep, ['was', 'of', 'most', ' ']
the, det, ['table', 'in', 'was', 'of', 'most', ' ']
old, amod, ['table', 'in', 'was', 'of', 'most', ' ']
table, pobj, ['in', 'was', 'of', 'most', ' ']
,, punct, [' ']
and, cc, [' ']
the, det, ['category', 'links', ' ']
category, nsubj, ['links', ' ']
links, conj, [' ']

With both versions of spaCy, sent.root identifies the whitespace token (id=2) as the root. However, in v3.1.4 this token's token.dep_ is ROOT, while in v3.2.1 it is dep, and no token in the sentence has ROOT as its token.dep_. Is it intended in spaCy 3.2.1 that a sentence can have a root even though no token carries the ROOT dependency label?
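The mismatch described above can also be checked programmatically. The following is a minimal sketch that is not part of the original report: the helper name `sentences_without_root_label` is hypothetical, and it relies only on `doc.sents` and `token.dep_`.

```python
def sentences_without_root_label(doc):
    """Return the sentences of `doc` in which no token has dep_ == "ROOT".

    `doc` is expected to be a parsed spaCy Doc; anything exposing `.sents`
    whose items iterate over tokens with a `.dep_` attribute also works.
    """
    return [
        sent
        for sent in doc.sents
        if not any(token.dep_ == "ROOT" for token in sent)
    ]
```

Calling it on `nlp(test_string)` from the example above should return the single sentence under v3.2.1 and an empty list under v3.1.4, if the outputs shown here are representative.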

Your Environment

honnibal commented 2 years ago

Thanks, I can confirm this is a bug. There are two problems here:

adrianeboyd commented 2 years ago

This is due to the new attribute ruler rule in v3.2.0 that sets the dependency label of every whitespace token to dep.

The stopgap solution would be to modify this rule so that it doesn't modify ROOT labels.

The actual solution would be to include data augmentation so that the parser in the trained pipelines handles whitespace tokens better. The training data for these pipelines currently contains no whitespace tokens.
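Until the rule itself is changed, one possible user-side workaround is to restore the label after the pipeline has run. This is only a sketch and not the fix adopted by the maintainers: the function name `restore_root_labels` is hypothetical, and it assumes that a root token is one that is its own head, which is how spaCy represents roots.

```python
def restore_root_labels(doc):
    """Set dep_ back to "ROOT" on every token that is its own head.

    Works on a parsed spaCy Doc; tokens need `.i`, `.head` and a writable
    `.dep_`. The doc is modified in place and returned, so the function has
    the shape of a pipeline component.
    """
    for token in doc:
        if token.head.i == token.i and token.dep_ != "ROOT":
            token.dep_ = "ROOT"
    return doc
```

One way to use it would be to register it with `@Language.component` and add it to the pipeline after `attribute_ruler`, or simply to call it on each `Doc` returned by `nlp`.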

adrianeboyd commented 2 years ago

I've updated this in our internal training data repo and attached the full English rules for reference, which can be used to update an existing pipeline (rename to .json):

import spacy
import srsly
nlp = spacy.load("en_core_web_sm")
patterns = srsly.read_json("ar_patterns.json")
nlp.remove_pipe("attribute_ruler")
ar = nlp.add_pipe("attribute_ruler")
ar.add_patterns(patterns)

There are two changes to the whitespace rules: one skips space tokens without any dep label (from another issue), and the other skips ROOT labels. Here's just the diff for reference, in case you'd like to update the same rules for another language:

     "attrs":{
       "TAG":"_SP",
       "POS":"SPACE",
-      "MORPH":"_",
+      "MORPH":"_"
+    },
+    "index":0
+  },
+  {
+    "patterns":[
+      [
+        {
+          "IS_SPACE":true,
+          "DEP":{"NOT_IN": ["", "ROOT"]}
+        }
+      ]
+    ],
+    "attrs":{
       "DEP":"dep"
     },
     "index":0
   }

ar_patterns.json.txt
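To make the effect of the updated pattern concrete, here is a plain-Python model of its logic. It does not use spaCy's matcher or attribute ruler; tokens are represented as hypothetical `(text, dep)` tuples, `apply_whitespace_rule` is an illustrative name, and `str.isspace()` stands in for `IS_SPACE`.

```python
def apply_whitespace_rule(tokens):
    """Mimic the updated rule: relabel whitespace tokens as "dep",
    except when their label is empty (no parse) or "ROOT"."""
    result = []
    for text, dep in tokens:
        if text.isspace() and dep not in ("", "ROOT"):
            dep = "dep"
        result.append((text, dep))
    return result


tokens = [("Well", "intj"), (" ", "ROOT"), ("\n", "punct"), (" ", "")]
print(apply_whitespace_rule(tokens))
# [('Well', 'intj'), (' ', 'ROOT'), ('\n', 'dep'), (' ', '')]
```

Only the third token is relabelled: it is whitespace and its label is neither empty nor ROOT. Under the v3.2.0 behaviour described in this thread, the second token would have lost its ROOT label as well.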

adrianeboyd commented 2 years ago

This will be fixed more generally through whitespace augmentation for the v3.3 models. The whitespace augmenter was added in #10170.
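As a rough illustration of what whitespace augmentation means for dependency-annotated training data, the toy sketch below inserts a whitespace token into a parsed sentence and keeps the head indices consistent. It is not spaCy's augmenter: the dict-based token format and the name `insert_whitespace` are hypothetical, and attaching the new token to the preceding token with the label dep is an assumption of this sketch.

```python
def insert_whitespace(tokens, position, whitespace=" "):
    """Return a copy of `tokens` with a whitespace token inserted at `position`.

    Each token is a dict with "text", "head" (index of its head) and "dep".
    The new token is attached to the preceding token with the label "dep",
    so it never becomes the root of the sentence.
    """
    if not 1 <= position <= len(tokens):
        raise ValueError("position must be between 1 and len(tokens)")
    # Head indices pointing at or past the insertion point move one to the right.
    shifted = [
        {**tok, "head": tok["head"] + 1 if tok["head"] >= position else tok["head"]}
        for tok in tokens
    ]
    space = {"text": whitespace, "head": position - 1, "dep": "dep"}
    return shifted[:position] + [space] + shifted[position:]
```

Training on examples produced along these lines would expose the parser to whitespace tokens that are never roots, which is the kind of signal the trained pipelines were missing according to the comments above.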

github-actions[bot] commented 2 years ago

This issue has been automatically closed because it was answered and there was no follow-up discussion.
