Closed khonzoda closed 2 years ago
Thanks, I can confirm this is a bug. There are two problems here.
First, the whitespace token that ends up as the sentence root is labelled dep, not ROOT. This is due to the new attribute ruler rule in v3.2.0 that switches any whitespace relations to dep. The stopgap solution would be to modify this rule so that it doesn't modify ROOT labels.
Second, the parsers in the trained pipelines handle whitespace tokens poorly, because their training data doesn't contain any whitespace tokens. The actual solution would be to include data augmentation so that the parser handles whitespace tokens better.
I've updated this in our internal training data repo and attached the full English rules for reference, which can be used to update an existing pipeline (rename to .json):
import spacy
import srsly
nlp = spacy.load("en_core_web_sm")
patterns = srsly.read_json("ar_patterns.json")
nlp.remove_pipe("attribute_ruler")
ar = nlp.add_pipe("attribute_ruler")
ar.add_patterns(patterns)
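After swapping in the updated rules, one way to check the result is to scan for sentences in which no token carries the ROOT label. The helper below is only a sketch: sentences_missing_root_label is a hypothetical name, and it relies on nothing more than each token exposing a dep_ attribute, so the stand-in tokens here can be replaced by doc.sents from a real pipeline.

```python
from types import SimpleNamespace


def sentences_missing_root_label(sents):
    """Return the sentences in which no token has the dependency label ROOT."""
    return [sent for sent in sents if not any(tok.dep_ == "ROOT" for tok in sent)]


# Stand-in tokens for illustration only; with spaCy, pass doc.sents instead.
ok_sent = [
    SimpleNamespace(text="Hello", dep_="ROOT"),
    SimpleNamespace(text=" ", dep_="dep"),
]
bad_sent = [
    SimpleNamespace(text="Hi", dep_="intj"),
    SimpleNamespace(text=" ", dep_="dep"),
]

missing = sentences_missing_root_label([ok_sent, bad_sent])
print(len(missing))  # one sentence has no ROOT label
```

With the updated rules in place, the list returned for an affected text should be empty.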
There are two changes related to the whitespace rules: one that skips space tokens without any dep labels (from another issue) and one that skips ROOT labels. Here's just the diff for reference, in case you'd like to update the same rules for another language:
"attrs":{
"TAG":"_SP",
"POS":"SPACE",
- "MORPH":"_",
+ "MORPH":"_"
+ },
+ "index":0
+ },
+ {
+ "patterns":[
+ [
+ {
+ "IS_SPACE":true,
+ "DEP":{"NOT_IN": ["", "ROOT"]}
+ }
+ ]
+ ],
+ "attrs":{
"DEP":"dep"
},
"index":0
}
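The effect of the diff can be modelled in plain Python. This is a simplified sketch of the matching condition only, not spaCy's attribute ruler implementation: tokens are plain dicts, old_rule and new_rule are hypothetical names, and the old rule is assumed to have matched every whitespace token, since its pattern is not shown in the diff.

```python
def old_rule(token):
    # Assumed behaviour before the change: every whitespace token gets DEP "dep".
    if token["is_space"]:
        return {**token, "dep": "dep"}
    return token


def new_rule(token):
    # After the change: skip whitespace tokens whose DEP is unset ("") or "ROOT".
    if token["is_space"] and token["dep"] not in ("", "ROOT"):
        return {**token, "dep": "dep"}
    return token


root_space = {"text": "\n", "is_space": True, "dep": "ROOT"}
unparsed_space = {"text": " ", "is_space": True, "dep": ""}
other_space = {"text": " ", "is_space": True, "dep": "punct"}

print(old_rule(root_space)["dep"])      # the ROOT label is overwritten
print(new_rule(root_space)["dep"])      # the ROOT label is kept
print(new_rule(unparsed_space)["dep"])  # an unset label stays unset
print(new_rule(other_space)["dep"])     # any other label still becomes "dep"
```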
This will be fixed more generally through whitespace augmentation for the v3.3 models. The whitespace augmenter was added in #10170.
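To give a rough idea of what whitespace augmentation involves, here is a minimal sketch that inserts a whitespace token into an annotated sentence and keeps the head indices consistent. It is not the augmenter from #10170: insert_whitespace is a hypothetical helper, tokens are plain dicts with absolute head indices, and attaching the new token to the preceding token with the label dep is an assumption made for illustration.

```python
def insert_whitespace(tokens, position, space=" "):
    """Return a copy of tokens with a whitespace token inserted at position.

    Each token is a dict with "word", "head" (absolute index; the root points
    at itself) and "dep". The new token attaches to the token before it.
    """
    if not 1 <= position <= len(tokens):
        raise ValueError("position must be between 1 and len(tokens)")
    shifted = []
    for tok in tokens:
        # Heads at or after the insertion point move one slot to the right.
        head = tok["head"] + 1 if tok["head"] >= position else tok["head"]
        shifted.append({**tok, "head": head})
    space_tok = {"word": space, "head": position - 1, "dep": "dep"}
    return shifted[:position] + [space_tok] + shifted[position:]


sentence = [
    {"word": "I", "head": 1, "dep": "nsubj"},
    {"word": "agree", "head": 1, "dep": "ROOT"},
    {"word": ".", "head": 1, "dep": "punct"},
]
augmented = insert_whitespace(sentence, 1)
print([tok["word"] for tok in augmented])
print([tok["head"] for tok in augmented])
```

Training on examples like this exposes the parser to whitespace tokens that are never the root, which is the behaviour the pipelines were missing.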
How to reproduce the behaviour
With the new version of spaCy (3.2.1), it seems that some sentences do not have ROOT among the dependency labels of their tokens. Nonetheless, these sentences still have a root when sent.root is called. This behavior seems to occur when a whitespace token is parsed as the root. Consider the following example:
When I use spacy=3.2.1, these lines of code print out:
On the other hand, when I use spacy=3.1.4 I get:
With both versions of spaCy, calling sent.root identifies the whitespace (id=2) as root. However, in v3.1.4 this whitespace has ROOT as token.dep_, while in v3.2.1 token.dep_ is dep, and for no tokens in the sentence token.dep_ is ROOT. We wonder if the behavior of a sentence having a root when there are no tokens with the ROOT dependency relation marker is intended in spaCy 3.2.1.
Your Environment