Closed by lawctan 1 year ago
The retokenizer can merge different morph features, like A=1 + B=2 -> A=1|B=2, but it doesn't know how to automatically merge multiple values for the same feature, like A=1 + A=2 -> A=???, so it uses the value from one token instead of trying to merge them. I'd have to double-check to be sure, but I think the default is to take the value from the head token in the phrase, and if there's no parse, it's taken from the first token.
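To make the two cases concrete, here is a minimal pure-Python sketch that works on plain feature strings. It is not spaCy's implementation: `parse_feats` and `merge_feats` are hypothetical helpers, and the "keep the first value seen" rule for conflicts only stands in for whichever token the retokenizer actually picks.

```python
def parse_feats(feats: str) -> dict:
    """Parse a feature string such as 'A=1|B=2' into a dict."""
    if not feats or feats == "_":
        return {}
    return dict(item.split("=", 1) for item in feats.split("|"))


def merge_feats(feats_per_token: list) -> tuple:
    """Union the features of several tokens.

    Returns the merged feature string and a dict of conflicts
    (features that had more than one distinct value). For a conflict,
    the first value seen is kept; this is an illustrative choice only.
    """
    merged = {}
    conflicts = {}
    for feats in feats_per_token:
        for name, value in parse_feats(feats).items():
            if name not in merged:
                merged[name] = value
            elif merged[name] != value:
                # Same feature, different value: no obvious way to combine.
                conflicts.setdefault(name, [merged[name]]).append(value)
    merged_str = "|".join(f"{k}={v}" for k, v in sorted(merged.items()))
    return merged_str, conflicts


print(merge_feats(["A=1", "B=2"]))  # different features: plain union
print(merge_feats(["A=1", "A=2"]))  # same feature: conflict is reported
```

Different features combine without ambiguity, while two values for the same feature need a rule that only the caller can choose.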
As a workaround, you can merge the values using your own custom method before retokenizing. Set the same value on all tokens in the entity/span to be sure that this value gets used for the new token:
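from spacy.tokens import MorphAnalysis

# Assumes `nlp` is the loaded pipeline and `doc` is its output.
span = doc[0:2]
# Concatenate the first "Reading" value of each token in the span.
reading = "".join([token.morph.get("Reading")[0] for token in span])
for token in span:
    # Give every token the combined reading so the merged token inherits it.
    morph_dict = token.morph.to_dict()
    morph_dict["Reading"] = reading
    token.morph = MorphAnalysis(nlp.vocab, morph_dict)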
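One caveat with the snippet above: indexing with `[0]` raises an `IndexError` for any token that has no `Reading` value. Below is a rough sketch of a more tolerant variant. It works on plain dicts that stand in for the result of `token.morph.to_dict()`, so it runs without spaCy. The helper names `combine_feature` and `with_shared_feature` are hypothetical, and the reading `シガツ` for 4月 is an assumed example value, not output taken from this issue.

```python
def combine_feature(morph_dicts, feature, sep=""):
    """Join one feature's values across tokens, skipping tokens that lack it."""
    values = [d[feature] for d in morph_dicts if feature in d]
    return sep.join(values)


def with_shared_feature(morph_dicts, feature, sep=""):
    """Return copies of the dicts where every token carries the combined value.

    If no token has the feature, the dicts are copied unchanged.
    """
    combined = combine_feature(morph_dicts, feature, sep)
    if not combined:
        return [dict(d) for d in morph_dicts]
    return [{**d, feature: combined} for d in morph_dicts]


# Stand-ins for token.morph.to_dict() on the tokens of a span
# (illustrative values; the third token has no "Reading" at all).
token_morphs = [{"Reading": "シガツ"}, {"Reading": "ツイタチ"}, {"Case": "Nom"}]
for morph_dict in with_shared_feature(token_morphs, "Reading"):
    print(morph_dict)
```

Each resulting dict could then be passed to `MorphAnalysis(nlp.vocab, morph_dict)` and assigned back to its token, as in the snippet above, before retokenizing.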
Let me move this to a discussion...
How to reproduce the behaviour
Command to test
echo "4月1日に試験があるので" | python parse-jap.py
returns
Note how for 4月1日, it shows "morph": "Reading=ツイタチ". It removed the reading from 4月.
Your Environment