explosion / spaCy

💫 Industrial-strength Natural Language Processing (NLP) in Python
https://spacy.io
MIT License

morph reading in token is not merged properly when using merge_entities pipeline #12854

Closed lawctan closed 1 year ago

lawctan commented 1 year ago

How to reproduce the behaviour

import spacy
import json
import fileinput
from pprint import pprint

# prints one JSON object per token in each sentence

def process(nlp, texts):
    docs = list(nlp.pipe(texts, n_process=1, batch_size=2000))
    for doc in docs:

        for sent in doc.sents:
            for token in sent:
                tokenInfo = {
                    "idx": token.i,
                    "orth": token.orth_,
                    "pos": token.pos_,
                    "lemma": token.lemma_,
                    "norm": token.norm_,
                    "dep": token.dep_,
                    "morph": token.morph.to_json(),
                }
                print(json.dumps(tokenInfo, ensure_ascii=False))

nlp = spacy.load('ja_core_news_lg')

nlp.add_pipe("merge_subtokens")
nlp.add_pipe("merge_entities")

texts = []

for line in fileinput.input():
    texts.append(line.strip())

process(nlp, texts)

Command to test

echo "4月1日に試験があるので" | python parse-jap.py

returns

{"idx": 0, "orth": "4月1日", "pos": "NOUN", "lemma": "4月1日", "norm": "4月1日", "dep": "obl", "morph": "Reading=ツイタチ"}
{"idx": 1, "orth": "に", "pos": "ADP", "lemma": "に", "norm": "に", "dep": "case", "morph": "Reading=ニ"}
{"idx": 2, "orth": "試験", "pos": "NOUN", "lemma": "試験", "norm": "試験", "dep": "nsubj", "morph": "Reading=シケン"}
{"idx": 3, "orth": "が", "pos": "ADP", "lemma": "が", "norm": "が", "dep": "case", "morph": "Reading=ガ"}
{"idx": 4, "orth": "ある", "pos": "VERB", "lemma": "ある", "norm": "有る", "dep": "ROOT", "morph": "Inflection=五段-ラ行;連体形-一般|Reading=アル"}
{"idx": 5, "orth": "の", "pos": "SCONJ", "lemma": "の", "norm": "の", "dep": "mark", "morph": "Reading=ノ"}
{"idx": 6, "orth": "で", "pos": "AUX", "lemma": "だ", "norm": "だ", "dep": "fixed", "morph": "Inflection=助動詞-ダ;連用形-一般|Reading=デ"}

Note how for 4月1日 the output shows "morph": "Reading=ツイタチ". The merged token kept only the reading of 1日 and dropped the reading of 4月.


adrianeboyd commented 1 year ago

The retokenizer can merge different morph features like A=1 + B=2 -> A=1|B=2, but it doesn't know how to automatically merge multiple values for the same feature like A=1 + A=2 -> A=???, so it uses the value from one token instead of trying to merge them. I'd have to double-check to be sure, but I think the default is to take the value from the head token in the phrase, and if there's no parse then it's taken from the first token.
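The single-value behavior is easy to reproduce without the Japanese model. In this sketch, a blank English pipeline with hand-assigned morphs stands in for ja_core_news_lg, and the placeholder Reading values A and B are made up for illustration:

```python
import spacy

# A blank pipeline with hand-set morphs stands in for the Japanese
# model; the Reading values here are placeholders.
nlp = spacy.blank("en")
doc = nlp("New York")
doc[0].set_morph("Reading=A")
doc[1].set_morph("Reading=B")

# Merge the two tokens, as merge_entities does internally.
with doc.retokenize() as retokenizer:
    retokenizer.merge(doc[0:2])

# The merged token keeps at most one of the two Reading values;
# they are not concatenated into "AB".
print(doc[0].text, doc[0].morph)
```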

As a workaround, you can merge the values using your own custom method before retokenizing. Set the same value on all tokens in the entity/span to be sure that this value gets used for the new token:

from spacy.tokens import MorphAnalysis

span = doc[0:2]
# Concatenate the Reading values of every token in the span...
reading = "".join([token.morph.get("Reading")[0] for token in span])
# ...and set the combined reading on all of them, so the retokenizer
# ends up with the full reading no matter which token it copies from.
for token in span:
    morph_dict = token.morph.to_dict()
    morph_dict["Reading"] = reading
    token.morph = MorphAnalysis(nlp.vocab, morph_dict)
adrianeboyd commented 1 year ago

Let me move this to a discussion...