explosion / spaCy

💫 Industrial-strength Natural Language Processing (NLP) in Python
https://spacy.io
MIT License

Correcting tokenization errors by parser #3818

Closed hiroshi-matsuda-rit closed 5 years ago

hiroshi-matsuda-rit commented 5 years ago

Feature description

I'd like to implement functionality that corrects tokenization errors (both boundaries and tags) using the parser. With this error correction, our Japanese language model would be able to resolve ambiguous POS tags (such as サ変名詞 as NOUN or VERB) and merge over-segmented tokens.

I found a related mention in the v2.1 release notes.

Allow parser to do joint word segmentation and parsing. If you pass in data where the tokenizer over-segments, the parser now learns to merge the tokens.

Could you please give me links to the source code that implements this? @honnibal, can we apply "joint word segmentation and parsing" to a single (and possibly root) token span?

BreakBB commented 5 years ago

Have you had a look at retokenization in spaCy?

That allows you to update token attributes such as POS via the retokenizer's merge method.
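For reference, retokenization goes through the Doc.retokenize context manager; a minimal example (using a blank English pipeline purely for illustration):

```python
import spacy

# Blank pipeline just for illustration; a loaded model works the same way.
nlp = spacy.blank("en")
doc = nlp("New York is big")

# Merge "New York" into one token and set its POS in the same step.
with doc.retokenize() as retokenizer:
    retokenizer.merge(doc[0:2], attrs={"POS": "PROPN"})

print([(t.text, t.pos_) for t in doc])
```

The merged span becomes a single token with the attributes given in attrs.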

hiroshi-matsuda-rit commented 5 years ago

Sure, I've been using the retokenization APIs.

In GiNZA, I use logic based on extended dependency labels (e.g. "obj_as_NOUN") to distinguish ambiguous POS tags, plus a virtual root token appended after the last token of each sentence to distinguish the POS of the real root token (e.g. "root_as_VERB").

https://github.com/megagonlabs/ginza/blob/develop/ja_ginza/parse_tree.py#L445 https://github.com/megagonlabs/ginza/blob/feature/apply_spacy_v2.1/ja_ginza/parse_tree.py#L433

This logic is quite complicated and also hurts performance, so I'd like to refactor it.
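The extended-label trick can be sketched in isolation (the helper names below are illustrative, not GiNZA's actual code): the parser predicts a combined label such as "obj_as_NOUN", which is split back into a dependency relation and a disambiguated POS after parsing.

```python
# Illustrative sketch of the extended-dependency-label trick (not GiNZA's
# actual implementation): a combined label carries both the dependency
# relation and the disambiguated POS, and is split apart after parsing.

def encode_label(dep: str, pos: str) -> str:
    """Pack a POS tag into the dependency label, e.g. ("obj", "NOUN") -> "obj_as_NOUN"."""
    return f"{dep}_as_{pos}"

def decode_label(label: str):
    """Split an extended label back into (dep, pos); plain labels keep POS empty."""
    if "_as_" in label:
        dep, _, pos = label.rpartition("_as_")
        return dep, pos
    return label, ""

print(decode_label("obj_as_NOUN"))   # → ('obj', 'NOUN')
print(decode_label("root_as_VERB"))  # → ('root', 'VERB')
```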

Thanks, @BreakBB

honnibal commented 5 years ago

@BreakBB Actually, that section refers to the parser-based mechanism, which uses the subtok label. It's a bit different from retokenization.

@hiroshi-matsuda-rit In the command-line interface, it should be as simple as adding --learn-tokens. The mechanism works like this: where the gold-standard segmentation merges tokens that the tokenizer splits, the parser is trained to connect the pieces with the special subtok label, and those subtok spans are merged into single tokens after parsing.

It sounds to me like your system would benefit from having several ROOT labels, which could be interpreted with different meanings. Currently the ROOT label is hard-coded, which prevents this.
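As a rough sketch of the subtok-based merging (a simplification under my own assumptions, not spaCy's actual transition-system code): pieces of an over-segmented word are linked by subtok arcs, and each chain is collapsed into one token after parsing.

```python
# Simplified sketch (not spaCy's internals): each arc between adjacent
# tokens carries a label, and "subtok" marks pieces of one word; merging
# collapses every subtok chain into a single surface token.

def merge_subtoks(tokens, arc_labels):
    """tokens: n surface strings; arc_labels: n-1 labels, where
    arc_labels[i] describes the arc between tokens i and i+1."""
    merged = [tokens[0]]
    for tok, label in zip(tokens[1:], arc_labels):
        if label == "subtok":
            merged[-1] += tok   # glue onto the previous piece
        else:
            merged.append(tok)
    return merged

# Over-segmented "外国人" (foreigner) becomes one token again:
print(merge_subtoks(["外", "国", "人", "は"], ["subtok", "subtok", "case"]))
# → ['外国人', 'は']
```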

BreakBB commented 5 years ago

@honnibal I simply shared what I have found in the docs. Thanks for the clarification!

hiroshi-matsuda-rit commented 5 years ago

@honnibal Thank you so much for your precise description of the subtok concatenation procedure. I've decided to replace GiNZA's POS disambiguation and retokenization procedures with spaCy's POS tagger and --learn-tokens, respectively. spaCy's train command works well with the -G option but does not work with SudachiTokenizer (without -G). It seems we should retokenize the dataset with the tokenizer in advance to avoid inconsistent segmentations. I encounter an error at the beginning of the first evaluation phase (just after the first training phase):

python -m spacy train ja ja_gsd-ud ja_gsd-ud-train.json ja_gsd-ud-dev.json -p tagger,parser -ne 2 -V 1.2.2 -pt dep,tag -v models/ja_gsd-1.2.1/ -VV
...
✔ Saved model to output directory                                                                                                                                                         
ja_gsd-ud/model-final
⠙ Creating best model...
Traceback (most recent call last):
  File "/home/matsuda/.pyenv/versions/3.7.2/lib/python3.7/site-packages/spacy/cli/train.py", line 257, in train
    losses=losses,
  File "/home/matsuda/.pyenv/versions/3.7.2/lib/python3.7/site-packages/spacy/language.py", line 457, in update
    proc.update(docs, golds, sgd=get_grads, losses=losses, **kwargs)
  File "nn_parser.pyx", line 413, in spacy.syntax.nn_parser.Parser.update
  File "nn_parser.pyx", line 519, in spacy.syntax.nn_parser.Parser._init_gold_batch
  File "transition_system.pyx", line 86, in spacy.syntax.transition_system.TransitionSystem.get_oracle_sequence
  File "arc_eager.pyx", line 592, in spacy.syntax.arc_eager.ArcEager.set_costs
ValueError: [E020] Could not find a gold-standard action to supervise the dependency parser. The tree is non-projective (i.e. it has crossing arcs - see spacy/syntax/nonproj.pyx for definitions). The ArcEager transition system only supports projective trees. To learn non-projective representations, transform the data before training and after parsing. Either pass `make_projective=True` to the GoldParse class, or use spacy.syntax.nonproj.preprocess_training_data.
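The E020 error comes from the projectivity constraint; a small self-contained check for crossing arcs (my own sketch, not spaCy's nonproj.pyx) looks like this:

```python
# Sketch of the crossing-arcs test behind E020 (not spaCy's actual code):
# a tree is projective iff no two dependency arcs cross.

def is_projective(heads):
    """heads[i] is the index of token i's head; the root points to itself."""
    arcs = [(min(i, h), max(i, h)) for i, h in enumerate(heads) if i != h]
    for l1, r1 in arcs:
        for l2, r2 in arcs:
            if l1 < l2 < r1 < r2:  # strictly interleaved endpoints = crossing
                return False
    return True

print(is_projective([1, 1, 1]))     # → True
print(is_projective([2, 3, 3, 3]))  # arc 0→2 crosses arc 1→3 → False
```

Transforming the data beforehand (e.g. with make_projective, as the error message suggests) makes every tree pass this check.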

I'll report how I solved this problem soon.

hiroshi-matsuda-rit commented 5 years ago

Anyway, I think many applications around the world would benefit from being able to use customized root labels.
