Closed hiroshi-matsuda-rit closed 5 years ago
Have you had a look at retokenization in spaCy?
That allows you to update the attributes of tokens such as POS, using retokenization.merge
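For reference, a minimal sketch of that API (assuming spaCy v2.1+ is installed; the blank English pipeline and the example sentence are just for illustration):

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("New York is big")

# merge the first two tokens into one and set its coarse-grained POS
with doc.retokenize() as retokenizer:
    retokenizer.merge(doc[0:2], attrs={"POS": "PROPN"})

print([t.text for t in doc])  # → ['New York', 'is', 'big']
print(doc[0].pos_)            # → PROPN
```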
Sure. I've been using retokenization APIs.
In GiNZA, I'm using a logic based on extended dependency labels, e.g. "obj_as_NOUN", to distinguish ambiguous POS. It also uses a virtual root token appended after the last token of each sentence to distinguish the POS of the real root token, e.g. "root_as_VERB".
https://github.com/megagonlabs/ginza/blob/develop/ja_ginza/parse_tree.py#L445 https://github.com/megagonlabs/ginza/blob/feature/apply_spacy_v2.1/ja_ginza/parse_tree.py#L433
This tricky logic is quite complicated and also reduces performance. I'd like to refactor it.
Thanks, @BreakBB
@BreakBB Actually this refers to the parser-based mechanism, which uses the subtok label. This is a bit different from the retokenization.
@hiroshi-matsuda-rit In the command-line interface, it should be as simple as adding --learn-tokens. The mechanism works like this:

When we create the GoldParse class, we receive a pair (doc, annotations), where the annotations include the gold-standard segmentation and the doc object contains the predicted tokenization. We then do a Levenshtein alignment between the two. The alignment is called in spacy/gold.pyx, and the main logic is in spacy/_align.pyx.
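A character-offset simplification of that alignment can be sketched in plain Python (illustrative only; spaCy's real implementation uses a Levenshtein alignment in spacy/_align.pyx, and the function name here is my own):

```python
def align_tokens(predicted, gold):
    """Map each predicted token to the gold token(s) covering it,
    assuming both tokenizations concatenate to the same string."""
    assert "".join(predicted) == "".join(gold)
    # index every character by the gold token it belongs to
    gold_of_char = []
    for gi, tok in enumerate(gold):
        gold_of_char.extend([gi] * len(tok))
    alignment, offset = [], 0
    for tok in predicted:
        covering = sorted(set(gold_of_char[offset:offset + len(tok)]))
        alignment.append(covering)
        offset += len(tok)
    return alignment

# two predicted tokens map to the same gold token -> they need merging
print(align_tokens(["New", "York", "is"], ["NewYork", "is"]))
# → [[0], [0], [1]]
```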
Predicted tokens that are part of the same gold token are assigned the dependency label subtok. The head for these subtok tokens will be the next word. This occurs in spacy/gold.pyx.

The parser learns to predict the subtok labels. Additional constraints on this label ensure that the parser can only predict subtok for length-1 arcs, and that subtokens cannot cross sentence boundaries.

After parsing, tokens connected by subtok arcs are merged with doc.retokenize(). This should be occurring in the merge_subtokens pipeline component in v2.1.4. In the next release, this will be moved into parser.postprocesses, to make the system more self-contained.

It sounds to me like your system would benefit from having several ROOT labels, which could be interpreted with different meanings. Currently the ROOT label is hard-coded, which prevents this.
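To illustrate the merging step on plain strings (a toy sketch, not spaCy's actual merge_subtokens component): every token labelled subtok is joined with the token that follows it, since the head of a subtok arc is the next word.

```python
def merge_subtok_spans(tokens, deps):
    """Join each run of 'subtok'-labelled tokens with the following
    token, mimicking what merging subtok arcs amounts to."""
    merged, buffer = [], []
    for token, dep in zip(tokens, deps):
        if dep == "subtok":
            buffer.append(token)  # keep accumulating the span
        else:
            merged.append("".join(buffer) + token)
            buffer = []
    return merged

print(merge_subtok_spans(["New", "York", "opened"],
                         ["subtok", "nsubj", "ROOT"]))
# → ['NewYork', 'opened']
```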
@honnibal I simply shared what I have found in the docs. Thanks for the clarification!
@honnibal Thank you so much for your precise description of the subtok concatenation procedure. I decided to replace GiNZA's POS disambiguation and retokenization procedures with spaCy's POS tagger and --learn-tokens, respectively. spaCy's train command works well with the -G option but does not work with SudachiTokenizer (without -G). It seems we should retokenize the dataset with the tokenizer in advance to avoid such inconsistencies. I encounter an error at the beginning of the first evaluation phase (just after the first training phase).
python -m spacy train ja ja_gsd-ud ja_gsd-ud-train.json ja_gsd-ud-dev.json -p tagger,parser -ne 2 -V 1.2.2 -pt dep,tag -v models/ja_gsd-1.2.1/ -VV
...
✔ Saved model to output directory
ja_gsd-ud/model-final
⠙ Creating best model...
Traceback (most recent call last):
File "/home/matsuda/.pyenv/versions/3.7.2/lib/python3.7/site-packages/spacy/cli/train.py", line 257, in train
losses=losses,
File "/home/matsuda/.pyenv/versions/3.7.2/lib/python3.7/site-packages/spacy/language.py", line 457, in update
proc.update(docs, golds, sgd=get_grads, losses=losses, **kwargs)
File "nn_parser.pyx", line 413, in spacy.syntax.nn_parser.Parser.update
File "nn_parser.pyx", line 519, in spacy.syntax.nn_parser.Parser._init_gold_batch
File "transition_system.pyx", line 86, in spacy.syntax.transition_system.TransitionSystem.get_oracle_sequence
File "arc_eager.pyx", line 592, in spacy.syntax.arc_eager.ArcEager.set_costs
ValueError: [E020] Could not find a gold-standard action to supervise the dependency parser. The tree is non-projective (i.e. it has crossing arcs - see spacy/syntax/nonproj.pyx for definitions). The ArcEager transition system only supports projective trees. To learn non-projective representations, transform the data before training and after parsing. Either pass `make_projective=True` to the GoldParse class, or use spacy.syntax.nonproj.preprocess_training_data.
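The projectivity condition that error refers to can be checked with a small helper (my own sketch, not spaCy's nonproj code): two arcs cross when exactly one endpoint of one arc lies strictly inside the span of the other.

```python
def is_projective(heads):
    """heads[i] is the index of token i's head; the root points to itself.
    Returns False if any two dependency arcs cross."""
    arcs = [(min(i, h), max(i, h)) for i, h in enumerate(heads) if h != i]
    for l1, r1 in arcs:
        for l2, r2 in arcs:
            # strictly interleaved endpoints = crossing arcs
            if l1 < l2 < r1 < r2:
                return False
    return True

print(is_projective([1, 1, 1]))     # simple projective tree → True
print(is_projective([2, 3, 2, 2]))  # arcs (0,2) and (1,3) cross → False
```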
I'd like to report how I solve this problem soon.
Anyway, I think a lot of applications around the world would be happy if they could use customized root labels.
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
Feature description
I'd like to implement a functionality that corrects tokenization errors (both boundaries and tags) with the parser. With this error-correction function, our Japanese language model will be able to resolve ambiguous POS tags (such as サ変名詞, which can be NOUN or VERB) and merge over-segmented tokens.
I found a related mention in the v2.1 release notes.
Could you please give me links to the source code doing this? @honnibal Can we apply "joint word segmentation and parsing" to a single (and possibly root) token span?