aalto-speech / morfessor

Morfessor is a tool for unsupervised and semi-supervised morphological segmentation
http://morpho.aalto.fi
BSD 2-Clause "Simplified" License

KeyError #10

Closed: Gldkslfmsd closed this issue 6 years ago

Gldkslfmsd commented 6 years ago

Hello, I'm getting this issue:

p3/bin/morfessor -t en-cs/train.en.tok --num-morph-types 50000 -S morf-models/morf-model.train.en-cs.50k.en -s morf-model.train.en-cs.50k.pickle.en
INFO:morfessor.io:Reading corpus from 'en-cs/train.en.tok'...
INFO:morfessor.io:Detected utf-8 encoding
INFO:morfessor.io:Done.
INFO:morfessor.baseline:Compounds in training data: 1938261 types / 1938261 tokens
INFO:morfessor.baseline:Starting batch training
INFO:morfessor.baseline:Epochs: 0       Cost: 75567655.89912468
.......................................................ERROR:morfessor:Fatal Error <class 'KeyError'> 'lhjij'
Traceback (most recent call last):
  File "p3/bin/morfessor", line 22, in <module>
    main(sys.argv[1:])
  File "p3/bin/morfessor", line 13, in main
    morfessor.main(args)
  File "/lnet/spec/work/people/machacek/morf-seg-nmt/p3/lib/python3.4/site-packages/morfessor/cmd.py", line 435, in main
    args.finish_threshold, args.maxepochs)
  File "/lnet/spec/work/people/machacek/morf-seg-nmt/p3/lib/python3.4/site-packages/morfessor/baseline.py", line 595, in train_batch
    segments = self._recursive_optimize(w, *algorithm_params)
  File "/lnet/spec/work/people/machacek/morf-seg-nmt/p3/lib/python3.4/site-packages/morfessor/baseline.py", line 299, in _recursive_optimize
    constructions += self._recursive_split(part)
  File "/lnet/spec/work/people/machacek/morf-seg-nmt/p3/lib/python3.4/site-packages/morfessor/baseline.py", line 312, in _recursive_split
    rcount, count = self._remove(construction)
  File "/lnet/spec/work/people/machacek/morf-seg-nmt/p3/lib/python3.4/site-packages/morfessor/baseline.py", line 124, in _remove
    rcount, count, splitloc = self._analyses[construction]
KeyError: 'lhjij'

Morfessor (2.0.3)

The input file is the tokenized English side of CzEng. Is this usage correct?

svirpioj commented 6 years ago

There is nothing wrong with the command, so the problem must be in the data. Can you show an example of what en-cs/train.en.tok looks like? A minimal example that reproduces the error would be great.

Gldkslfmsd commented 6 years ago

Thanks for the reply. Here it is:

machacek@cosmos:/net/work/people/machacek/morf-seg-nmt$ cat en-cs/s
The Tanguts called their own state " phiow ¹ -bjij ² -lhjij-lhjij ² " which translates as " The Great State of the White and the Lofty . "
Since it was located in the west , the Chinese name is Xi-Xia ( 西夏 ) , literally " Western Xia , " and thus that name is often used in Sinological literature .
machacek@cosmos:/net/work/people/machacek/morf-seg-nmt$ p3/bin/morfessor -t en-cs/s
INFO:morfessor.io:Reading corpus from 'en-cs/s'...
INFO:morfessor.io:Detected utf-8 encoding
INFO:morfessor.io:Done.
INFO:morfessor.baseline:Compounds in training data: 46 types / 46 tokens
INFO:morfessor.baseline:Starting batch training
INFO:morfessor.baseline:Epochs: 0   Cost: 961.810623090051
.........................ERROR:morfessor:Fatal Error <class 'KeyError'> 'lhjij'
Traceback (most recent call last):
  File "p3/bin/morfessor", line 22, in <module>
    main(sys.argv[1:])
  File "p3/bin/morfessor", line 13, in main
    morfessor.main(args)
  File "/lnet/spec/work/people/machacek/morf-seg-nmt/p3/lib/python3.4/site-packages/morfessor/cmd.py", line 435, in main
    args.finish_threshold, args.maxepochs)
  File "/lnet/spec/work/people/machacek/morf-seg-nmt/p3/lib/python3.4/site-packages/morfessor/baseline.py", line 595, in train_batch
    segments = self._recursive_optimize(w, *algorithm_params)
  File "/lnet/spec/work/people/machacek/morf-seg-nmt/p3/lib/python3.4/site-packages/morfessor/baseline.py", line 299, in _recursive_optimize
    constructions += self._recursive_split(part)
  File "/lnet/spec/work/people/machacek/morf-seg-nmt/p3/lib/python3.4/site-packages/morfessor/baseline.py", line 312, in _recursive_split
    rcount, count = self._remove(construction)
  File "/lnet/spec/work/people/machacek/morf-seg-nmt/p3/lib/python3.4/site-packages/morfessor/baseline.py", line 124, in _remove
    rcount, count, splitloc = self._analyses[construction]
KeyError: 'lhjij'

svirpioj commented 6 years ago

Thanks! This looks like a bug in how forced splits around certain characters (by default, hyphens) are handled. It affects specific patterns such as "-lhjij-lhjij" (or, more generally, (\F.{2,}).*\1, where \F stands for any character in the forced-split list).

While we are fixing this, you can use --forcesplit "" to disable forced splitting for hyphens.
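
To check whether other corpora contain tokens of the kind described above, the pattern can be turned into a small scan. This is only a rough sketch, not part of Morfessor: the function name find_risky_tokens is made up for illustration, it assumes tokens are separated by whitespace, and it assumes the forced-split list is the default single hyphen. A match means the token has the shape described in this thread; it does not prove that training will fail on it.

```python
import re


def find_risky_tokens(lines, force_split_chars="-"):
    """Return the sorted tokens matching (\\F.{2,}).*\\1.

    \\F is any character in force_split_chars (hypothetical helper;
    the default "-" mirrors the default forced-split list mentioned
    in this thread). force_split_chars must be non-empty.
    """
    if not force_split_chars:
        raise ValueError("force_split_chars must not be empty")
    # Build a character class from the forced-split characters.
    char_class = "[" + "".join(re.escape(c) for c in force_split_chars) + "]"
    # A forced-split character followed by at least two characters,
    # repeated again later in the same token.
    pattern = re.compile("(" + char_class + ".{2,}).*\\1")
    risky = set()
    for line in lines:
        for token in line.split():
            if pattern.search(token):
                risky.add(token)
    return sorted(risky)


sample = ['The Tanguts called their own state " phiow ¹ -bjij ² -lhjij-lhjij ² "']
print(find_risky_tokens(sample))  # prints ['-lhjij-lhjij']
```

To scan a whole corpus, pass an open file object (for example one opened with encoding="utf-8") instead of the sample list.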

Gldkslfmsd commented 6 years ago

> While we are fixing this, you can use --forcesplit "" to disable forced splitting for hyphens.

Does it produce exactly the same output for all other files with and without this option? I want all my corpora to be processed in exactly the same way. Do I have to repeat the training?

svirpioj commented 6 years ago

> Does it produce exactly the same output for all other files with and without this option? I want all my corpora to be processed in exactly the same way. Do I have to repeat the training?

The model will naturally be somewhat different with and without forced splits, although hyphens are in any case split in most contexts. But forced splits are applied only during training, so once you have a model file, the option does not affect the Viterbi segmentations produced by the model.

I assume that you are using the output for machine translation. In that case I would not use forced splits on hyphens anyway, but let the model decide whether to leave frequent word parts with hyphens unsegmented.

svirpioj commented 6 years ago

Fixed in 2.0.4.