bootphon / wordseg

A Python toolbox for text based word segmentation
https://docs.cognitive-ml.fr/wordseg
GNU General Public License v3.0

evaluation error #9

Closed by cainesap 7 years ago

cainesap commented 7 years ago

Hello,

The wordseg pipeline works fine for me with ARPAbet input (thanks again, great resource!)

However, with IPA input (e.g. from phonemizer / espeak), I encounter a problem:

If I run "cat segmented.puddle.txt | wordseg-eval gold.txt > eval.puddle.txt", I see the following error:

Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.6/bin/wordseg-eval", line 11, in <module>
    load_entry_point('wordseg==0.4.1', 'console_scripts', 'wordseg-eval')()
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/wordseg-0.4.1-py3.6.egg/wordseg/utils.py", line 68, in __call__
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/wordseg-0.4.1-py3.6.egg/wordseg/evaluate.py", line 174, in main
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/wordseg-0.4.1-py3.6.egg/wordseg/evaluate.py", line 131, in evaluate
IndexError: tuple index out of range

I wonder if it's to do with space separation in the output of wordseg-puddle? (I happen to be using puddle)

Line 1 of the phonemized file looks like this:

j uː ;eword v iː ;eword m ɔː ;eword k ʊ k ɪ z ;eword

Which means gold.txt looks like this:

juː viː mɔː kʊkɪz

And prepared.txt like this:

j uː v iː m ɔː k ʊ k ɪ z
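The relationship between the three lines above can be sketched in plain Python. This is only an illustration of the transformation described in this comment, not the wordseg-prep implementation; the function name `prepare_line` and its `word_sep` parameter are hypothetical.

```python
def prepare_line(tagged, word_sep=";eword"):
    """Split one phonemized line into a (prepared, gold) pair.

    Hypothetical helper for illustration only; it mimics the
    transformation described above, not wordseg's own code.
    """
    # Each chunk between word separators is one word, given as phones.
    words = [chunk.split() for chunk in tagged.split(word_sep)]
    words = [phones for phones in words if phones]  # drop empty chunks

    # prepared: every phone separated by a space, word boundaries removed
    prepared = " ".join(phone for phones in words for phone in phones)
    # gold: phones joined inside each word, words separated by a space
    gold = " ".join("".join(phones) for phones in words)
    return prepared, gold


line = "j uː ;eword v iː ;eword m ɔː ;eword k ʊ k ɪ z ;eword"
prepared, gold = prepare_line(line)
print(prepared)  # j uː v iː m ɔː k ʊ k ɪ z
print(gold)      # juː viː mɔː kʊkɪz
```

Note that neither output contains ";eword": the separator only appears in the phonemized input.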

However, segmented.puddle.txt has inconsistent spacing around ;eword delimiters:

juː;ewordviː;eword mɔː;eword kʊkɪz;eword

Is this the cause of the eval problem?

Andrew

mmmaat commented 7 years ago

Hi Andrew, there is certainly some code to improve in evaluate.py so that it handles this error properly. I'll look into it.

But there is another problem: your segmented.puddle.txt should use " " as the word separator, not ";eword", just as gold.txt does.

Do you remember how you obtained this segmentation output?
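A quick sanity check along these lines could catch such a mismatch before running the evaluation. This is a minimal sketch in plain Python and not part of wordseg; the function name `check_segmentation` and the choice of checks are assumptions made for illustration.

```python
def check_segmentation(segmented_lines, gold_lines, word_sep=";eword"):
    """Return a list of problems found in a segmented file.

    Hypothetical helper, not a wordseg function. It assumes both
    files use a single space as the word separator.
    """
    problems = []
    if len(segmented_lines) != len(gold_lines):
        problems.append(
            f"line count differs: {len(segmented_lines)} segmented "
            f"vs {len(gold_lines)} gold")

    for n, (seg, gold) in enumerate(zip(segmented_lines, gold_lines), 1):
        if word_sep in seg:
            # leftover tags mean the algorithm was fed unprepared input
            problems.append(f"line {n}: contains '{word_sep}'")
        elif seg.replace(" ", "") != gold.replace(" ", ""):
            # ignoring word boundaries, both lines must be identical
            problems.append(f"line {n}: symbols differ from gold")
    return problems


bad = ["juː;ewordviː;eword mɔː;eword kʊkɪz;eword"]
gold = ["juː viː mɔː kʊkɪz"]
print(check_segmentation(bad, gold))
# ["line 1: contains ';eword'"]
```

An empty list means the segmented text has no leftover separators and matches the gold text once spaces are ignored, so only the word boundaries differ.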

cainesap commented 7 years ago

My apologies! No need to fix anything. You are right: I was feeding the output of 'phonemizer' to 'wordseg-puddle' rather than the output of 'wordseg-prep'. My mistake, I'm sorry. I will script this to avoid the same problem in future. With the right file, all is well again!