dibs baseline/phrasal not distinguished & performance too low

alecristia commented 6 years ago

I reanalyzed Bernstein ADS, which yielded a token F score of 0.2487 prior to Dec 2017 on CDSwordseg. Now in wordseg I get 0.1173 (with first 200 lines, as before; even lower if whole corpus used for training). And further the scores for baseline and phrasal are identical:

res_dibs_baseline_prep_200.txt:token_precision 0.1173 res_dibs_baseline_prep_prep.txt:token_precision 0.06634 res_dibs_phrasal_prep_200.txt:token_precision 0.1173 res_dibs_phrasal_prep_prep.txt:token_precision 0.06634

Gladys, were you replicating Elin's results precisely? And what were they?

The code used for the test was:

for type in phrasal baseline ; do

        wordseg-dibs -t $type -o dibs_${type}_prep_200.txt prepared.txt first200.txt
        wordseg-dibs -t $type -o dibs_${type}_prep_prep.txt prepared.txt prepared.txt
        wordseg-dibs -t $type -o dibs_${type}_200_prep.txt first200.txt prepared.txt
done

for j in dibs*.txt ; do
        cat $j | wordseg-eval gold.txt > res_${j}
done

GladB commented 6 years ago

As a matter of fact, I was not replicating them precisely, but they were quite close to what Elin had, somewhat better maybe. If comparing the f-score syllable Elin/replication, I had Type Token Boundary Elin 0.42 0.60 0.80 Replic 0.44 0.73 0.87

elinlarsen commented 6 years ago

Alex, what kind of input unit did you use ? For bernstein ADS in dec (2016?) it was the phone unit. What gladys is talking about is with the syllable unit. Le 1 mars 2018 à 16:44, GladB notifications@github.com<mailto:notifications@github.com> a écrit :

As a matter of fact, I was not replicating them precisely, but they were quite close to what Elin had, somewhat better maybe. If comparing the f-score syllable Elin/replication, I had Type Token Boundary Elin 0.42 0.60 0.80 Replic 0.44 0.73 0.87

— You are receiving this because you were assigned. Reply to this email directly, view it on GitHubhttps://github.com/bootphon/wordseg/issues/30#issuecomment-369633205, or mute the threadhttps://github.com/notifications/unsubscribe-auth/AI7ayo1EXLqk1D5BXtKXB8T_LYY9ZYneks5taBdJgaJpZM4SYex8.

GladB commented 6 years ago

Well I do have different results when using phrasal instead of baseline, however :

on syllables, I have the same results whether I use python or bash
on phonemes, I find different results using python than when using bash (using python, I have about the same results as Elin had) (it doesn't make any sense, I know, so I must have made a mistake somewhere)

mmmaat commented 6 years ago

Hello,

@alecristia

This is not wordseg-dibs -t $type -o dibs_${type}_prep_prep.txt prepared.txt prepared.txt but wordseg-dibs -t $type -o dibs_${type}_prep_prep.txt prepared.txt **tags.txt**, the train file must have word separators.

It was pretty confusing in the docs and not checked in the code. The commit 8fd1addee3f6841d5698b6f48c50941737062ec1 fix those two points.

With the modif, I obtained (on test/data/tagged.txt)

res_dibs_phone_baseline_prep_200.txt:token_precision    0.6308
res_dibs_phone_baseline_prep_prep.txt:token_precision   0.7084
res_dibs_phone_phrasal_prep_200.txt:token_precision 0.3243
res_dibs_phone_phrasal_prep_prep.txt:token_precision    0.3858

CDSWordSeg replication

Actually there is a test here ensuring the results are replicated on test/data/tagged.txt.

Just running ../CDSwordSeg/algoComp/segment.py ./test/data/tagged.txt -a dibs I got the expected results:

token_f-score   token_precision token_recall    boundary_f-score    boundary_precision  boundary_recall
0.239   0.3243  0.1892  0.4804  0.7161  0.3614

@GladB

Your bug seems to be related to #31, it should be fixed by commit 85603958e864e45e842fce3375f28d4a802797b6.

mmmaat commented 6 years ago

So for me all is good with dibs now, let me know if you disagree ;)

alecristia commented 6 years ago

Oops, yes, sorry, I figured that out -- I should have updated my question.

I'm still trying to understand things - I fear I did something stupid (again), but if I haven't, then I really don't understand how dibs works. Check out this performance table comparing baseline versus phrasal, 200 lines versus whole-corpus training, in the new package and in the very initial version of dibs (i.e. via direct call). Several weird things, including no difference in performance for whole vs 200 lines only in phrasal; and performance too close between baseline and phrasal.

NEW PACKAGE

OLD PACKAGE

Alejandrinas-MacBook-Air:wordseg acristia$ wordseg-dibs -t baseline prepared.txt $thistag | wordseg-eval gold.txt

Alejandrinas-MacBook-Air:wordseg acristia$ wordseg-dibs -t baseline prepared.txt bit.txt | wordseg-eval gold.txt

Alejandrinas-MacBook-Air:wordseg acristia$ wordseg-dibs -t phrasal prepared.txt $thistag | wordseg-eval gold.txt

Alejandrinas-MacBook-Air:wordseg acristia$ wordseg-dibs -t phrasal prepared.txt bit.txt | wordseg-eval gold.txt

Baseline+whole

baseline+200lines

phrasal+whole

phrasal+200lines

phrasal+whole

phrasal+200lines

type_fscore

0,5446

0,1813

0,5396

0,5397

0,1212

0,5401

token_fscore

0,7271

0,4088

0,7276

0,09842

0,7276

type_precision

0,6335

0,1587

0,6323

0,6326

0,07962

0,6337

boundary_recall

1

0,7909

0,9999

0,3043

1

boundary_fscore

0,8781

0,661

0,8779

0,3867

0,878

token_precision

0,6615

0,3602

0,6619

0,1235

0,6619

type_recall

0,4775

0,2114

0,4706

0,2537

0,4706

token_recall

0,8072

0,4724

0,8076

0,08181

0,8077

boundary_precision

0,7827

0,5678

0,7825

0,5304

0,7825

elinlarsen commented 6 years ago

If I remember well, Gladys did see a difference of performance for DiBS when using 200 lines versus whole corpus with the wordseg package, right ?

GladB commented 6 years ago

I did see a difference using baseline, but I never used phrasal, so I couldn't say

mmmaat commented 6 years ago

Hello,

I found and fixed a bug in dibs training at syllable level, sorry! The train file was always loaded at phone level, so I added a --unit argument to wordseg-dibs (as in wordseg-prep).

Here what I obtain after the bugfix, on test/data/tagged.txt, we see a difference baseline/phrasal at syllable level:

level     type      #train   fscore
------    ------    -------  -------
phone     phrasal   20       0.1452
phone     phrasal   200      0.239
phone     baseline  20       0.1664
phone     baseline  200      0.6152
syllable  phrasal   20       0.665
syllable  phrasal   200      0.6631
syllable  baseline  20       0.665
syllable  baseline  200      0.6656

Here the script I used:

tags=/home/mathieu/dev/wordseg/test/data/tagged.txt
head -200 $tags > 200.txt
head -20 $tags > 20.txt

echo "level type #train fscore"
echo "------ ------ ------- -------"

for unit in phone syllable ; do
    wordseg-prep $tags -u $unit -g gold.txt -o prepared.txt
    for type in phrasal baseline ; do
        for train in 20 200; do
            fscore=$(wordseg-dibs -t $type -u $unit prepared.txt $train.txt | \
                            wordseg-eval gold.txt | grep token_fscore | \
                            sed -r 's/.*\t(.*)$/\1/g')
            echo "$unit $type $train $fscore"
        done
    done
done

I hope your results will be coherent know, let me know!

bootphon / wordseg

dibs baseline/phrasal not distinguished & performance too low #30