Closed alecristia closed 6 years ago
As a matter of fact, I was not replicating them precisely, but they were quite close to what Elin had, somewhat better maybe. Comparing the syllable-level f-scores, Elin vs. my replication:

        Type   Token  Boundary
Elin    0.42   0.60   0.80
Replic  0.44   0.73   0.87
Alex, what kind of input unit did you use? For Bernstein ADS in Dec (2016?) it was the phone unit. What Gladys is talking about is with the syllable unit.
Well, I do have different results when using phrasal instead of baseline, however:
Hello,
This is not

wordseg-dibs -t $type -o dibs_${type}_prep_prep.txt prepared.txt prepared.txt

but

wordseg-dibs -t $type -o dibs_${type}_prep_prep.txt prepared.txt **tags.txt**

because the train file must have word separators.
It was pretty confusing in the docs and not checked in the code. Commit 8fd1addee3f6841d5698b6f48c50941737062ec1 fixes those two points.
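For intuition on why the train file needs word separators: DiBS decides whether to posit a boundary between two adjacent units from an estimated P(word boundary | diphone), and with separators present those probabilities can be counted directly. Here is a toy sketch of that idea (my simplification for illustration, not the actual wordseg implementation, which follows Daland & Pierrehumbert's formulation):

```python
from collections import defaultdict

def train_boundary_probs(utterances):
    """Estimate P(word boundary | diphone) from word-separated text.

    `utterances` is a list of utterances, each a list of words,
    each word a list of basic units (phones or syllables).
    """
    seen = defaultdict(int)          # diphone -> total occurrences
    at_boundary = defaultdict(int)   # diphone -> occurrences at a word edge
    for words in utterances:
        units = [u for w in words for u in w]
        # indices (into `units`) where a word boundary falls
        edges, i = set(), 0
        for w in words[:-1]:
            i += len(w)
            edges.add(i)
        for j in range(1, len(units)):
            dp = (units[j - 1], units[j])
            seen[dp] += 1
            if j in edges:
                at_boundary[dp] += 1
    return {dp: at_boundary[dp] / seen[dp] for dp in seen}

def segment(units, probs, threshold=0.5):
    """Insert a space wherever the estimated boundary probability > threshold."""
    out = [units[0]]
    for j in range(1, len(units)):
        if probs.get((units[j - 1], units[j]), 0.0) > threshold:
            out.append(' ')
        out.append(units[j])
    return ''.join(out)

# toy word-separated training data: "ab cd", "ab ab"
train = [[['a', 'b'], ['c', 'd']], [['a', 'b'], ['a', 'b']]]
probs = train_boundary_probs(train)
print(segment(['a', 'b', 'c', 'd'], probs))  # -> "ab cd"
```

If the train file were fed in without separators (e.g. prepared.txt), the `edges` sets would be empty, every diphone would get probability 0, and no boundary would ever be posited.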
With this modification, I obtained (on test/data/tagged.txt):
res_dibs_phone_baseline_prep_200.txt:token_precision 0.6308
res_dibs_phone_baseline_prep_prep.txt:token_precision 0.7084
res_dibs_phone_phrasal_prep_200.txt:token_precision 0.3243
res_dibs_phone_phrasal_prep_prep.txt:token_precision 0.3858
Actually there is a test here ensuring the results are replicated on test/data/tagged.txt.
Just running ../CDSwordSeg/algoComp/segment.py ./test/data/tagged.txt -a dibs
I got the expected results:
token_f-score  token_precision  token_recall  boundary_f-score  boundary_precision  boundary_recall
0.239          0.3243           0.1892        0.4804            0.7161              0.3614
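As a sanity check on numbers like these: each f-score is the harmonic mean of the corresponding precision and recall, which we can verify from the token columns above:

```python
def f_score(precision, recall):
    # harmonic mean of precision and recall
    return 2 * precision * recall / (precision + recall)

# token_precision and token_recall from the row above
print(round(f_score(0.3243, 0.1892), 3))  # -> 0.239, matching token_f-score
```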
Your bug seems to be related to #31, it should be fixed by commit 85603958e864e45e842fce3375f28d4a802797b6.
So for me all is good with dibs now, let me know if you disagree ;)
Oops, yes, sorry, I figured that out -- I should have updated my question.
I'm still trying to understand things - I fear I did something stupid (again), but if I haven't, then I really don't understand how dibs works. Check out this performance table comparing baseline versus phrasal, 200 lines versus whole-corpus training, in the new package and in the very initial version of dibs (i.e. via direct call). Several weird things, including no difference in performance for whole vs 200 lines only in phrasal; and performance too close between baseline and phrasal.
NEW PACKAGE

Alejandrinas-MacBook-Air:wordseg acristia$ wordseg-dibs -t baseline prepared.txt $thistag | wordseg-eval gold.txt
Alejandrinas-MacBook-Air:wordseg acristia$ wordseg-dibs -t baseline prepared.txt bit.txt | wordseg-eval gold.txt
Alejandrinas-MacBook-Air:wordseg acristia$ wordseg-dibs -t phrasal prepared.txt $thistag | wordseg-eval gold.txt
Alejandrinas-MacBook-Air:wordseg acristia$ wordseg-dibs -t phrasal prepared.txt bit.txt | wordseg-eval gold.txt

Results (first four columns from the NEW PACKAGE, last two from the OLD PACKAGE):

                    baseline  baseline   phrasal  phrasal    phrasal  phrasal
metric              +whole    +200lines  +whole   +200lines  +whole   +200lines
------------------  --------  ---------  -------  ---------  -------  ---------
type_fscore         0.5446    0.1813     0.5396   0.5397     0.1212   0.5401
token_fscore        0.7271    0.4088     0.7276   0.7276     0.09842  0.7276
type_precision      0.6335    0.1587     0.6323   0.6326     0.07962  0.6337
boundary_recall     1         0.7909     0.9999   0.9999     0.3043   1
boundary_fscore     0.8781    0.661      0.8779   0.8779     0.3867   0.878
token_precision     0.6615    0.3602     0.6619   0.6619     0.1235   0.6619
type_recall         0.4775    0.2114     0.4706   0.4706     0.2537   0.4706
token_recall        0.8072    0.4724     0.8076   0.8076     0.08181  0.8077
boundary_precision  0.7827    0.5678     0.7825   0.7825     0.5304   0.7825
If I remember correctly, Gladys did see a difference of performance for DiBS when using 200 lines versus the whole corpus with the wordseg package, right?
I did see a difference using baseline, but I never used phrasal, so I couldn't say
Hello,
I found and fixed a bug in dibs training at syllable level, sorry! The train file was always loaded at phone level, so I added a --unit argument to wordseg-dibs (as in wordseg-prep).
Here is what I obtain after the bugfix, on test/data/tagged.txt; we see a baseline/phrasal difference at syllable level:
level type #train fscore
------ ------ ------- -------
phone phrasal 20 0.1452
phone phrasal 200 0.239
phone baseline 20 0.1664
phone baseline 200 0.6152
syllable phrasal 20 0.665
syllable phrasal 200 0.6631
syllable baseline 20 0.665
syllable baseline 200 0.6656
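A toy illustration of how a unit mismatch like the pre-fix one can wipe out the baseline/phrasal difference (this is my assumption about the failure mode, not the actual wordseg internals): if the trained diphone statistics are over phones while the test input is in syllables, every lookup misses and no boundary is ever posited, whatever the training type produced the statistics:

```python
# hypothetical phone-level statistics, learned from a train file
# that was (wrongly) loaded at phone level
train_probs_phone = {('b', 'k'): 0.9}

# syllable-level test input: the diphone keys are syllable pairs,
# so they never match the phone-pair keys above
test_units = ['ba', 'ku', 'mi']

boundaries = [
    train_probs_phone.get((test_units[i - 1], test_units[i]), 0.0) > 0.5
    for i in range(1, len(test_units))
]
print(boundaries)  # [False, False]: no boundaries, regardless of training type
```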
Here is the script I used:
tags=/home/mathieu/dev/wordseg/test/data/tagged.txt
head -200 $tags > 200.txt
head -20 $tags > 20.txt

echo "level type #train fscore"
echo "------ ------ ------- -------"
for unit in phone syllable ; do
    wordseg-prep $tags -u $unit -g gold.txt -o prepared.txt
    for type in phrasal baseline ; do
        for train in 20 200; do
            fscore=$(wordseg-dibs -t $type -u $unit prepared.txt $train.txt | \
                wordseg-eval gold.txt | grep token_fscore | \
                sed -r 's/.*\t(.*)$/\1/g')
            echo "$unit $type $train $fscore"
        done
    done
done
I hope your results will be coherent now, let me know!
I reanalyzed Bernstein ADS, which yielded a token F score of 0.2487 prior to Dec 2017 on CDSwordseg. Now in wordseg I get 0.1173 (with first 200 lines, as before; even lower if whole corpus used for training). And further the scores for baseline and phrasal are identical:
res_dibs_baseline_prep_200.txt:token_precision 0.1173
res_dibs_baseline_prep_prep.txt:token_precision 0.06634
res_dibs_phrasal_prep_200.txt:token_precision 0.1173
res_dibs_phrasal_prep_prep.txt:token_precision 0.06634
Gladys, were you replicating Elin's results precisely? And what were they?
The code used for the test was: