aalto-speech / morfessor

Morfessor is a tool for unsupervised and semi-supervised morphological segmentation
http://morpho.aalto.fi
BSD 2-Clause "Simplified" License

Where is the detailed documentation of the training data format? #20

Closed (jarodtang closed this issue 4 years ago)

jarodtang commented 4 years ago

Hi there,

I tried to craft some simple training data like this:

design de sign, de sign
gender gen der, gen der
bilingual bi lingual, bi lingual
biography bio graphy, bio graphy

with this list of test words:

design
gender
bilingual
biography

and got this result:

 morfessor -t td1.txt -S model.segm -T text.txt 
Reading corpus from 'td1.txt'...
Detected utf-8 encoding
Done.
Compounds in training data: 16 types / 16 tokens
Starting batch training
Epochs: 0   Cost: 344.6809466060173
.................
Epochs: 1   Cost: 206.03260380373735
.................
Epochs: 2   Cost: 206.0326038037374
Done.
Epochs: 2
Final cost: 206.0326038037374
Training time: 0.017s
Saving segmentations to 'model.segm'...
Done.
Segmenting test data...
Reading corpus from 'text.txt'...
de sign
gen der
bi lingual
bi o graphy
Done.

Done.

whereas the expected result is:

de sign
gen der
bi lingual
bio graphy

My question is

-R Jarod

Waino commented 4 years ago

The specifications for the data formats can be found in the online documentation: https://morfessor.readthedocs.io/en/latest/filetypes.html

It seems like you have written an annotation file (although I don't see the point of the repeated identical segmentation alternatives). Annotation files (specified with --annotations) are additional data used for semi-supervised training. The main training data file is a corpus or a word count list specified with --traindata.
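
To make the split between the two files concrete, here is a minimal sketch in plain Python (standard library only) that takes the four lines from the question, writes the bare words to a corpus file and the de-duplicated segmentations to an annotation file, and prints a command line that passes each file to its own option. The file names corpus.txt and annotations.txt and the helper split_annotations are made up for illustration; only the --traindata and --annotations options and the -S/-T flags come from this thread. A realistic corpus would presumably need to be far larger than four words.

```python
from pathlib import Path

# The four lines from the question. They follow the annotation-file layout:
# "<word> <morph> <morph>, <alternative segmentation>"
ANNOTATED = """\
design de sign, de sign
gender gen der, gen der
bilingual bi lingual, bi lingual
biography bio graphy, bio graphy
"""


def split_annotations(text):
    """Return (corpus_words, annotation_lines) from annotation-style text."""
    words, annotations = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # First token is the word; the rest is a comma-separated list of analyses.
        word, _, analyses = line.partition(" ")
        # Drop repeated identical alternatives, keeping first-seen order.
        unique = list(
            dict.fromkeys(a.strip() for a in analyses.split(",") if a.strip())
        )
        words.append(word)
        annotations.append(f"{word} {', '.join(unique)}")
    return words, annotations


words, annotations = split_annotations(ANNOTATED)

# Hypothetical file names: one word per line for the corpus,
# one "word segmentation" line per word for the annotations.
Path("corpus.txt").write_text("\n".join(words) + "\n", encoding="utf-8")
Path("annotations.txt").write_text("\n".join(annotations) + "\n", encoding="utf-8")

# Suggested invocation, assuming the files written above.
print(
    "morfessor --traindata corpus.txt --annotations annotations.txt "
    "-S model.segm -T text.txt"
)
```

With this layout the annotations only guide semi-supervised training; the words themselves still have to appear in the corpus or word list given to --traindata.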