aalto-speech / morfessor

Morfessor is a tool for unsupervised and semi-supervised morphological segmentation
http://morpho.aalto.fi
BSD 2-Clause "Simplified" License

Sample data lines for Turkish or English #7

Closed: ahmetax closed this issue 7 years ago

ahmetax commented 7 years ago

I want to use Morfessor to split Turkish words into stem + suffixes. I don't have a sample data set, so I must create a new one for training. Can you give me some explanatory example lines, in Turkish or English, that should be in the data set? Thanks.

Waino commented 7 years ago

Have you looked at Morfessor FlatCat (https://github.com/aalto-speech/flatcat)? It may be more suitable for your needs if you want to distinguish between stems and suffixes.

Some Turkish data is available from http://morpho.aalto.fi/events/morphochallenge2010/datasets.shtml .

Note that the words in that data set have been lowercased and mapped onto Latin characters: the letters specific to the Turkish language are replaced by capital letters. You may need to apply the same transformation to your own data before training or use.
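To illustrate the kind of transformation described above, here is a minimal Python sketch. The exact letter mapping (`ç→C`, `ğ→G`, `ı→I`, `ö→O`, `ş→S`, `ü→U`) is an assumption and should be checked against the data set's own description. The helper names (`turkish_lower`, `encode_word`, `build_word_list`) are hypothetical. The `count word` output lines are also an assumption about what a Morfessor word-list training file looks like, so verify the expected input format in the Morfessor documentation before relying on it.

```python
import unicodedata
from collections import Counter

# ASSUMPTION: capital-letter replacements for Turkish-specific letters.
# Verify this mapping against the data set's own documentation.
TURKISH_TO_ASCII = {"ç": "C", "ğ": "G", "ı": "I", "ö": "O", "ş": "S", "ü": "U"}


def turkish_lower(text):
    """Lowercase with Turkish casing rules (dotted İ -> i, dotless I -> ı)."""
    # Normalize first so letters such as "ş" are single precomposed characters.
    text = unicodedata.normalize("NFC", text)
    # str.lower() alone would turn "I" into "i", which is wrong for Turkish.
    return text.replace("İ", "i").replace("I", "ı").lower()


def encode_word(word):
    """Lowercase a word, then replace Turkish-specific letters with capitals."""
    lowered = turkish_lower(word)
    return "".join(TURKISH_TO_ASCII.get(ch, ch) for ch in lowered)


def build_word_list(text):
    """Return 'count word' lines for the whitespace-separated words in text."""
    counts = Counter(encode_word(w) for w in text.split())
    return [f"{count} {word}" for word, count in sorted(counts.items())]


if __name__ == "__main__":
    for line in build_word_list("Evler evler Işık çocuğu"):
        print(line)
```

Whatever mapping you use for training must also be applied to any word you later segment, and reversed on the output if you need the original Turkish spelling back.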