aalto-speech / morfessor

Morfessor is a tool for unsupervised and semi-supervised morphological segmentation
http://morpho.aalto.fi
BSD 2-Clause "Simplified" License

Morfessor Models Sizes #1

Closed aboSamoor closed 7 years ago

aboSamoor commented 10 years ago

I am using Morfessor with word counts generated from Wikipedia. I noticed that the larger the word count file is, the larger the model is. The pickle file is around 0.5 GiB.

Is there a correlation?

What do you think the best practice is?

svirpioj commented 10 years ago

This is typical behavior for Morfessor models: the larger the training data, the larger the morph lexicon and the longer the average morph length.

There are several options to reduce the model size:

More details and discussion can be found, for example, in this article: http://dspace.utlib.ee/dspace/handle/10062/17313

If your concern is just the size of the model file, you can try saving it in gzipped Morfessor 1.0 format. It is slower to load and doesn't store any training parameters, but the file should be smaller.
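A minimal sketch of saving in that text format from Python, assuming the Morfessor 2.0 API provides `MorfessorIO.write_segmentation_file()` and `model.get_segmentations()`, and that the I/O class compresses the output when the path ends in `.gz`. The helper name `save_gzipped_segmentations` and the file name in the usage note are made up for illustration.

```python
def save_gzipped_segmentations(model, path, io=None):
    """Write a trained model's segmentations as Morfessor 1.0-style text.

    Assumptions (not confirmed in this thread): the model exposes
    get_segmentations(), and MorfessorIO.write_segmentation_file()
    gzips the output when `path` ends in ".gz".
    """
    if io is None:
        # Lazy import so the helper can also be used with a custom I/O object.
        import morfessor
        io = morfessor.MorfessorIO()
    # Only the segmentations are written; training parameters are not stored.
    io.write_segmentation_file(path, model.get_segmentations())
    return path
```

Usage would be along the lines of `save_gzipped_segmentations(model, "model.segm.gz")` after training or loading a model.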

psmit commented 10 years ago

In the next version (which will be released in the coming months), there will be an option for storing a reduced model, that is, a model that can only be used for segmenting data.

aboSamoor commented 9 years ago

Is there any progress on the issue of reducing the size of the trained models?

psmit commented 9 years ago

Yes, we have implemented reduced models, and we have been using them internally for a long time. The release of Morfessor 2.1 should come soon, but until then you can already use this branch: https://github.com/phsmit/morfessor/tree/develop

On the command line there is the --save-reduced option; in the code it is model.make_segment_only().
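A rough sketch of the in-code route: `make_segment_only()` is the call named above, while `MorfessorIO.write_binary_model_file()` is assumed from the Morfessor 2.0 API and the helper name `save_reduced_model` is hypothetical.

```python
def save_reduced_model(model, path, io=None):
    """Strip a trained model down to segmentation-only form and save it.

    make_segment_only() is the method mentioned in this thread;
    write_binary_model_file() is assumed to exist on MorfessorIO.
    """
    if io is None:
        # Lazy import so a custom I/O object can be passed instead.
        import morfessor
        io = morfessor.MorfessorIO()
    # Modifies the model in place; afterwards it is meant for segmenting only,
    # so keep a full copy if further training is planned.
    model.make_segment_only()
    io.write_binary_model_file(path, model)
    return path
```

Since the reduction is applied to the model object itself, it is probably safest to call this as the last step after training.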

aboSamoor commented 9 years ago

It does indeed reduce the size of the models. It seems the option is already available in the PyPI package (Morfessor 2.0.2alpha1). Is it necessary to use the development branch?

Once I have trained a model, can I use the PyPI version to segment text, or do I still need the development branch for that?

I am developing a package that will use Morfessor as the backend for text segmentation, and I would like to use the PyPI package to manage my dependencies.

psmit commented 9 years ago

Ah, indeed. No need to use the development branch. Models from the develop and alpha branches should be interchangeable, but I can't guarantee it. We are thinking about more persistent model formats, but it is not easy...
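For a package that only needs to segment text, loading a saved model and segmenting words might look roughly like this. It assumes the Morfessor 2.0 API offers `MorfessorIO.read_binary_model_file()` and that `model.viterbi_segment(word)` returns a `(morphs, cost)` pair; the helper names `load_model` and `segment_words` are hypothetical.

```python
def load_model(path):
    """Load a pickled Morfessor model (assumed API: read_binary_model_file)."""
    import morfessor  # lazy import keeps segment_words usable on its own
    return morfessor.MorfessorIO().read_binary_model_file(path)


def segment_words(model, words):
    """Map each word to its list of morphs.

    Assumes model.viterbi_segment(word) returns (morphs, cost);
    only the morph list is kept.
    """
    return {word: model.viterbi_segment(word)[0] for word in words}
```

Whether a model saved by one version loads cleanly in another is not guaranteed, as noted above, so it may be worth pinning the Morfessor version used for both training and segmenting.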

bhashi12 commented 7 years ago

I've just downloaded Morfessor-2.0.2a4 on Ubuntu. I could not load a Morfessor 1.0 style text model; it throws a "no such directory exist" error. Where can I find this file?

psmit commented 7 years ago

@bhashi12 Sorry, I had not seen this question before. If the problem still persists, would you open a new issue?