generate AG unigram on the basis of corpus

bootphon / wordseg

A Python toolbox for text based word segmentation

https://docs.cognitive-ml.fr/wordseg

GNU General Public License v3.0


generate AG unigram on the basis of corpus #35

Closed alecristia closed 6 years ago

alecristia commented 6 years ago

The top is always:

    1 1 Sentence --> Colloc0s
    1 1 Colloc0s --> Colloc0
    1 1 Colloc0s --> Colloc0 Colloc0s
    Colloc0 --> Phonemes
    1 1 Phonemes --> Phoneme
    1 1 Phonemes --> Phoneme Phonemes

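followed by lines like:

    1 1 Phoneme --> XX

where XX is a possible unit. To find all possible units, do something like:

    cat prepared.txt | tr ' ' '\n' | sort -u

(The units have to be sorted before duplicates are removed, since uniq only collapses adjacent identical lines.)

There will be as many such lines as there are distinct units in the prepared corpus.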
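For illustration, the recipe above can be sketched in Python as follows. This is a minimal sketch, not the code wordseg uses internally: the function name `build_colloc0_grammar` and the two-utterance sample corpus are hypothetical, and the sketch assumes the prepared corpus has one utterance per line with units separated by whitespace.

```python
def build_colloc0_grammar(lines):
    """Return the text of a Colloc0 grammar for prepared utterances.

    `lines` is any iterable of strings, one utterance per string, with
    units separated by whitespace (an assumption about the prepared text).
    """
    # Fixed top of the grammar, copied from the rules listed in this issue.
    header = [
        "1 1 Sentence --> Colloc0s",
        "1 1 Colloc0s --> Colloc0",
        "1 1 Colloc0s --> Colloc0 Colloc0s",
        "Colloc0 --> Phonemes",
        "1 1 Phonemes --> Phoneme",
        "1 1 Phonemes --> Phoneme Phonemes",
    ]

    # Collect the distinct units; str.split() without arguments ignores
    # repeated spaces and trailing newlines, so no empty unit is produced.
    units = sorted({unit for line in lines for unit in line.split()})

    # One terminal rule per distinct unit.
    rules = ["1 1 Phoneme --> " + unit for unit in units]
    return "\n".join(header + rules) + "\n"


# Hypothetical two-utterance corpus, for demonstration only.
corpus = ["dh ax d ao g", "ax k ae t"]
grammar = build_colloc0_grammar(corpus)
print(grammar, end="")
```

To run it on a real file, the same function can be given an open file object, for example `build_colloc0_grammar(open("prepared.txt"))`.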

mmmaat commented 6 years ago

This is already done in Python but not exposed in the bash command (see https://github.com/bootphon/wordseg/blob/c77b46cf6926fc732b5f50a39698822fbc5bbe9e/wordseg/algos/ag.py#L212).

I can add an option to the bash command, for instance --generate-grammar or something like that. What do you think?

alecristia commented 6 years ago

sounds terrific!

mmmaat commented 6 years ago

The grammar and category arguments are now optional. Use them as: cat prep.txt | wordseg-ag --grammar file.lt --category Colloc0. When they are not specified, a Colloc0 grammar is generated automatically. I also updated the tutorial with the new syntax.