bootphon / wordseg

A Python toolbox for text based word segmentation
https://docs.cognitive-ml.fr/wordseg
GNU General Public License v3.0

allow wordseg-stats to take as input a segmented file (output of segmentation) #41

Closed alecristia closed 6 years ago

alecristia commented 6 years ago

It could work something like this: if the only markers found are spaces, wordseg-stats prints only the word-based statistics, reports the others as NA, and perhaps emits a warning.

mmmaat commented 6 years ago

This is already implemented, but you have to specify empty separators for phones and syllables. For example, the command cat test/data/orthographic.txt | wordseg-stats -p "" -s "" -w " " -v gives:

2018-05-18 14:32:58,185 - wordseg-stats - WARNING - phone separator not defined, some stats ignored
2018-05-18 14:32:58,185 - wordseg-stats - WARNING - syllable separator not defined, some stats ignored
2018-05-18 14:32:58,185 - wordseg-stats - INFO - token separator is (word: " ")
2018-05-18 14:32:58,185 - wordseg-stats - INFO - loaded 301 utterances
2018-05-18 14:32:58,217 - wordseg-stats - INFO - parsed 1871 words
2018-05-18 14:32:58,218 - wordseg-stats - INFO - 5 most common words: ['the', 'you', 'to', 'and', 'a']
corpus nutts 301
corpus nutts_single_word 38
corpus mattr 0.9266523374529879
words tokens 1871
words types 552
words hapaxes 279
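
For readers who want to see what the word-level figures above mean, here is a minimal Python sketch that recomputes them from text containing only word separators. It is not wordseg's implementation: the function name `word_stats` and its `word_sep` parameter are hypothetical, blank lines are assumed to be skipped, and `mattr` is left out because its exact definition is not given in this thread.

```python
from collections import Counter


def word_stats(utterances, word_sep=" "):
    """Compute word-level statistics from utterances with word separators only.

    Hypothetical helper for illustration, not part of the wordseg API.
    """
    # Split each utterance into words, dropping empty strings left by
    # repeated or trailing separators.
    utts = [[w for w in line.strip().split(word_sep) if w] for line in utterances]
    # Assumption: utterances with no words at all are ignored.
    utts = [u for u in utts if u]

    # Frequency of each word type across the whole corpus.
    counts = Counter(w for u in utts for w in u)

    return {
        "nutts": len(utts),
        "nutts_single_word": sum(1 for u in utts if len(u) == 1),
        "tokens": sum(counts.values()),
        "types": len(counts),
        # Hapaxes are word types that occur exactly once.
        "hapaxes": sum(1 for c in counts.values() if c == 1),
    }


if __name__ == "__main__":
    sample = ["the dog", "the cat runs", "hello"]
    for name, value in word_stats(sample).items():
        print(name, value)
```

Because no phone or syllable separator is involved, every statistic here depends only on splitting at the word separator, which is why the same approach applies to segmentation output and to orthographic text.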

I just added the following to the wordseg-stats --help message:

To analyze a segmented text or a text in orthographic form (i.e. with
word separators only), you must define empty phone and syllable
separators (see the token separation arguments below).