Closed by alecristia 6 years ago
This is already implemented, but you have to specify empty separators for phones and syllables. The command
cat test/data/orthographic.txt | wordseg-stats -p "" -s "" -w " " -v
gives:
2018-05-18 14:32:58,185 - wordseg-stats - WARNING - phone separator not defined, some stats ignored
2018-05-18 14:32:58,185 - wordseg-stats - WARNING - syllable separator not defined, some stats ignored
2018-05-18 14:32:58,185 - wordseg-stats - INFO - token separator is (word: " ")
2018-05-18 14:32:58,185 - wordseg-stats - INFO - loaded 301 utterances
2018-05-18 14:32:58,217 - wordseg-stats - INFO - parsed 1871 words
2018-05-18 14:32:58,218 - wordseg-stats - INFO - 5 most common words: ['the', 'you', 'to', 'and', 'a']
corpus nutts 301
corpus nutts_single_word 38
corpus mattr 0.9266523374529879
words tokens 1871
words types 552
words hapaxes 279
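If you want to sanity-check the word-level numbers above without wordseg, here is a minimal sketch in plain Python. It assumes one utterance per line with words separated by whitespace. `word_stats` is a hypothetical helper, not part of wordseg, and it does not compute mattr.

```python
from collections import Counter


def word_stats(utterances):
    """Return word-level counts for whitespace-separated utterances.

    Hypothetical helper for illustration only, not wordseg's code.
    """
    # Split each non-empty utterance into its words.
    split = [u.split() for u in utterances if u.strip()]
    # Count how many times each word form occurs in the whole corpus.
    counts = Counter(w for words in split for w in words)
    return {
        'nutts': len(split),
        'nutts_single_word': sum(1 for words in split if len(words) == 1),
        'tokens': sum(counts.values()),
        'types': len(counts),
        # Hapaxes are word types that occur exactly once.
        'hapaxes': sum(1 for c in counts.values() if c == 1),
    }


stats = word_stats(['the dog', 'the cat sat', 'look'])
```

On a real file you would pass the open file object (or its lines) instead of the toy list.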
I just added this to the wordseg-stats --help message:
To analyze a segmented text or a text in orthographic form (i.e. with
word separators only), you must define empty phone and syllable
separators (see the token separation arguments below).
It could be something like: if the only markers found are spaces, then wordseg-stats prints only the word-based statistics, with the others set to NA, and perhaps a one-sentence warning.
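A rough sketch of what that suggestion could look like, in plain Python. This is not how wordseg-stats is implemented: `available_levels` and `report` are hypothetical names, and the check simply tests whether each separator is defined and occurs in the text.

```python
import logging

log = logging.getLogger('stats-sketch')


def available_levels(text, phone=None, syllable=None, word=' '):
    """Map each level to True if its separator is defined and found in text."""
    seps = {'phones': phone, 'syllables': syllable, 'words': word}
    levels = {}
    for level, sep in seps.items():
        # An empty or missing separator means that level cannot be analyzed.
        found = bool(sep) and sep in text
        if not found:
            log.warning('%s separator not found, %s stats set to NA', level, level)
        levels[level] = found
    return levels


def report(text, **seps):
    """Return word tokens when available and 'NA' for the unavailable levels."""
    levels = available_levels(text, **seps)
    out = {level: 'NA' for level, ok in levels.items() if not ok}
    if levels['words']:
        out['words'] = len(text.split())
    return out


result = report('the dog\nthe cat sat')
```

With only spaces in the input, `result` holds a word count plus NA for phones and syllables, and one warning is logged per missing level.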