alecristia closed 6 years ago
Yes, you're right: "unique" means the number of types that occur exactly once. We can change the name if you want.
I was thinking about removing "nword_hapax" and "nword_types" from the "corpus" category, as they are redundant with the ones in "words". What do you think?
Finally, on that example, I guess you may have an issue with the token separators: you have more syllable types than word types, and the same number of "uniques"... Weird.
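To make the definition concrete, here is a minimal sketch of how tokens, types and hapaxes relate, assuming the input is just a flat list of tokens. The function name `count_stats` and the example sentence are hypothetical illustrations, not part of wordseg's API.

```python
from collections import Counter


def count_stats(tokens):
    """Count tokens, types and hapaxes in a flat list of tokens.

    Hypothetical helper for illustration only, not wordseg's implementation.
    """
    counts = Counter(tokens)
    return {
        # total number of occurrences
        "tokens": sum(counts.values()),
        # number of distinct items
        "types": len(counts),
        # number of types that occur exactly once
        "hapaxes": sum(1 for c in counts.values() if c == 1),
    }


example = "the dog saw the cat and the bird".split()
print(count_stats(example))
```

In this toy sentence only "the" is repeated, so every other type counts as a hapax.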
Could we call them "hapax" instead of unique?
I agree, let's get rid of redundant things... Do you think we should also remove the derived estimates (i.e. measures that can be recalculated)? I'd vote yes.
Thinking about it some more, it makes a lot of sense that few if any phones occur only once (phones tend to be reused). I haven't found the file those stats came from, but I cannot reproduce that pattern: uniques in words and syllables don't tend to be the same (they have no reason to be!). So it looks like all is good there.
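As a rough check that the derived measures really can be recalculated, here is a sketch that rebuilds the ratios from the raw counts alone. The function name `derived_ratios` is hypothetical; the ratio key names and the counts are taken from the sample posted in this thread.

```python
import math


def derived_ratios(stats):
    """Recompute the ratio measures from raw token/type/utterance counts.

    Hypothetical helper: key names follow the sample output in this thread.
    """
    nutts = stats["corpus"]["nutts"]
    out = {}
    for level in ("phones", "syllables", "words"):
        out[level] = {
            "token/types": stats[level]["tokens"] / stats[level]["types"],
            "tokens/utt": stats[level]["tokens"] / nutts,
        }
    # Cross-level ratios: how many smaller units per larger unit
    out["phones"]["tokens/syllable"] = (
        stats["phones"]["tokens"] / stats["syllables"]["tokens"])
    out["phones"]["tokens/word"] = (
        stats["phones"]["tokens"] / stats["words"]["tokens"])
    out["syllables"]["tokens/word"] = (
        stats["syllables"]["tokens"] / stats["words"]["tokens"])
    return out


# Raw counts copied from the sample in this thread
raw = {
    "phones": {"tokens": 7119, "types": 40},
    "syllables": {"tokens": 2847, "types": 339},
    "words": {"tokens": 2299, "types": 301},
    "corpus": {"nutts": 637},
}
ratios = derived_ratios(raw)
# Compare against one value reported in the sample
print(math.isclose(ratios["words"]["token/types"], 7.637873754152824))
```

Since every ratio is a quotient of two counts that stay in the output, dropping them loses no information.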
Finally, with some reordering, the output of stats would ideally look like { "phones": { "tokens": 4870, "types": 39, "hapaxes": 0 }, "syllables": { "tokens": 1934, "types": 387, "hapaxes": 144 }, "words": { "tokens": 1597, "types": 354, "hapaxes": 147 }, "corpus": { "nutts": 393, "nutts_single_word": 81, "mattr": 0.860176433522369, "entropy": 0.02052822488004237 } }
OK Alex, I'll modify wordseg-stats to match those specs and let you know!
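To illustrate the proposed layout, here is a sketch that reshapes output in the old format into the new one: "uniques" is renamed to "hapaxes", the ratio measures are dropped, and the redundant word counts are removed from "corpus". The function name `to_proposed_layout` is hypothetical and this is not how wordseg-stats itself is implemented; the input values are a subset of the sample posted in this thread.

```python
def to_proposed_layout(old):
    """Reshape old-style stats into the proposed layout.

    Hypothetical helper for illustration, not wordseg code.
    """
    new = {}
    for level in ("phones", "syllables", "words"):
        new[level] = {
            "tokens": old[level]["tokens"],
            "types": old[level]["types"],
            # "uniques" is renamed to "hapaxes"
            "hapaxes": old[level]["uniques"],
        }
    # Keep only the corpus-level measures that are not duplicated in "words"
    corpus_keys = ("nutts", "nutts_single_word", "mattr", "entropy")
    new["corpus"] = {key: old["corpus"][key] for key in corpus_keys}
    return new


# Subset of the old-format sample from this thread
old = {
    "phones": {"uniques": 1, "token/types": 177.975, "tokens": 7119, "types": 40},
    "syllables": {"uniques": 83, "tokens": 2847, "types": 339},
    "words": {"uniques": 83, "tokens": 2299, "types": 301},
    "corpus": {
        "nword_hapax": 83, "nword_types": 301, "mattr": 0.7118829183049362,
        "nutts_single_word": 124, "entropy": 0.023474054557757,
        "nutts": 637, "nword_tokens": 2299,
    },
}
new = to_proposed_layout(old)
print(new["phones"])
```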
Is it odd that uniques = 1 for phones, and that nword_hapax = syllable uniques = word uniques = 83 in the sample below? Does "unique" mean the number of types that occur exactly once?
sample
{ "phones": { "tokens/word": 3.0965637233579817, "uniques": 1, "token/types": 177.975, "tokens": 7119, "tokens/syllable": 2.500526870389884, "tokens/utt": 11.175824175824175, "types": 40 }, "corpus": { "nword_hapax": 83, "nword_types": 301, "mattr": 0.7118829183049362, "nutts_single_word": 124, "entropy": 0.023474054557757, "nutts": 637, "nword_tokens": 2299 }, "syllables": { "tokens/word": 1.23836450630709, "uniques": 83, "token/types": 8.398230088495575, "tokens": 2847, "tokens/utt": 4.469387755102041, "types": 339 }, "words": { "tokens": 2299, "tokens/utt": 3.609105180533752, "uniques": 83, "token/types": 7.637873754152824, "types": 301 } }