ajenhl / tacl

Tool for performing basic text analysis on the CBETA corpus
GNU General Public License v3.0

tacl stats can't rely on just the supplied results #29

Closed ajenhl closed 9 years ago

ajenhl commented 9 years ago

tacl stats currently generates the count of matching tokens by multiplying each n-gram's size by its number of occurrences and summing the products for each variant. Even when operating on reduced results, this is not guaranteed to be accurate, because two listed n-grams may still overlap. E.g., take the intersection in 2-grams of "the" and "heth": the shared 2-grams are "th" and "he", so the matching token count for the "the" text is four when it ought to be three.
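To make the overcount concrete, here is a minimal sketch of the size-times-occurrences calculation applied to that example. It is not tacl's code: `ngram_counts` and `naive_matching_tokens` are hypothetical helpers, and each character is assumed to be one token.

```python
from collections import Counter


def ngram_counts(text, n):
    """Count every n-gram in text (hypothetical helper; one character = one token)."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def naive_matching_tokens(text, shared, n):
    """Sum n * occurrences over the shared n-grams, as described above."""
    counts = ngram_counts(text, n)
    return sum(n * counts[gram] for gram in shared)


# The 2-grams common to both texts are "th" and "he".
shared = set(ngram_counts("the", 2)) & set(ngram_counts("heth", 2))

# "th" and "he" overlap on the "h" in "the", so that token is counted twice.
naive = naive_matching_tokens("the", shared, 2)
print(naive, "matching tokens reported for a text of", len("the"), "tokens")
```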

The statistics report must refer to the actual text to generate accurate statistics: substitute whitespace for all occurrences of each n-gram in the text, working from the largest n-gram to the smallest, count the tokens that remain (those that aren't whitespace), and subtract that count from the total number of tokens in the original text.
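A rough sketch of that idea, under stated assumptions: `match_stats` is a hypothetical function (not part of tacl), each non-whitespace character is treated as one token, and n-grams are plain strings. One deliberate change from the description above: instead of literally overwriting the text with whitespace, it marks matched positions on a separate mask. Blanking "th" in "the" first would hide the overlapping "he", so searching the unmodified text and recording coverage avoids that, and also makes the largest-to-smallest ordering unnecessary.

```python
def match_stats(text, ngrams):
    """Return (matching_tokens, total_tokens) for text.

    Hypothetical sketch: a token is any non-whitespace character, and a
    token counts as matching if at least one n-gram occurrence covers it.
    """
    # Drop whitespace so that positions correspond to tokens.
    tokens = "".join(ch for ch in text if not ch.isspace())
    covered = [False] * len(tokens)
    for gram in ngrams:
        if not gram:
            continue
        # Find every occurrence, including ones that overlap each other.
        start = tokens.find(gram)
        while start != -1:
            for i in range(start, start + len(gram)):
                covered[i] = True
            start = tokens.find(gram, start + 1)
    return sum(covered), len(tokens)


matching, total = match_stats("the", ["th", "he"])
print(matching, "of", total, "tokens match")
```

Because the total is counted from the same text, nothing beyond the text and the list of n-grams is needed in this sketch.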

Since the source text is being used, there is no need to supply the output of tacl counts to the StatisticsReport; the total number of tokens can be counted from the source text itself.