ajenhl / tacl

Tool for performing basic text analysis on the CBETA corpus
GNU General Public License v3.0

tacl stats often generates incorrect statistics #9

Closed ajenhl closed 11 years ago

ajenhl commented 11 years ago

tacl stats will often generate incorrect values for the number of matching tokens. When a matching piece of text is longer than the largest n-gram size in the database, it is represented in the (reduced) results as two or more overlapping n-grams of maximum size. The overlapping tokens are then counted multiple times, so the reported count is too high.
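A toy illustration of the overcount, not tacl's actual code: the passage, the maximum n-gram size of 3, and the helper `ngrams` are all made up for this example.

```python
def ngrams(tokens, n):
    """Return every contiguous n-gram of the token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Hypothetical matching passage of five tokens.
tokens = list("ABCDE")
# Hypothetical largest n-gram size stored in the database.
max_n = 3

# The passage is longer than max_n, so it shows up as overlapping
# n-grams of maximum size: ABC, BCD, CDE.
matched = ngrams(tokens, max_n)

# Summing the size of each matching n-gram counts B, C and D repeatedly.
naive_count = sum(len(gram) for gram in matched)

# The passage really contains only this many matching tokens.
actual_count = len(tokens)

print(naive_count, actual_count)
```

Here the naive sum is 9 while the passage has only 5 tokens, because the three n-grams overlap.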

To get an accurate value, the matches can be applied to the stripped text and counted from that.
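A minimal sketch of that idea, assuming the stripped text is available as a list of tokens and the matches as tuples of tokens. The function `count_matching_tokens` is hypothetical and is not part of tacl.

```python
def count_matching_tokens(text_tokens, match_ngrams):
    """Count tokens of the text covered by at least one matching n-gram.

    Each position is marked at most once, so overlapping n-grams do not
    inflate the total.
    """
    covered = [False] * len(text_tokens)
    for ngram in match_ngrams:
        ngram = tuple(ngram)
        n = len(ngram)
        # Mark every occurrence of this n-gram in the text.
        for i in range(len(text_tokens) - n + 1):
            if tuple(text_tokens[i:i + n]) == ngram:
                for j in range(i, i + n):
                    covered[j] = True
    return sum(covered)


# Made-up stripped text with a five-token matching passage in the middle.
text = list("XABCDEY")
# The same passage as it would appear in reduced results with a
# maximum n-gram size of 3.
matches = [("A", "B", "C"), ("B", "C", "D"), ("C", "D", "E")]

total = count_matching_tokens(text, matches)
print(total)
```

Marking positions rather than summing n-gram lengths is what keeps the shared tokens from being counted more than once. The scan is quadratic in the worst case, which is fine for a sketch but would likely need an index for a full corpus.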

ajenhl commented 11 years ago

With the addition of functionality to create fully extended results in 75cdd57, this is no longer a problem, though it does require that both an extended and a reduced matches file be provided.