tacl stats will often generate incorrect values for the number of matching tokens. When a matching piece of text is longer than the largest n-gram size in the database, it is represented in the (reduced) results as two or more overlapping n-grams of maximum size. The tokens in the overlaps are counted once per n-gram that contains them, so the total is inflated.
To get an accurate value, the matches can be applied to the stripped text and counted from that.
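A minimal sketch of the difference between the two counting approaches, assuming matches are available as (start position, n-gram) pairs. The function names and the data layout are hypothetical illustrations, not tacl's actual API or results format:

```python
def ngrams_of(tokens, n):
    """Return (start, n-gram) pairs for every n-gram of size n in tokens."""
    return [(i, tuple(tokens[i:i + n])) for i in range(len(tokens) - n + 1)]


def naive_token_count(matches):
    """Sum the n-gram sizes; tokens shared by overlapping n-grams are counted repeatedly."""
    return sum(len(ngram) for _, ngram in matches)


def covered_token_count(matches, text_length):
    """Mark each matched position in the text once, then count the marked positions."""
    covered = [False] * text_length
    for start, ngram in matches:
        for offset in range(len(ngram)):
            covered[start + offset] = True
    return sum(covered)


# Hypothetical stripped text of 8 tokens; the matching passage is "bcdef"
# (5 tokens, starting at position 1), and the largest n-gram size is 3.
text = list("abcdefgh")
passage_start, max_n = 1, 3
matches = [
    (passage_start + i, ngram)
    for i, ngram in ngrams_of(text[1:6], max_n)
]

print(naive_token_count(matches))               # 9: bcd + cde + def
print(covered_token_count(matches, len(text)))  # 5: b, c, d, e, f
```

Here the five-token passage is reported as three overlapping 3-grams, so summing n-gram sizes gives 9, while marking positions in the text gives the true count of 5.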
With the addition of functionality to create fully extended results in 75cdd57, this is no longer a problem, though it does require that both an extended and a reduced matches file be provided.