ajenhl / tacl

Tool for performing basic text analysis on the CBETA corpus
GNU General Public License v3.0

Use total token length for sorting labelled groups #13

Closed ajenhl closed 10 years ago

ajenhl commented 10 years ago

The current intersect code sorts the labelled groups by the number of texts in each group, so that the smallest group ends up in the innermost sub-query. Since the total number of tokens in each text is recorded in the database, and is a much more accurate reflection of the number of unique n-grams in a group, the sort should use each group's total token count instead.

This should be particularly useful when performing corpus-wide text-to-text comparisons.
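
A rough sketch of the difference between the two orderings, not the actual tacl code. It assumes a hypothetical `Text` table with `label` and `token_count` columns; the table name, column names, function names, and sample rows are all illustrative assumptions.

```python
import sqlite3


def labels_by_text_count(conn):
    """Order labels by how many texts each has (the current approach)."""
    rows = conn.execute(
        "SELECT label, COUNT(*) AS total FROM Text "
        "GROUP BY label ORDER BY total ASC, label ASC"
    ).fetchall()
    return [label for label, _ in rows]


def labels_by_token_count(conn):
    """Order labels by the summed token count of their texts (the proposal)."""
    rows = conn.execute(
        "SELECT label, SUM(token_count) AS total FROM Text "
        "GROUP BY label ORDER BY total ASC, label ASC"
    ).fetchall()
    return [label for label, _ in rows]


# Hypothetical schema and made-up data: label A holds one long text,
# label B holds three short ones.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Text ("
    "id INTEGER PRIMARY KEY, name TEXT, label TEXT, token_count INTEGER)"
)
conn.executemany(
    "INSERT INTO Text (name, label, token_count) VALUES (?, ?, ?)",
    [
        ("text-1", "A", 50000),
        ("text-2", "B", 1000),
        ("text-3", "B", 2000),
        ("text-4", "B", 1500),
    ],
)

by_text_count = labels_by_text_count(conn)
by_token_count = labels_by_token_count(conn)
print(by_text_count)
print(by_token_count)
```

With this data, sorting by number of texts treats A as the smallest group even though it contains far more tokens than B; sorting by summed token count puts B first, which is the group that would then go in the innermost sub-query.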