Use total token length for sorting labelled groups

The current intersect code sorts the labelled groups by the number of texts in each group, so that the smallest group ends up in the innermost sub-query. The total token count of each text is already recorded in the database, and the summed token count of a group is a much more accurate reflection of the number of unique n-grams in that group than its number of texts. Use the summed token count for sorting instead.
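To illustrate the difference, here is a minimal sketch using an in-memory SQLite database. The table layout (a `Text` table with `label` and `token_count` columns) and the function names are assumptions made for this example and may not match the actual schema or code; the point is only that the two orderings can disagree.

```python
import sqlite3


def labels_by_text_count(conn):
    """Order labels by number of texts, smallest group first (current behaviour)."""
    rows = conn.execute(
        "SELECT label, COUNT(*) AS total FROM Text "
        "GROUP BY label ORDER BY total, label"
    ).fetchall()
    return [label for label, _ in rows]


def labels_by_token_count(conn):
    """Order labels by summed token count, smallest group first (proposed)."""
    rows = conn.execute(
        "SELECT label, SUM(token_count) AS total FROM Text "
        "GROUP BY label ORDER BY total, label"
    ).fetchall()
    return [label for label, _ in rows]


# Hypothetical schema and data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Text (id INTEGER PRIMARY KEY, label TEXT, token_count INTEGER)"
)
conn.executemany(
    "INSERT INTO Text (label, token_count) VALUES (?, ?)",
    [
        ("A", 100), ("A", 150), ("A", 50),  # three short texts, 300 tokens
        ("B", 5000),                        # one long text, 5000 tokens
        ("C", 900), ("C", 800),             # two texts, 1700 tokens
    ],
)

by_count = labels_by_text_count(conn)
by_tokens = labels_by_token_count(conn)
print(by_count)
print(by_tokens)
```

In this made-up data, group B has the fewest texts but by far the most tokens, so sorting by text count would treat it as the smallest group, while sorting by token count would pick group A.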

This should be particularly useful when performing corpus-wide text-to-text comparisons.

ajenhl / tacl #13