The current intersect code sorts the labelled groups by the number of texts in each group, so that the smallest group ends up in the innermost sub-query. The database already records the total token count of each text, and that count is a much more accurate reflection of the number of unique n-grams in a group, so use it for sorting instead.
This should be particularly useful when performing corpus-wide text-to-text comparisons.
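
A minimal sketch of the proposed ordering, to make the difference concrete. The schema here is an assumption made for illustration only: a hypothetical `Text` table with `name`, `label` and `token_count` columns, and a hypothetical helper `labels_by_token_count`. The real table and column names, and the way the sorted labels feed into the nested sub-queries, may well differ.

```python
import sqlite3


def labels_by_token_count(conn: sqlite3.Connection) -> list:
    """Return labels ordered from fewest to most total tokens.

    Assumes a hypothetical Text(name, label, token_count) table; the
    label name is used as a tie-breaker so the order is deterministic.
    """
    rows = conn.execute(
        "SELECT label, SUM(token_count) AS total "
        "FROM Text GROUP BY label "
        "ORDER BY total ASC, label ASC"
    ).fetchall()
    return [label for label, _total in rows]


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with made-up token counts."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE Text (name TEXT, label TEXT, token_count INTEGER)"
    )
    conn.executemany(
        "INSERT INTO Text VALUES (?, ?, ?)",
        [
            ("a1", "A", 120000),  # one very long text
            ("b1", "B", 500),     # three short texts
            ("b2", "B", 700),
            ("b3", "B", 300),
            ("c1", "C", 9000),
            ("c2", "C", 8000),
        ],
    )
    return conn


if __name__ == "__main__":
    print(labels_by_token_count(build_demo_db()))
```

With this made-up data, sorting by number of texts would treat A (one text) as the smallest group, even though it holds far more tokens than B (three short texts). Sorting by summed token count puts B first instead, which is the group that should go in the innermost sub-query.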