[FIX] Statistics - Regex count in whole document, not only in tokens

PrimozGodec commented 1 year ago

Issue

The regex counter counts matches only within individual tokens, so it misses matches that span multiple words.
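To illustrate the problem, here is a minimal sketch in plain Python. It is not the widget's actual code: the function names `count_in_tokens` and `count_in_document` and the whitespace tokenization are assumptions made for the example.

```python
import re


def count_in_tokens(pattern, tokens):
    """Count regex matches token by token (the behaviour described in the issue)."""
    rx = re.compile(pattern)
    return sum(len(rx.findall(token)) for token in tokens)


def count_in_document(pattern, text):
    """Count regex matches in the full document text."""
    return len(re.findall(pattern, text))


text = "New York is larger than York"
tokens = text.split()  # assumed tokenization: split on whitespace

# A multi-word pattern can never match inside a single token.
multiword_in_tokens = count_in_tokens(r"New York", tokens)
multiword_in_document = count_in_document(r"New York", text)

# A single-word pattern gives the same count either way.
single_in_tokens = count_in_tokens(r"York", tokens)
single_in_document = count_in_document(r"York", text)
```

With this input the multi-word pattern is found once in the document text but never in the token list, while the single-word pattern is counted identically in both.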

Description of changes

As discussed with @ajdapretnar, I added a dropdown beside each statistic so that the user can decide whether the statistic is computed on tokens/n-grams or on the full document. Currently, it includes two options:

Preprocessed tokens - the statistic is computed on either tokens or n-grams, depending on which is more suitable for the statistic.
Documents - the statistic is computed on the full document text.
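The per-statistic choice can be pictured roughly as follows. This is only a sketch: `Source`, `select_input`, and `character_count` are hypothetical names chosen for illustration, not identifiers from this PR.

```python
from enum import Enum
from typing import List, Sequence


class Source(Enum):
    """Hypothetical stand-in for the dropdown options."""
    TOKENS = "Tokens"
    DOCUMENTS = "Documents"


def select_input(source: Source, text: str, tokens: Sequence[str]) -> List[str]:
    """Return the units a statistic is computed on for one document."""
    if source is Source.DOCUMENTS:
        return [text]        # the whole document as a single unit
    return list(tokens)      # preprocessed tokens (or n-grams)


def character_count(units: Sequence[str]) -> int:
    """Example statistic: total number of characters in the given units."""
    return sum(len(unit) for unit in units)


text = "ice cream"
tokens = ["ice", "cream"]  # assumed output of preprocessing

chars_in_document = character_count(select_input(Source.DOCUMENTS, text, tokens))
chars_in_tokens = character_count(select_input(Source.TOKENS, text, tokens))
```

The two results differ by one here because the space between the words exists only in the document text.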

Discussion

~Is Preprocessed tokens a good term, or do we have any other idea?~ Changed to Tokens
~Average word length is currently implemented only on documents since the name doesn't make sense on n-grams. Should we rename it to Average term length and apply it to documents and n-grams, so that it is word length on documents and n-gram length on n-grams?~ Renamed to Average term length and enabled for n-grams.
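A rough sketch of what "term length" means in the two cases, assuming length is measured in characters and that words are obtained by splitting on whitespace (both are assumptions for this example, and `average_term_length` is a hypothetical helper, not the function in the PR):

```python
from typing import Sequence


def average_term_length(terms: Sequence[str]) -> float:
    """Mean length in characters of the given terms; 0.0 for an empty input."""
    if not terms:
        return 0.0
    return sum(len(term) for term in terms) / len(terms)


document = "the quick brown fox"

# On documents, terms are words.
words = document.split()
avg_word_length = average_term_length(words)

# On n-grams, terms are the n-grams themselves (bigrams here).
bigrams = ["the quick", "quick brown", "brown fox"]
avg_bigram_length = average_term_length(bigrams)
```

The same function covers both cases, which is why a single name, Average term length, fits documents and n-grams alike.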

Includes

[X] Code changes
[X] Tests
[ ] Documentation

codecov-commenter commented 1 year ago

Codecov Report

Merging #1014 (6583069) into master (e7c360d) will increase coverage by 0.18%. Report is 7 commits behind head on master. The diff coverage is 96.03%.

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##           master    #1014      +/-   ##
==========================================
+ Coverage   82.18%   82.37%   +0.18%
==========================================
  Files          92       92
  Lines       12283    12381      +98
  Branches     1670     1690      +20
==========================================
+ Hits        10095    10199     +104
+ Misses       1880     1866      -14
- Partials      308      316       +8
```

PrimozGodec commented 10 months ago

/rebase
