[FIX] Corpus - remove dictionary and fix wrong types count on subsampled corpus

biolab / orange3-text

🍊 :page_facing_up: Text Mining add-on for Orange3

Issue

Fixes https://github.com/biolab/orange3-text/issues/920

Corpus stores a dictionary (a Gensim Dictionary) that is created on first use and then cached. The problem is that the dictionary stays the same after Corpus is subsampled (that is, after a new corpus is created from a subset of documents), even though the number of unique tokens changes. Most problematically, the dictionary was used in several places in the add-on to obtain the number of unique tokens in Corpus, so that number was incorrect once the corpus had been subsampled (the issue reported in #920).
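The stale-cache behaviour described above can be reproduced in a few lines of plain Python. The sketch below is not the add-on's code: `TinyCorpus`, its `dictionary` property and its `subset` method are hypothetical stand-ins, and a plain `set` plays the role of the Gensim Dictionary.

```python
class TinyCorpus:
    """Hypothetical stand-in for Corpus; not the orange3-text class."""

    def __init__(self, tokens, cached_dictionary=None):
        self.tokens = tokens                  # one token list per document
        self._dictionary = cached_dictionary  # built lazily, then cached

    @property
    def dictionary(self):
        # Created on first need and cached afterwards.
        if self._dictionary is None:
            self._dictionary = {t for doc in self.tokens for t in doc}
        return self._dictionary

    def subset(self, indices):
        # The bug being illustrated: the cached dictionary is carried
        # over unchanged to the subsampled corpus.
        return TinyCorpus([self.tokens[i] for i in indices], self._dictionary)


corpus = TinyCorpus([["apple", "pear"], ["apple", "plum"], ["kiwi"]])
n_types_full = len(corpus.dictionary)    # builds and caches the dictionary

sample = corpus.subset([0])              # keep only the first document
n_types_stale = len(sample.dictionary)   # read from the inherited cache
n_types_actual = len({t for doc in sample.tokens for t in doc})
```

In this toy setup the subsample reports the type count of the full corpus through the cached dictionary, while counting from its own tokens gives the smaller, correct number.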

Description of changes

Since the dictionary was introduced primarily for topic modelling, and topic modelling no longer uses it, I decided to remove it from Corpus. Every piece of code that used the dictionary can be rewritten without it.

This PR therefore removes the dictionary and updates all the code that used it.
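One way that code needing only the number of types could work without a cached dictionary is to count directly from the current tokens, so the result always reflects the documents actually in the corpus. This is a minimal sketch under that assumption; `count_types` is a hypothetical helper for illustration, not a function from the add-on.

```python
from itertools import chain


def count_types(tokens):
    """Return the number of unique tokens across all documents.

    `tokens` is an iterable of token lists, one list per document.
    """
    return len(set(chain.from_iterable(tokens)))


documents = [["apple", "pear"], ["apple", "plum"], ["kiwi"]]
full_count = count_types(documents)        # all three documents
sample_count = count_types(documents[:1])  # subsample: first document only
```

Because nothing is cached, the count for a subsample cannot go stale; the trade-off is that the set of types is rebuilt on every call.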

Includes

[X] Code changes
[X] Tests
[ ] Documentation

Codecov Report

Merging #990 (d76acaa) into master (9c0faca) will increase coverage by 0.02%. The diff coverage is 100.00%.

:exclamation: Current head d76acaa differs from pull request most recent head 946cab6. Consider uploading reports for the commit 946cab6 to get more accurate results

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##           master     #990      +/-   ##
==========================================
+ Coverage   79.66%   79.69%   +0.02%
==========================================
  Files          87       87
  Lines       12319    12326       +7
  Branches     1617     1620       +3
==========================================
+ Hits         9814     9823       +9
+ Misses       2211     2210       -1
+ Partials      294      293       -1
```
