biolab / orange3-text

🍊 :page_facing_up: Text Mining add-on for Orange3

[ENH] Snowball - Use ISO language codes #1029

Closed PrimozGodec closed 11 months ago

PrimozGodec commented 11 months ago
Issue

This PR is part of https://github.com/biolab/orange3-text/pull/963, which I am splitting into smaller pieces for easier review.

The main motivation is to let the Preprocess widget use the language stored in the Corpus.

Description of changes

This PR prepares the Snowball normalizer to accept and return languages as ISO codes. This is necessary for using the language from the Corpus, because the Corpus stores languages in ISO format.
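
To illustrate the idea, here is a minimal sketch of translating ISO codes to Snowball language names and back. It is not the code in this PR: `ISO2SNOWBALL`, `SNOWBALL2ISO`, and `snowball_name` are hypothetical names, and the table is only assumed to resemble what the normalizer needs. The language names follow the ones NLTK's `SnowballStemmer` uses.

```python
# Hypothetical lookup table: ISO 639-1 code -> Snowball language name.
# The add-on's real table may be named and populated differently.
ISO2SNOWBALL = {
    "da": "danish",
    "nl": "dutch",
    "en": "english",
    "fi": "finnish",
    "fr": "french",
    "de": "german",
    "hu": "hungarian",
    "it": "italian",
    "no": "norwegian",
    "pt": "portuguese",
    "ro": "romanian",
    "ru": "russian",
    "es": "spanish",
    "sv": "swedish",
}

# Reverse table, so a stemmer's language can be reported back as an ISO code.
SNOWBALL2ISO = {name: iso for iso, name in ISO2SNOWBALL.items()}


def snowball_name(iso_code: str) -> str:
    """Return the Snowball language name for an ISO code, or raise ValueError."""
    try:
        return ISO2SNOWBALL[iso_code]
    except KeyError:
        raise ValueError(
            f"Snowball has no stemmer for language code {iso_code!r}"
        ) from None
```

With such a table, callers only ever see ISO codes; the Snowball-specific name stays an internal detail of the normalizer.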

After changing Snowball to work with ISO language codes, I also had to adapt the Preprocess widget to store its settings as ISO codes and to call the Lemmagen filter with an ISO language code.
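
A rough sketch of what migrating a stored language setting could look like, under the assumption that older workflows saved a capitalized language name such as `"English"`. The settings layout, the `"language"` key, `LANGUAGE_NAME_TO_ISO`, and `migrate_language` are all hypothetical and only illustrate the approach; the widget's actual migration may differ.

```python
# Hypothetical table of language names as older settings might have stored them.
LANGUAGE_NAME_TO_ISO = {
    "English": "en",
    "German": "de",
    "French": "fr",
    "Slovenian": "sl",
}


def migrate_language(settings: dict, key: str = "language") -> dict:
    """Return a copy of `settings` with a language name replaced by its ISO code.

    Values that are already ISO codes, unknown values, and settings without
    the key are left untouched, so running the migration twice is harmless.
    """
    migrated = dict(settings)
    value = migrated.get(key)
    if value in LANGUAGE_NAME_TO_ISO:
        migrated[key] = LANGUAGE_NAME_TO_ISO[value]
    return migrated
```

For example, `migrate_language({"language": "English"})` yields `{"language": "en"}`, while settings that already hold `"en"` pass through unchanged.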

The corresponding UDPipe changes will follow in separate PRs.

PrimozGodec commented 11 months ago

@VesnaT as with #1025, you can test the migration by creating a workflow at tag 1.15.0 (`git checkout 1.15.0`) and then opening it with this change applied. It should work.

codecov-commenter commented 11 months ago

Codecov Report

Merging #1029 (aa3306c) into master (0495fd5) will decrease coverage by 0.02%. The diff coverage is 100.00%.

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##           master    #1029      +/-   ##
==========================================
- Coverage   82.21%   82.20%   -0.02%
==========================================
  Files          93       93
  Lines       12294    12295       +1
  Branches     1668     1670       +2
==========================================
- Hits        10108    10107       -1
- Misses       1877     1880       +3
+ Partials      309      308       -1
```