NatLibFi / Annif

Annif is a multi-algorithm automated subject indexing tool for libraries, archives and museums.
https://annif.org

Batch suggest in SVC backend #670

Closed osma closed 1 year ago

osma commented 1 year ago

This PR implements _suggest_batch instead of single-document _suggest in the SVC backend, allowing the use of more efficient vector operations.

I tested this with the 20 Newsgroups data set example, as shown on the wiki page of the SVC backend. Evaluation on the test set is 20-30% faster with the results otherwise unchanged.

There is one special case that requires attention: what to do when the input to the SVC classifier is empty or near-empty. The old _suggest code short-circuited the classifier in this case and returned an empty result. A unit test covers this, using the input text "j", which is an unknown token and thus equivalent to empty input. The SVC model itself has no problem returning some class for such input (maybe the majority class?), but I reimplemented the empty-input check to retain the old behaviour of refusing to classify it, which also keeps the unit test happy.
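To illustrate the idea, here is a minimal sketch of batch scoring with an empty-input check. It is not the code from this PR: it uses plain NumPy with a hypothetical linear scorer (`weights`, `intercept`) standing in for the trained SVC, and the function name `suggest_batch` and the logistic squashing of scores are assumptions made for the example.

```python
import numpy as np


def suggest_batch(features, weights, intercept, limit=3):
    """Score a whole batch of documents at once (hypothetical helper).

    features:  (n_docs, n_terms) matrix of term weights, e.g. TF-IDF
    weights:   (n_classes, n_terms) linear classifier coefficients
    intercept: (n_classes,) bias terms
    Returns one list of (class_index, score) pairs per document.
    """
    # One matrix product scores every document against every class.
    scores = features @ weights.T + intercept
    # Squash the raw decision values into (0, 1); an illustrative choice.
    scores = 1.0 / (1.0 + np.exp(-scores))
    # Rows with no non-zero features carry no information about the text.
    is_empty = ~features.any(axis=1)

    results = []
    for row, empty in zip(scores, is_empty):
        if empty:
            # Special case: refuse to classify rather than return a default class.
            results.append([])
            continue
        top = np.argsort(row)[::-1][:limit]
        results.append([(int(i), float(row[i])) for i in top])
    return results


# Toy data: 3 documents, 4 terms, 2 classes. The second document is empty,
# as it would be if every token were unknown to the vectorizer.
features = np.array([[1.0, 0.0, 0.5, 0.0],
                     [0.0, 0.0, 0.0, 0.0],
                     [0.0, 2.0, 0.0, 1.0]])
weights = np.array([[1.0, -1.0, 1.0, -1.0],
                    [-1.0, 1.0, -1.0, 1.0]])
intercept = np.array([0.1, -0.1])

batch = suggest_batch(features, weights, intercept)
```

Without the `is_empty` mask, the empty row would still receive a score from the intercept alone, so the classifier would return some class for it. The mask restores the behaviour of returning no suggestions for such input.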

With 1 job

| | user time | wall time | max rss |
|---|---|---|---|
| before (master) | 20.36 | 0:20.20 | 237848 |
| after (PR) | 14.35 | 0:14.17 | 237332 |

With 4 jobs

| | user time | wall time | max rss |
|---|---|---|---|
| before (master) | 23.16 | 0:07.92 | 186972 |
| after (PR) | 15.41 | 0:06.02 | 187756 |
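As a sanity check on the "20-30% faster" figure, the relative reduction in wall time can be recomputed from the numbers above. This is only a small illustrative sketch; the helper names are made up for the example. It gives roughly 30% with 1 job and roughly 24% with 4 jobs.

```python
def reduction_percent(before, after):
    """Relative reduction of `after` compared to `before`, in percent."""
    return 100.0 * (before - after) / before


def to_seconds(wall):
    """Convert an 'm:ss.ss' wall-clock string to seconds."""
    minutes, seconds = wall.split(":")
    return 60 * int(minutes) + float(seconds)


# (before, after) wall times copied from the timing results above
wall_times = {
    "1 job": ("0:20.20", "0:14.17"),
    "4 jobs": ("0:07.92", "0:06.02"),
}

reductions = {
    label: reduction_percent(to_seconds(before), to_seconds(after))
    for label, (before, after) in wall_times.items()
}

for label, pct in reductions.items():
    print(f"{label}: wall time reduced by {pct:.1f}%")
```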

Fixes #667

sonarcloud[bot] commented 1 year ago

Kudos, SonarCloud Quality Gate passed!

- 0 Bugs (rating A)
- 0 Vulnerabilities (rating A)
- 0 Security Hotspots (rating A)
- 0 Code Smells (rating A)
- No Coverage information
- 0.0% Duplication

codecov[bot] commented 1 year ago

Codecov Report

Base: 99.56% // Head: 99.56% // Decreases project coverage by 0.01%

Coverage data is based on head (1c587bd) compared to base (cc6dfcf). Patch coverage: 100.00% of modified lines in pull request are covered.

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##           master     #670      +/-   ##
==========================================
- Coverage   99.56%   99.56%   -0.01%
==========================================
  Files          87       87
  Lines        6145     6142       -3
==========================================
- Hits         6118     6115       -3
  Misses         27       27
```

| [Impacted Files](https://codecov.io/gh/NatLibFi/Annif/pull/670?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=NatLibFi) | Coverage Δ | |
|---|---|---|
| [annif/backend/svc.py](https://codecov.io/gh/NatLibFi/Annif/pull/670?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=NatLibFi#diff-YW5uaWYvYmFja2VuZC9zdmMucHk=) | `100.00% <100.00%> (ø)` | |
