NatLibFi / Annif

Annif is a multi-algorithm automated subject indexing tool for libraries, archives and museums.
https://annif.org

Support for batch suggest operations for CLI commands #663

Closed juhoinkinen closed 1 year ago

juhoinkinen commented 1 year ago

Adds support for passing multiple documents in a batch from the suggest, index, eval and optimize CLI commands to backends (as discussed in issue #579).

Makes annif suggest accept path(s) to file(s) to be indexed, in addition to stdin:

annif suggest yso-tfidf-en <document.txt              # just like before
annif suggest yso-tfidf-en document.txt               # the same, but from a named file
annif suggest yso-tfidf-en doc1.txt doc2.txt doc3.txt # many files
annif suggest yso-tfidf-en doc*.txt                   # similar to above, but using shell expansion
annif suggest yso-tfidf-en -                          # stdin with dash
annif suggest yso-tfidf-en doc1.txt -                 # mixing file and stdin with dash

The output for each file path input begins with the line Suggestions for <file/path>:

Suggestions for tests/corpora/archaeology/fulltext/440866.txt
<http://www.yso.fi/onto/yso/p6218>  riimukirjoitus  0.3213897943496704
<http://www.yso.fi/onto/yso/p6479>  viikingit   0.18659920990467072
<http://www.yso.fi/onto/yso/p12738> viikinkiaika    0.18625082075595856
<http://www.yso.fi/onto/yso/p22768> Kiinan muuri    0.15950888395309448
<http://www.yso.fi/onto/yso/p3973>  antiikki    0.13840530812740326
<http://www.yso.fi/onto/yso/p14588> riimukivet  0.1362432837486267
<http://www.yso.fi/onto/yso/p14173> kaivaukset  0.1201547235250473
<http://www.yso.fi/onto/yso/p5713>  hautalöydöt 0.11249098181724548
<http://www.yso.fi/onto/yso/p15031> viikinkiretket  0.11039584875106812
<http://www.yso.fi/onto/yso/p5714>  muinaishaudat   0.10336380451917648

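The input handling described above (named files, stdin, and `-` as a stdin placeholder) can be sketched as a generator that reads texts lazily. This is only an illustration under stated assumptions, not Annif's actual CLI code: `iter_inputs` and its `stdin` parameter are hypothetical names introduced here.

```python
import io
import sys


def iter_inputs(paths, stdin=None):
    """Lazily yield (path, text) pairs; "-" means read from stdin.

    Hypothetical helper for illustration, not Annif's implementation.
    """
    stdin = sys.stdin if stdin is None else stdin
    for path in paths:
        if path == "-":
            # A dash stands for standard input
            yield path, stdin.read()
        else:
            with open(path, encoding="utf-8") as docfile:
                yield path, docfile.read()


if __name__ == "__main__":
    # Simulate a document arriving on stdin
    fake_stdin = io.StringIO("kissa\n")
    for path, text in iter_inputs(["-"], stdin=fake_stdin):
        print(path, len(text))
```

Because the function is a generator, no file is opened until the consumer asks for the next text, which keeps memory use flat when many files are given.
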
The documents are passed to the backend.py module as batches, i.e. lists of texts (for the suggest command with files, the lists are also produced by a generator). backend.py defines a default _suggest_batch method that falls back to the regular, single-text _suggest method. This allows individual backends to define their own _suggest_batch methods that operate on the whole document batch.
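A minimal sketch of this fallback pattern, assuming simplified signatures: the method names `_suggest` and `_suggest_batch` come from the description above, but the `params` argument, the class names and the toy return values are assumptions made for illustration and do not reproduce Annif's real classes.

```python
class BaseBackend:
    """Hypothetical stand-in for a backend base class."""

    def _suggest(self, text, params):
        raise NotImplementedError

    def _suggest_batch(self, texts, params):
        # Default: no real batching, just one _suggest call per text
        return [self._suggest(text, params) for text in texts]


class WordCountBackend(BaseBackend):
    """Toy backend that implements only the single-text method."""

    def _suggest(self, text, params):
        return len(text.split())


class BatchWordCountBackend(WordCountBackend):
    """Toy backend that overrides the batch method."""

    def _suggest_batch(self, texts, params):
        # A real backend could process the whole batch in one call here
        return [len(words) for words in map(str.split, texts)]


if __name__ == "__main__":
    texts = ["kissa koira", "laiva"]
    print(WordCountBackend()._suggest_batch(texts, {}))
    print(BatchWordCountBackend()._suggest_batch(texts, {}))
```

Callers only ever use the batch method, so a backend gains batch support by overriding one method and nothing else has to change.
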

The implementation of document batching for the hyperopt CLI command is left out of this PR, as the hyperopt functionality is implemented in the individual backends.

Note that no backend supports actual batch processing yet.

codecov[bot] commented 1 year ago

Codecov Report

Base: 99.55% // Head: 89.44% // Decreases project coverage by -10.11% :warning:

Coverage data is based on head (ec16a21) compared to base (ca4d61c). Patch coverage: 97.94% of modified lines in pull request are covered.

Additional details and impacted files

```diff
@@             Coverage Diff             @@
##           master     #663       +/-   ##
===========================================
- Coverage   99.55%   89.44%   -10.11%
===========================================
  Files          87       87
  Lines        6017     6142     +125
===========================================
- Hits         5990     5494     -496
- Misses         27      648     +621
```

| Impacted Files | Coverage Δ | |
|---|---|---|
| tests/test_backend_fasttext.py | `100.00% <ø> (ø)` | |
| tests/test_backend_nn_ensemble.py | `6.77% <0.00%> (-93.23%)` | :arrow_down: |
| tests/test_backend_omikuji.py | `5.15% <0.00%> (-94.85%)` | :arrow_down: |
| tests/test_backend_pav.py | `100.00% <ø> (ø)` | |
| tests/test_backend_yake.py | `6.94% <0.00%> (-93.06%)` | :arrow_down: |
| annif/backend/backend.py | `100.00% <100.00%> (ø)` | |
| annif/backend/dummy.py | `100.00% <100.00%> (ø)` | |
| annif/backend/ensemble.py | `100.00% <100.00%> (ø)` | |
| annif/backend/pav.py | `98.87% <100.00%> (ø)` | |
| annif/cli.py | `99.70% <100.00%> (+0.01%)` | :arrow_up: |
| ... and 28 more | | |


juhoinkinen commented 1 year ago

Some open thoughts & questions:

juhoinkinen commented 1 year ago

I reverted the REST API commits (i.e. those adding the /v1/suggest-batch method) and cherry-picked them to a new branch, issue579-batch-suggest-operation-rest-api. This PR was already quite large, and might grow further if batching in other CLI commands is implemented.

juhoinkinen commented 1 year ago

I tried to utilize the new suggest_batch function of the project module for the eval CLI command, but it does not seem very straightforward.

First, the eval command uses imap_unordered, which expects an iterable as its second argument, but suggest_batch operates on a corpus object (which exposes its documents iterable only as a property). Also, I'm not sure if imap_unordered could in any way feed multiple documents to the suggest_batch function: it has the chunksize parameter, but afaik even when setting that to a non-default value (!=1) it does not send multiple elements from the iterable to the function in one call.
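The chunksize behaviour can be checked with a small experiment. The sketch below uses `multiprocessing.pool.ThreadPool`, which shares the `imap_unordered` interface with the process pool, only to keep the example self-contained. The `batched` helper and the `echo` worker are hypothetical names: with `chunksize=4` the worker still receives one element per call, whereas pre-grouping the iterable makes each call receive a list.

```python
from itertools import islice
from multiprocessing.pool import ThreadPool


def batched(iterable, batch_size):
    """Group an iterable into lists of at most batch_size items."""
    it = iter(iterable)
    while batch := list(islice(it, batch_size)):
        yield batch


def echo(arg):
    # Return exactly what the pool passed to the worker function
    return arg


docs = [f"doc{i}" for i in range(10)]

with ThreadPool(2) as pool:
    # chunksize only affects how tasks are shipped to workers;
    # the function is still called once per element
    per_item = list(pool.imap_unordered(echo, docs, chunksize=4))
    # Pre-batching the iterable makes each call receive a whole list
    per_batch = list(pool.imap_unordered(echo, batched(docs, 4)))

if __name__ == "__main__":
    print(type(per_item[0]).__name__, len(per_item))
    print(type(per_batch[0]).__name__, len(per_batch))
```

So one way to combine the two is to hand imap_unordered an iterable of batches rather than an iterable of documents.
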

juhoinkinen commented 1 year ago

I added BatchingDocumentCorpus to help get batches of documents, but I wonder could the doc_batches(batch_size) method be added already to the DocumentCorpus base class, as batching of documents is used quite many times. (TODO: add testing of the batching method.)
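A rough sketch of what a doc_batches(batch_size) method on a corpus base class might look like. The classes below are simplified, hypothetical stand-ins rather than Annif's real DocumentCorpus; only the method name and its batch_size parameter come from the comment above.

```python
from itertools import islice


class DocumentCorpus:
    """Hypothetical, simplified corpus base class (not Annif's real one)."""

    @property
    def documents(self):
        raise NotImplementedError

    def doc_batches(self, batch_size):
        """Yield lists of at most batch_size documents."""
        it = iter(self.documents)
        while True:
            batch = list(islice(it, batch_size))
            if not batch:
                return
            yield batch


class ListCorpus(DocumentCorpus):
    """Toy corpus backed by an in-memory list."""

    def __init__(self, docs):
        self._docs = docs

    @property
    def documents(self):
        yield from self._docs


if __name__ == "__main__":
    corpus = ListCorpus(["kissa", "koira", "laiva", "viikinki", "riimukivet"])
    for batch in corpus.doc_batches(2):
        print(batch)
```

With the method on the base class, every corpus subclass that provides a documents iterable gets batching without further code.
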

(Black v23.1.0 was just released, and it introduced some changes to the style, which raise complaints for the current Annif code. :( )

osma commented 1 year ago

I wonder could the doc_batches(batch_size) method be added already to the DocumentCorpus base class, as batching of documents is used quite many times.

Yes, this sounds like a great idea! I think it would simplify the code and enable batching in many places.

osma commented 1 year ago

Oh by the way, I tested the annif eval command (yso-mllm-fi project, evaluated on kirjastonhoitaja test set) with the -j 4 option, before and after this PR. The evaluation results were the same. It took a few seconds longer with the code in this PR, probably because the parallel processing is done on larger batches and the final batches cannot be distributed between CPUs. But I think that's not too bad, and we should see performance gains in the future as we implement support for batching in individual backends.

juhoinkinen commented 1 year ago

Now some debug-level logs from project module are removed:

Before:

annif suggest tfidf-fi <kissa.txt -v DEBUG
debug: creating app with configuration annif.default_config.Config
debug: Reading configuration file projects.cfg in CFG format
debug: loading subjects from data/vocabs/yso/subjects.csv
debug: Suggesting subjects for text "kissa
debug: ..." (len=6)
debug: Backend tfidf: loading vectorizer from data/projects/tfidf-fi/vectorizer
debug: Backend tfidf: loading similarity index from data/projects/tfidf-fi/tfidf-index
debug: Backend tfidf: Suggesting subjects for text "kissa
debug: ..." (len=6)
debug: Got 100 hits from backend tfidf
debug: 100 hits from backend
<http://www.yso.fi/onto/yso/p19378> kissa   0.9530429244041443
<http://www.yso.fi/onto/yso/p864>   kissaeläimet    0.5541983842849731

Now:

annif suggest tfidf-fi <kissa.txt -v DEBUG
debug: creating app with configuration annif.default_config.Config
debug: Reading configuration file projects.cfg in CFG format
debug: loading subjects from data/vocabs/yso/subjects.csv
debug: Backend tfidf: loading vectorizer from data/projects/tfidf-fi/vectorizer
debug: Backend tfidf: loading similarity index from data/projects/tfidf-fi/tfidf-index
debug: Backend tfidf: Suggesting subjects for text "kissa
debug: ..." (len=6)
<http://www.yso.fi/onto/yso/p19378> kissa   0.9530429244041443
<http://www.yso.fi/onto/yso/p864>   kissaeläimet    0.5541983842849731

And in the case of giving multiple files to annif suggest, the debug log has some duplication:

annif suggest tfidf-fi *.txt -v DEBUG
debug: creating app with configuration annif.default_config.Config
debug: Reading configuration file projects.cfg in CFG format
debug: loading subjects from data/vocabs/yso/subjects.csv
debug: Backend tfidf: loading vectorizer from data/projects/tfidf-fi/vectorizer
debug: Backend tfidf: loading similarity index from data/projects/tfidf-fi/tfidf-index
debug: Backend tfidf: Suggesting subjects for text "kissa
debug:  cat
debug: dog
debug: viiki..." (len=24)
debug: Backend tfidf: Suggesting subjects for text "kissa
debug: ..." (len=6)
debug: Backend tfidf: Suggesting subjects for text "koira
debug: ..." (len=6)
debug: Backend tfidf: Suggesting subjects for text "laiva
debug: ..." (len=6)
debug: Backend tfidf: Suggesting subjects for text "viikinki
debug: ..." (len=9)

osma commented 1 year ago

Now some debug-level logs from project module are removed

Right, because the debug information was printed only in the single text suggest methods. I don't think this is a big loss.

And in the case of giving multiple files to annif suggest, the debug log has some duplication:

I don't see any exact duplicates - the messages are related to different input files, right?

sonarcloud[bot] commented 1 year ago

Kudos, SonarCloud Quality Gate passed!

0 Bugs (rating A)
0 Vulnerabilities (rating A)
0 Security Hotspots (rating A)
6 Code Smells (rating A)

No Coverage information
0.0% Duplication