NatLibFi / Annif

Annif is a multi-algorithm automated subject indexing tool for libraries, archives and museums.
https://annif.org

Refactor: Represent suggestion results as sparse arrays #681

Closed. osma closed this 1 year ago.

osma commented 1 year ago

This PR reimplements the representation and handling of subject suggestions: results are now stored as sparse arrays that contain the suggestions for a whole batch of documents. Fixes #678, where the idea was originally proposed.

Old approach

In the old approach, there was an abstract class SuggestionResult with concrete subclasses VectorSuggestionResult and ListSuggestionResult (as well as a LazySuggestionResult wrapper class). All of these were oriented around representing the results for a single document, so operations on multiple documents (e.g. annif eval) created lots of SuggestionResult instances, and operations such as filtering (by limit and threshold) had to be performed separately on each instance, which can be inefficient. In addition, VectorSuggestionResult stored 1D NumPy arrays which, for a large vocabulary such as YSO, usually contain mostly zeros, which wastes memory. There was also extra complexity because of the two alternative representations and the need to convert between them depending on which representation was more appropriate for a given purpose.
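To illustrate the memory argument, here is a small sketch (not Annif code; the vocabulary size is just an assumption of roughly YSO's order of magnitude) comparing a dense per-document score vector with its sparse equivalent:

```python
import numpy as np
from scipy.sparse import csr_array

VOCAB_SIZE = 40_000                              # assumed vocabulary size, for illustration only
scores = np.zeros(VOCAB_SIZE, dtype=np.float32)  # dense score vector for one document
scores[[17, 512, 30_001]] = [0.9, 0.4, 0.2]      # only a handful of suggestions are non-zero

dense_bytes = scores.nbytes                      # 160 000 bytes spent on 3 useful values
sparse = csr_array(scores.reshape(1, -1))        # stores only the non-zero entries
sparse_bytes = sparse.data.nbytes + sparse.indices.nbytes + sparse.indptr.nbytes
print(dense_bytes, sparse_bytes)                 # 160000 vs. a few dozen bytes
```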

New approach

This PR removes all the above-mentioned classes. Instead, suggestion results are represented using SciPy sparse arrays that contain the suggestions for a whole batch of documents (up to 32). The new classes in the annif.suggestion module are:

Like their predecessors, all of these classes are intended to be immutable (although I guess this could be better enforced in the code). So once created, they will not change; operations such as .filter() will return a new instance.
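As a rough sketch of the idea (these are not the actual Annif classes or method signatures, just an illustration of the batch representation and of filtering returning a new instance):

```python
import numpy as np
from scipy.sparse import csr_array

def filter_batch(batch: csr_array, limit: int, threshold: float) -> csr_array:
    """Keep at most `limit` suggestions per document, dropping scores below `threshold`."""
    data, rows, cols = [], [], []
    for doc_idx in range(batch.shape[0]):
        row = batch[[doc_idx], :].toarray().ravel()     # scores of one document
        for subj_idx in np.argsort(row)[::-1][:limit]:  # best `limit` subjects
            if row[subj_idx] > 0 and row[subj_idx] >= threshold:
                data.append(row[subj_idx])
                rows.append(doc_idx)
                cols.append(subj_idx)
    return csr_array((data, (rows, cols)), shape=batch.shape)  # new array, input untouched

# a batch of 2 documents over a toy vocabulary of 8 subjects
batch = csr_array(np.array([[0.0, 0.9, 0.0, 0.4, 0.0, 0.0, 0.1, 0.0],
                            [0.2, 0.0, 0.7, 0.0, 0.0, 0.0, 0.0, 0.0]]))
filtered = filter_batch(batch, limit=2, threshold=0.3)
```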

Notes about sparse arrays

SciPy sparse matrices have been around for a long time, but sparse arrays are relatively new (introduced in SciPy 1.8 released in February 2022). The arrays are very similar to matrices (and currently implemented as a subclass), but their API is closer to that of NumPy arrays. SciPy documentation now recommends the use of sparse arrays (instead of matrices) for new code, so I decided to follow that recommendation. However, there are a few cases in this PR where SciPy methods on sparse arrays return matrices instead of arrays, and they have to be explicitly converted to arrays.
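For example (a minimal sketch; exactly which operations return matrices depends on the SciPy version in use, so the conversion below is written defensively):

```python
from scipy.sparse import csr_array

a = csr_array(([0.8, 0.3], ([0, 1], [2, 5])), shape=(2, 10))
b = csr_array(([0.5, 0.5], ([0, 1], [2, 5])), shape=(2, 10))

result = a.multiply(b)       # on some SciPy versions this comes back as a sparse *matrix*
result = csr_array(result)   # explicit conversion keeps the code working with arrays only
```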

Changes to functionality

The original idea of this PR was to simply refactor without changing any functionality, but I ended up making a couple of changes:

  1. This PR removes the LRAP evaluation metric, because the implementation in scikit-learn doesn't accept sparse arrays as input; it isn't much used anyway, and nDCG is a very similar ranking metric, so I don't think this is a big loss.
  2. The reimplemented annif optimize command now supports a --jobs parameter for parallel processing, just like annif eval; this fixes #423.

Performance

I had to do extensive benchmarking and many rounds of optimizations before I was happy with the results. The representation of suggestion results is very fundamental and performance-critical; any inefficient operations will easily degrade performance. Despite its flaws, the old implementation was pretty well optimized. I have categorized the benchmark results according to the outcome as measured by wall time: obvious improvement (more than ~5%), obvious slowdown (more than ~5%), and no significant difference (everything in between).

I performed the benchmarks on a 48-CPU physical server. There is sometimes quite a bit of variation between successive runs, especially for short operations, so in some cases (basic CLI commands) I repeated the operation 5 times and took the best wall time result. For the most part, and unless otherwise noted, I used the Finto AI 2022-11 pretrained Finnish language models and the "kirjastonhoitaja" train, validation or test sets, depending on which was most appropriate for the operation.

Memory usage was mostly unchanged, though in a few cases (e.g. MLLM and SVC eval operations) there is a clear improvement, probably because sparse arrays replace dense NumPy vectors, which take up much more space. I have commented on memory usage in a few cases below.

Better performance

annif eval (tfidf, omikuji, fasttext, mllm, ensemble, nn_ensemble) - small improvements

| tfidf eval | before (main) -j1 | after (PR) -j1 | before (main) -j4 | after (PR) -j4 |
| ---------- | ----------------- | -------------- | ----------------- | -------------- |
| user time | 147.37 | 136.89 | 151.87 | 141.18 |
| wall time | 2:30.42 | 2:18.74 | 0:49.10 | 0:40.49 |
| max rss | 621332 | 628936 | 409056 | 428944 |

Here I used the Annif tutorial yso-nlf data set. Clearly faster than before, though memory usage has increased a little bit (but only around 10MB) in this case.

| omikuji eval | before (main) -j1 | after (PR) -j1 | before (main) -j4 | after (PR) -j4 |
| ------------ | ----------------- | -------------- | ----------------- | -------------- |
| user time | 62.32 | 56.21 | 66.66 | 57.51 |
| wall time | 1:04.78 | 0:59.01 | 1:01.39 | 0:53.45 |
| max rss | 6026176 | 5924520 | 6001608 | 5911072 |

A little bit faster, and saving around 100MB of memory.

| fasttext eval | before (main) -j1 | after (PR) -j1 | before (main) -j4 | after (PR) -j4 |
| ------------- | ----------------- | -------------- | ----------------- | -------------- |
| user time | 24.82 | 16.56 | 33.72 | 25.84 |
| wall time | 0:28.59 | 0:18.81 | 0:22.19 | 0:15.66 |
| max rss | 7604752 | 7566168 | 7537456 | 7536960 |

Much faster proportionally, but it was already fast, so the absolute difference is only a few seconds. This shows that the calculation of evaluation metrics has been optimized, as the classifier remains the same. Memory usage is 40MB lower than before.

| mllm eval | before (main) -j1 | after (PR) -j1 | before (main) -j4 | after (PR) -j4 |
| --------- | ----------------- | -------------- | ----------------- | -------------- |
| user time | 138.03 | 139.61 | 146.27 | 135.36 |
| wall time | 2:17.31 | 2:19.12 | 0:53.92 | 0:45.33 |
| max rss | 417892 | 327208 | 398172 | 312432 |

Overall execution time remains about the same, but memory usage has decreased by almost 100MB.

| ensemble eval | before (main) -j1 | after (PR) -j1 | before (main) -j4 | after (PR) -j4 |
| ------------- | ----------------- | -------------- | ----------------- | -------------- |
| user time | 190.05 | 180.34 | 215.69 | 208.61 |
| wall time | 3:15.67 | 3:06.35 | 1:51.35 | 1:45.40 |
| max rss | 13304224 | 13248516 | 13171624 | 13176440 |

Again a bit faster, and uses 50MB less memory.

| nn_ensemble eval | before (main) -j1 | after (PR) -j1 | before (main) -j4 | after (PR) -j4 |
| ---------------- | ----------------- | -------------- | ----------------- | -------------- |
| user time | 209.98 | 190.73 | 219.25 | 215.77 |
| wall time | 3:32.68 | 3:07.38 | 1:53.31 | 1:48.60 |
| max rss | 13834460 | 13734268 | 13592216 | 13592288 |

Faster, and uses 100MB less memory.
annif optimize (mllm, fasttext, omikuji, nn_ensemble) - much faster, especially with the new -j4 option

| mllm optimize | before (main) | after (PR) -j1 | after (PR) -j4 |
| ------------- | ------------- | -------------- | -------------- |
| user time | 398.12 | 133.62 | 131.32 |
| wall time | 6:40.40 | 2:12.46 | 1:09.95 |
| max rss | 402928 | 326660 | 312268 |

| fasttext optimize | before (main) | after (PR) -j1 | after (PR) -j4 |
| ----------------- | ------------- | -------------- | -------------- |
| user time | 310.08 | 51.17 | 61.31 |
| wall time | 5:17.04 | 0:52.77 | 0:51.56 |
| max rss | 7569820 | 7565416 | 7537044 |

| omikuji optimize | before (main) | after (PR) -j1 | after (PR) -j4 |
| ---------------- | ------------- | -------------- | -------------- |
| user time | 351.47 | 90.24 | 90.98 |
| wall time | 5:57.19 | 1:31.88 | 1:29.20 |
| max rss | 6353648 | 5923684 | 5900328 |

| nn_ensemble optimize | before (main) | after (PR) -j1 | after (PR) -j4 |
| -------------------- | ------------- | -------------- | -------------- |
| user time | 459.32 | 195.40 | 214.07 |
| wall time | 7:44.59 | 3:18.16 | 2:14.56 |
| max rss | 13765600 | 13733652 | 13597472 |

All of these are 2-3x faster than before with a single job, and in some cases (mllm, nn_ensemble) the new `--jobs 4` option improves performance further. Memory usage has decreased for MLLM and Omikuji.

Reduced performance

annif suggest (mllm) with a corpus of documents - slightly slower

| mllm suggest | before (main) | after (PR) |
| ------------ | ------------- | ---------- |
| user time | 124.31 | 136.56 |
| wall time | 2:04.31 | 2:16.56 |
| max rss | 312952 | 321740 |

This is a little bit slower than before, probably because results have to be converted from NumPy vectors to sparse arrays, which in this case just slows things down.
annif train (pav) - slightly slower

| pav train | before (main) | after (PR) |
| --------- | ------------- | ---------- |
| user time | 1306.44 | 1374.53 |
| wall time | 21:53.99 | 23:03.23 |
| max rss | 13397520 | 13394764 |

I didn't investigate this in detail, but clearly there is a bit of a slowdown.
annif hyperopt (ensemble) - much longer wall time due to bad threading

| ensemble hyperopt | before (main) | after (PR) |
| ----------------- | ------------- | ---------- |
| user time | 236.50 | 238.26 |
| wall time | 1:35.12 | 2:06.23 |
| max rss | 13558412 | 13205052 |

The user time is unchanged, but wall time has increased significantly. I did some profiling, and it seems to me that the main reason for the slowdown is contention for the GIL; due to the way sparse arrays are implemented and used, eval calculations are now mostly done in Python code instead of NumPy operations, and NumPy is better at threading since it can release the GIL during some calculations. A silver lining here is that memory usage decreased by around 250MB. (See below for a suggestion on how to improve this.)

No significant difference

Basic CLI operations

| annif --help | before (main) | after (PR) |
| ------------ | ------------- | ---------- |
| user time | 1.48 | 1.09 |
| wall time | 0:00.67 | 0:00.72 |
| max rss | 61068 | 61000 |

| annif list-projects | before (main) | after (PR) |
| ------------------- | ------------- | ---------- |
| user time | 4.96 | 4.99 |
| wall time | 0:03.39 | 0:03.44 |
| max rss | 405000 | 405632 |

| annif list-vocabs | before (main) | after (PR) |
| ----------------- | ------------- | ---------- |
| user time | 4.81 | 5.26 |
| wall time | 0:03.24 | 0:03.18 |
| max rss | 406336 | 405564 |

Lots of variation between runs, but there is no significant difference.
annif suggest (most backends) with a corpus of documents

| tfidf suggest | before (main) | after (PR) |
| ------------- | ------------- | ---------- |
| user time | 139.18 | 137.01 |
| wall time | 2:21.40 | 2:18.01 |
| max rss | 540700 | 544692 |

| fasttext suggest | before (main) | after (PR) |
| ---------------- | ------------- | ---------- |
| user time | 15.46 | 14.66 |
| wall time | 0:17.32 | 0:17.18 |
| max rss | 7464564 | 7466736 |

| omikuji suggest | before (main) | after (PR) |
| --------------- | ------------- | ---------- |
| user time | 52.08 | 52.66 |
| wall time | 0:55.29 | 0:55.55 |
| max rss | 6266832 | 6268096 |

| svc suggest (ykl) | before (main) | after (PR) |
| ----------------- | ------------- | ---------- |
| user time | 18.24 | 18.52 |
| wall time | 0:17.87 | 0:17.68 |
| max rss | 1985268 | 1985208 |

| ensemble suggest | before (main) | after (PR) |
| ---------------- | ------------- | ---------- |
| user time | 183.46 | 180.69 |
| wall time | 3:10.45 | 3:10.64 |
| max rss | 13248652 | 13203956 |

| pav suggest | before (main) | after (PR) |
| ----------- | ------------- | ---------- |
| user time | 191.90 | 192.24 |
| wall time | 3:17.12 | 3:18.13 |
| max rss | 13250576 | 13214236 |

| nn_ensemble suggest | before (main) | after (PR) |
| ------------------- | ------------- | ---------- |
| user time | 192.41 | 187.32 |
| wall time | 3:14.34 | 3:12.49 |
| max rss | 13733072 | 13735572 |
annif eval (svc, pav)

| svc eval (ykl) | before (main) -j1 | after (PR) -j1 | before (main) -j4 | after (PR) -j4 |
| -------------- | ----------------- | -------------- | ----------------- | -------------- |
| user time | 31.32 | 31.27 | 35.23 | 33.68 |
| wall time | 0:30.85 | 0:29.71 | 0:20.64 | 0:19.18 |
| max rss | 2088652 | 1949652 | 2006592 | 1881268 |

This was done using YKL as the vocabulary and the kirjaesittelyt2021/ykl test set. Runtime is about the same, but memory usage decreased by 130MB!

| pav eval | before (main) -j1 | after (PR) -j1 | before (main) -j4 | after (PR) -j4 |
| -------- | ----------------- | -------------- | ----------------- | -------------- |
| user time | 189.44 | 192.63 | 232.83 | 221.67 |
| wall time | 3:16.27 | 3:18.89 | 2:00.68 | 1:49.79 |
| max rss | 13298404 | 13236048 | 13215296 | 13178672 |

A bit faster when using `-j4`, otherwise no big difference, though memory usage decreased by 60MB.
annif train (nn_ensemble)

| nn_ensemble train | before (main) -j4 | after (PR) -j4 |
| ----------------- | ----------------- | -------------- |
| user time | 4243.75 | 4292.64 |
| wall time | 8:41.98 | 8:43.56 |
| max rss | 15620888 | 16439540 |

Overall train time remains about the same, but memory usage has increased. I think there is room for improvement here, as noted below.

Code quality

On the whole, this PR reduces the amount of code by more than 100 lines, mainly because having a single representation for suggestion results simplifies the implementation. I've also dropped some unit tests and added new ones; the total number of tests has grown by one.

I tried to make the new implementation as clean as possible, but there are a few places where Codebeat complains about too much complexity. In both SuggestionBatch.from_sequence and filter_suggestions the problem is that efficiently constructing a sparse array is a bit complicated and messy, as it involves creating three parallel lists; I originally tried to use dok_array as an intermediate representation, but it turned out to be slow. Anyway, I'm running out of ideas on how to simplify the code further. Suggestions welcome!
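For reference, the construction pattern in question looks roughly like this (a stripped-down sketch with illustrative names, not the actual Annif code):

```python
from scipy.sparse import csr_array

def batch_from_suggestions(batch_suggestions, vocab_size):
    """batch_suggestions: a list with one iterable of (subject_id, score) pairs per document."""
    data, rows, cols = [], [], []                    # the three parallel lists
    for doc_idx, suggestions in enumerate(batch_suggestions):
        for subject_id, score in suggestions:
            if score > 0.0:                          # never store explicit zeros
                data.append(score)
                rows.append(doc_idx)
                cols.append(subject_id)
    # one COO-style construction call instead of incremental (e.g. dok_array) updates
    return csr_array((data, (rows, cols)),
                     shape=(len(batch_suggestions), vocab_size))

batch = batch_from_suggestions([[(17, 0.9), (512, 0.4)], [(3, 0.7)]], vocab_size=40_000)
```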

Potential future work

I'm not quite happy with how the NN ensemble handles suggestion results from other projects, both during training and suggest operations. For example, the training samples are stored in LMDB one document at a time, but now it would be easier to store them as whole batches instead, which could be more efficient. But I decided that this PR is already much too big and it would make sense to try to improve batching in the NN ensemble in a separate follow-up PR. There is already an attempt to do part of this in PR #676; that could be a possible starting point.

The threading performance of annif hyperopt was already bad, and this PR makes it worse. The solution could be to switch to process-based parallelism. This has been difficult to do with Optuna (it used to require an external relational database), but the Optuna FAQ now states that it can also be done with JournalFileStorage, which sounds more promising.
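A hedged sketch of what the FAQ recipe would look like (placeholder objective and file name; not yet wired into Annif):

```python
import optuna
from optuna.storages import JournalStorage, JournalFileStorage

def objective(trial):
    x = trial.suggest_float("x", 0.0, 1.0)   # placeholder search space
    return (x - 0.42) ** 2

# Several worker processes can share one study through a journal file,
# without needing an external relational database.
storage = JournalStorage(JournalFileStorage("./optuna-journal.log"))
study = optuna.create_study(study_name="hyperopt-example", storage=storage,
                            load_if_exists=True)
study.optimize(objective, n_trials=20)       # run this in each worker process
```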

codecov[bot] commented 1 year ago

Codecov Report

Patch coverage: 100.00% and project coverage change: +0.02 :tada:

Comparison is base (693ab21) 99.58% compared to head (3edab48) 99.61%.

Additional details and impacted files:

```diff
@@            Coverage Diff             @@
##             main     #681      +/-   ##
==========================================
+ Coverage   99.58%   99.61%   +0.02%
==========================================
  Files          89       89
  Lines        6268     6166     -102
==========================================
- Hits         6242     6142     -100
+ Misses         26       24       -2
```

| Impacted Files | Coverage Δ | |
| --- | --- | --- |
| annif/util.py | `98.41% <ø> (-0.16%)` | :arrow_down: |
| annif/backend/backend.py | `100.00% <100.00%> (ø)` | |
| annif/backend/dummy.py | `100.00% <100.00%> (ø)` | |
| annif/backend/ensemble.py | `100.00% <100.00%> (ø)` | |
| annif/backend/fasttext.py | `100.00% <100.00%> (ø)` | |
| annif/backend/http.py | `100.00% <100.00%> (ø)` | |
| annif/backend/mixins.py | `97.50% <100.00%> (+2.37%)` | :arrow_up: |
| annif/backend/mllm.py | `100.00% <100.00%> (ø)` | |
| annif/backend/nn_ensemble.py | `100.00% <100.00%> (ø)` | |
| annif/backend/omikuji.py | `97.53% <100.00%> (ø)` | |
| ... and 27 more | | |

osma commented 1 year ago

Rebased on current main and force-pushed, mainly because I need to consider the CLI refactoring done in #675 before making changes to CLI-related code.

sonarcloud[bot] commented 1 year ago

SonarCloud Quality Gate failed.

- Bugs: 0 (rating A)
- Vulnerabilities: 0 (rating A)
- Security Hotspots: 0 (rating A)
- Code Smells: 2 (rating A)
- Coverage: no coverage information
- Duplication: 4.8%