NatLibFi / Annif

Annif is a multi-algorithm automated subject indexing tool for libraries, archives and museums.
https://annif.org

Support batch suggest in STWFSA backend #666

Open osma opened 1 year ago

osma commented 1 year ago

PR #663 is going to bring support for batch suggest operations.

The STWFSA backend could benefit from implementing _suggest_batch instead of _suggest, so that a whole batch of texts can be processed with parallel and/or vectorized operations.
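For reference, here is the idea in code form. This is only a rough sketch, not the actual backend code: _suggest and _suggest_batch are the hook names discussed above, while _model, suggest_proba and _to_suggestions are placeholders for whatever stwfsapy and the STWFSA backend really expose.

```python
# Sketch of the pattern only, not the real implementation.
class StwfsaBackendSketch:
    def _suggest(self, text, params):
        # current style: one stwfsapy call per document
        return self._to_suggestions(self._model.suggest_proba([text])[0])

    def _suggest_batch(self, texts, params):
        # batched style: a single stwfsapy call for the whole batch,
        # giving it a chance to vectorize or parallelize internally
        batch_results = self._model.suggest_proba(list(texts))
        return [self._to_suggestions(result) for result in batch_results]
```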

osma commented 1 year ago

I tried this; the code is on the issue666-suggest-batch-stwfsa branch.

Unfortunately, the results were not very encouraging. Batched suggest done this way seems to be slower than the original. Maybe switching to a new representation for suggestion results (see #678) could help.

I also tried using the predict_proba method of stwfsapy, which returns the results as a sparse matrix. The problem there is that stwfsapy internally uses different numeric IDs for concepts than Annif, so an ID mapping mechanism would be needed to convert the results into something that Annif can use.
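A rough sketch of what such a mapping could look like (not code from the branch; stwfsa_concept_uris and uri_to_subject_id are hypothetical inputs standing in for however the two ID spaces would actually be matched, e.g. via concept URIs):

```python
import numpy as np
from scipy.sparse import csr_matrix


def remap_columns(proba, stwfsa_concept_uris, uri_to_subject_id, n_subjects):
    """Remap a (n_docs x n_stwfsa_concepts) sparse probability matrix so
    that its columns follow Annif subject IDs instead of stwfsapy's
    internal concept IDs."""
    # stwfsapy column index -> Annif subject ID, -1 for concepts that
    # Annif does not know about (hypothetical mapping inputs)
    col_map = np.array(
        [uri_to_subject_id.get(uri, -1) for uri in stwfsa_concept_uris],
        dtype=np.int64,
    )
    coo = proba.tocoo()
    keep = col_map[coo.col] >= 0  # drop unknown concepts
    return csr_matrix(
        (coo.data[keep], (coo.row[keep], col_map[coo.col[keep]])),
        shape=(proba.shape[0], n_subjects),
    )
```

The remapping itself is just one pass over the nonzeros, so the main question would be where to build and cache the column map so it is not recomputed for every batch.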

osma commented 1 year ago

I'm too lazy to make a table, but here are the main test results. I'm evaluating a YSO STWFSA English model on jyu-theses/eng-test on my 4-core laptop.

Before (master)

1 job

User time (seconds): 201.96
Elapsed (wall clock) time (h:mm:ss or m:ss): 3:23.56

4 jobs

User time (seconds): 288.02
Elapsed (wall clock) time (h:mm:ss or m:ss): 2:19.72

After (issue666-suggest-batch-stwfsa branch)

1 job

User time (seconds): 181.12
Elapsed (wall clock) time (h:mm:ss or m:ss): 3:02.69

4 jobs

User time (seconds): 322.29
Elapsed (wall clock) time (h:mm:ss or m:ss): 2:27.98

Summary

Evaluation was faster when using just 1 job, but slower with 4 jobs. I didn't include memory usage, but it was basically unchanged.