NatLibFi / Annif

Annif is a multi-algorithm automated subject indexing tool for libraries, archives and museums.
https://annif.org

Add XTransformer backend #716

Open juhoinkinen opened 1 year ago

juhoinkinen commented 1 year ago

The code in this PR is by @mo-fu and was originally submitted as PR https://github.com/NatLibFi/Annif/pull/540. That PR was accidentally closed and could not be reopened, which is why this new PR had to be opened for the XTransformer backend. (This PR starts from the point in the git history just before the unsuccessful commits that attempted to make the original PR reopenable.)

The description of the original PR is below.


This PR adds XTransformer as an optional backend to Annif. For now, it does not use DistilBERT in the default configuration, as this is not yet available on PyPI.

The tests for the backend resort to mocking, as training would download a pretrained model of at least 500 MB. We should also discuss cache directories. At the moment, XTransformer downloads models from the Hugging Face Hub to ~/.cache/huggingface. Is this behavior desired for Annif, or should the cache be placed in the data folder?
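To make the cache question concrete, here is a minimal sketch of how the Hugging Face cache could be redirected into a data folder. It is not part of this PR. It assumes the `HF_HOME` environment variable, which `huggingface_hub` reads to locate its cache; the function name `configure_hf_cache` and the `huggingface` subdirectory are hypothetical and chosen only for illustration.

```python
import os
from pathlib import Path


def configure_hf_cache(data_dir: str) -> Path:
    """Point the Hugging Face cache at a subdirectory of data_dir.

    Hypothetical helper: the name and the "huggingface" subdirectory
    are assumptions for illustration, not Annif or pecos API.
    """
    cache_dir = Path(data_dir) / "huggingface"
    cache_dir.mkdir(parents=True, exist_ok=True)
    # HF_HOME is read when huggingface_hub is imported, so this must run
    # before any library that pulls in huggingface_hub is imported.
    os.environ["HF_HOME"] = str(cache_dir)
    return cache_dir
```

Such a helper would have to be called before the backend imports anything that loads `huggingface_hub`, since the variable is evaluated at import time. Leaving the default ~/.cache/huggingface location has the advantage that models are shared between projects and users' existing downloads are reused.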

I also haven't modified the Docker container yet. When I installed pecos in a venv, it required BLAS libraries, so these would probably have to be added to the container. Additionally, pecos installs the GPU-enabled PyTorch, which means the container size will grow. Therefore, I wanted to check with you first before adding it.

sonarcloud[bot] commented 1 year ago

Kudos, SonarCloud Quality Gate passed!

Bugs: 0 (rating A)
Vulnerabilities: 0 (rating A)
Security Hotspots: 0 (rating A)
Code Smells: 0 (rating A)

No coverage information
No duplication information