NatLibFi / Annif

Annif is a multi-algorithm automated subject indexing tool for libraries, archives and museums.
https://annif.org

Hugging Face integration #760

Closed juhoinkinen closed 2 months ago

juhoinkinen commented 5 months ago

The 🤗 Hugging Face Hub is intended to facilitate hosting and sharing AI models and datasets (as well as demo applications), and NatLibFi now also has an organization account on the Hugging Face Hub.

The data (models and datasets) in the HF Hub live in git repositories, and git can be used to handle the data (commit, push, pull, ...). However, applications can also integrate directly with the HF Hub using the huggingface_hub Python library, which can also be used as a CLI tool.

Annif could have the functionality to push (and pull) projects or project sets to (and from) the HF Hub. It should be able to operate on project sets, both because an ensemble project also requires its base projects to be available and because it is more convenient.

There could be the following CLI command to push a set of projects to HF Hub:

annif upload-projects <glob-pattern> <username/reponame> [--options]

For example

annif upload-projects yso-*fi NatLibFi/FintoAI-data-YSO

would upload the specified projects to the NatLibFi/FintoAI-data-YSO repository.
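To make the idea concrete, here is a minimal sketch of how the glob pattern could select projects and how a bundle could then be pushed with huggingface_hub. The function names `select_projects` and `upload_bundle` and the project IDs other than those matching the example above are hypothetical, not existing Annif code; only `HfApi.upload_file` comes from the huggingface_hub library. The upload itself needs network access and a valid token, so it is not called here.

```python
from fnmatch import fnmatchcase


def select_projects(project_ids, pattern):
    """Return the project IDs that match a shell-style glob pattern."""
    return sorted(pid for pid in project_ids if fnmatchcase(pid, pattern))


def upload_bundle(local_path, repo_id, path_in_repo):
    """Upload one local file to a Hub repository (needs network and a token)."""
    # Imported lazily so the selection logic works without the library installed.
    from huggingface_hub import HfApi

    HfApi().upload_file(
        path_or_fileobj=local_path,
        path_in_repo=path_in_repo,
        repo_id=repo_id,
    )


# Hypothetical list of configured project IDs, for illustration only.
configured = ["yso-fi", "yso-mllm-fi", "yso-en", "kauno-fi"]
selected = select_projects(configured, "yso-*fi")
```

Note that `yso-*fi` also matches plain `yso-fi`, because `*` can match an empty string.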

The files and dirs needed to be uploaded are

Options for bundling and uploading

1. Single file

Bundle all files into one zip, e.g. yso-fi.zip (possibly including only the configs of the selected projects), and upload it to the root of the repo.

The filename could be derived from the glob pattern of the projects, or it could be a required argument for the upload command (as the 2nd argument, to be added to the above example).

This option would be easiest for downloads: just wget one file and unzip.
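A rough sketch of the single-file option, using only the standard library. It assumes the usual layout of a data directory with `projects/<project-id>` and `vocabs/<vocab-id>` subdirectories plus a separate config file; the function name `bundle_single_zip` is hypothetical. For simplicity the config file is copied whole here rather than filtered to the selected projects.

```python
import zipfile
from pathlib import Path


def bundle_single_zip(datadir, project_ids, vocab_ids, config_file, zip_path):
    """Write selected project dirs, vocab dirs and the config into one zip."""
    datadir = Path(datadir)
    sources = [datadir / "projects" / pid for pid in project_ids]
    sources += [datadir / "vocabs" / vid for vid in vocab_ids]
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for source in sources:
            for path in sorted(source.rglob("*")):
                if path.is_file():
                    # Keep paths relative to the parent of the data directory,
                    # so unzipping recreates e.g. data/projects/<project-id>/.
                    arcname = path.relative_to(datadir.parent).as_posix()
                    archive.write(path, arcname)
        # The config file goes to the top level of the archive.
        archive.write(config_file, Path(config_file).name)
    return zip_path
```

Unzipping such an archive in an empty working directory would recreate the data directory and the config file side by side.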

2. One file for projects and vocab, and one for projects configs

Bundle the projects and vocabulary directories into one zip and leave the projects config file uncompressed.

3. One file for projects, one for vocab, and one for projects configs

Bundle the selected projects into one zip (yso-fi.zip) and the vocabularies into another (yso.zip), and leave the projects config file uncompressed. Upload the projects zip to the data/projects directory and the vocab zip to data/vocabs.

4. Separate files for each project, vocab, and projects configs

Compress each project directory into its own zip (<project-id>.zip).

With this option, downloading the projects would need a pattern-based fetch, e.g. wget --accept yso*-fi.zip.
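The per-project option could be sketched as follows, again with the standard library only. The function name `zip_each_project` is hypothetical, and the layout (one subdirectory per project ID under a projects directory) is an assumption.

```python
import shutil
from pathlib import Path


def zip_each_project(projects_dir, project_ids, out_dir):
    """Create <project-id>.zip in out_dir for every selected project."""
    out_dir = Path(out_dir).resolve()
    out_dir.mkdir(parents=True, exist_ok=True)
    archives = []
    for pid in project_ids:
        # base_dir=pid keeps the project ID as the top-level folder in the zip.
        archive = shutil.make_archive(
            str(out_dir / pid), "zip", root_dir=str(projects_dir), base_dir=pid
        )
        archives.append(Path(archive))
    return archives
```

Each archive can then be uploaded and downloaded independently, at the cost of more files to track in the repository.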


Some details and ideas:

Downloading projects

We could also implement a feature to fetch projects from the HF Hub, for example:

annif download-project <username/reponame> <projects-set-file> [--options]

But this is probably best implemented only after the upload functionality; downloading from the HF Hub can also be done simply with wget or curl. However, if the download function is known to be coming, the hierarchy and structure of the data files in the repo should be designed with it in mind.
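As a sketch of what a download could look like with huggingface_hub instead of wget: list the files in the repo, keep those matching a glob pattern, and fetch each one. The helper names `matching_repo_files` and `download_matching` and the example file names are hypothetical; `HfApi.list_repo_files` and `hf_hub_download` are huggingface_hub functions. The download part needs network access and is not called here.

```python
from fnmatch import fnmatchcase


def matching_repo_files(repo_files, pattern):
    """Keep the repo file paths that match a shell-style glob pattern."""
    return [name for name in repo_files if fnmatchcase(name, pattern)]


def download_matching(repo_id, pattern, local_dir):
    """Download every matching file from a Hub repo into local_dir."""
    # Imported lazily so the filtering logic works without the library installed.
    from huggingface_hub import HfApi, hf_hub_download

    repo_files = HfApi().list_repo_files(repo_id)
    return [
        hf_hub_download(repo_id=repo_id, filename=name, local_dir=local_dir)
        for name in matching_repo_files(repo_files, pattern)
    ]


# Hypothetical repo contents, following the data/projects and data/vocabs layout.
example_files = [
    "projects.cfg",
    "data/projects/yso-fi.zip",
    "data/projects/yso-mllm-fi.zip",
    "data/projects/yso-en.zip",
    "data/vocabs/yso.zip",
]
wanted = matching_repo_files(example_files, "data/projects/yso*-fi.zip")
```

A predictable directory layout in the repo keeps such patterns simple, which is one reason to settle the layout before implementing the upload.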

davanstrien commented 5 months ago

Very excited to see this! Feel free to ping me if you need any support with anything on the HF side :)