NatLibFi / Annif

Annif is a multi-algorithm automated subject indexing tool for libraries, archives and museums.
https://annif.org
Other
188 stars 41 forks source link

Implement `annif upload` and `annif download` commands for Hugging Face Hub integration #762

Closed juhoinkinen closed 2 months ago

juhoinkinen commented 5 months ago

The organization of the uploaded files follows the 4. option of issue #760:

  1. Separate files for each project, vocab, and projects configs

Compress each project directory into its own zip file (in projects/<project-id>.zip in the repo).

Each vocabularies zip is placed in vocabs/<vocab-id>.zip, and each project configuration is placed in <project-id>.cfg in the repo root.

This option is good for caching by preventing unnecessary uploads/downloads of the projects bundle when only one is changed, and for visibility of the repo contents.

The unzip after download places the data directories directly to data/projects/ and data/vocabs/, and the project configurations as separate <project-id>.cfg files in projects.d/ directory, so after download the projects are directly usable (if projects.{cfg,toml} does not exists or by using the --projects option).

Upload

Push a set of selected projects and vocabularies to a Hugging Face Hub repository as zip files and the configurations of the projects as cfg/ini files. For example, the following command uploads all projects with ids matching *-fi to juhoinkinen/Annif-models-upload-testing:

annif upload-projects "*-fi" juhoinkinen/Annif-models-upload-testing --commit-message "[OptMsg] Upload all Finnish projects"

Download

Downloads the project and vocabulary archives and the configuration files of the projects that match the given and unzip the archives to data/ directory and places the configuration files to projects.d/ directory:

annif download-projects "yake-fi" juhoinkinen/Annif-models-upload-testing

~Note that currently the download will overwrite the project and vocab dirs if they already exist.~ Edit: By default overwrite does not happen, it can be performed by adding the --force option to the command.

A git revision (commit hash, branch etc.) can be specified with --revision option.

The downloaded files remain in the huggingface_hub client cache dir.


For all uploads and downloads in the case of private repos the user needs to have logged in with huggingface-cli login, or the HF Hub token can be given also with the --token option of this Annif command.

juhoinkinen commented 5 months ago

The current approach follows the option 3 of issue #760:

3. One file for projects, one for vocab, and one for projects configs

Bundle the selected projects into one zip (yso-fi.zip) and vocabularies into another (yso.zip) and leave projects config file uncompressed. Upload the projects zip to data/projects directory and the vocab zip to data/vocabs.

But the files are uploaded just to the root of the repo with names projects.zip, vocabs.zip and projects.cfg.

One option for naming the files in the repo would be that the current (above names) is the default, but the name-in-repo-id could be given as an optional (3rd) parameter, -yso-fi in the following:

annif upload-projects *-fi juhoinkinen/Annif-models-upload-testing -yso-fi

This would create projects-yso-fi.zip, vocabs-yso-fi.zip, and projects-yso-fi.cfg in the repo. This (optionally renaming uploaded file and parameters order) follows the behaviour of huggingface-cli upload (although there the repoid is the 1st and local_path is 2nd argument; should that order be used in Annif command too...?).

Edit: Or make a directory named upload in the repo, optionally with a custom name given as the 3rd parameter, and upload projects.zip, vocabs.zip and projects.cfg there? Or restrict each repo to be used only for one set of projects, i.e. create repos FintoAI-data-YSO-fi, FintoAI-data-YSO-sv, FintoAI-data-YSO-en?


Now the project configs of the selected projects are written to a cfg file in INI format, whatever the source format is (toml, ini, directory). This is because detecting the original format (or even the path to the projects config file!) did not seem very easy, and also because the standard library has no functionality for writing TOML.

codecov[bot] commented 5 months ago

Codecov Report

Attention: Patch coverage is 99.69697% with 1 lines in your changes are missing coverage. Please review.

Project coverage is 99.64%. Comparing base (df20982) to head (6f35fff). Report is 47 commits behind head on main.

Files Patch % Lines
annif/hfh_util.py 99.30% 1 Missing :warning:
Additional details and impacted files ```diff @@ Coverage Diff @@ ## main #762 +/- ## ========================================== - Coverage 99.65% 99.64% -0.02% ========================================== Files 89 91 +2 Lines 6404 6768 +364 ========================================== + Hits 6382 6744 +362 - Misses 22 24 +2 ```

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

juhoinkinen commented 5 months ago

TODO:

The downloaded files now remain in cache for huggingface client, e.g. (/home/local/jmminkin/.cache/huggingface/hub/models--juhoinkinen--Annif-models-upload-testing), should the cache be cleared when successfully unzipped the datadirs and moved the configs?

Could or should these commands be flagged as experimental for now, to allow changing them?

osma commented 4 months ago

I took a look at this (sorry for the delay!) and I think this is a really good feature. Some thoughts:

  1. Agree that these commands should be marked as experimental for now. We may want to change the command options, functionality etc. Maybe in the 1.2 release the experimental status could be dropped.
  2. The commands are uploading/downloading not just the project(s), but also the vocabulary(/ies). Would it make sense to shorten the commands to just upload and download?
  3. Regarding overwriting the vocabulary on download, I think this is a problematic feature and needs some thought. For example if I already have trained models with vocabularies in my data directory, and I download another model from HF Hub, it might come with another version of the same vocabulary and thus rewrite my existing vocabulary and break existing projects. So I agree that there should be some kind of check for this and perhaps an --overwrite or --force option before blindly overwriting existing vocabularies.

I tested this on my local install by running the command annif download-projects "yake-fi" juhoinkinen/Annif-models-upload-testing; I wanted to see what happens when I already have a local yso vocabulary. The overwrite failed with an exception:

$ annif download-projects "yake-fi" juhoinkinen/Annif-models-upload-testing
Downding project(s): yake-fi
yake-fi.zip: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2.63M/2.63M [00:00<00:00, 7.24MB/s]
yake-fi.cfg: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 119/119 [00:00<00:00, 470kB/s]
yso.zip: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 54.6M/54.6M [00:04<00:00, 11.0MB/s]
Traceback (most recent call last):
  File "/home/xxx/.cache/pypoetry/virtualenvs/annif-3KgT0m3n-py3.10/bin/annif", line 6, in <module>
    sys.exit(cli())
  File "/home/xxx/.cache/pypoetry/virtualenvs/annif-3KgT0m3n-py3.10/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/home/xxx/.cache/pypoetry/virtualenvs/annif-3KgT0m3n-py3.10/lib/python3.10/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/home/xxx/.cache/pypoetry/virtualenvs/annif-3KgT0m3n-py3.10/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/xxx/.cache/pypoetry/virtualenvs/annif-3KgT0m3n-py3.10/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/xxx/.cache/pypoetry/virtualenvs/annif-3KgT0m3n-py3.10/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/home/xxx/.cache/pypoetry/virtualenvs/annif-3KgT0m3n-py3.10/lib/python3.10/site-packages/click/decorators.py", line 33, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/home/xxx/.cache/pypoetry/virtualenvs/annif-3KgT0m3n-py3.10/lib/python3.10/site-packages/flask/cli.py", line 357, in decorator
    return __ctx.invoke(f, *args, **kwargs)
  File "/home/xxx/.cache/pypoetry/virtualenvs/annif-3KgT0m3n-py3.10/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/home/xxx/git/Annif/annif/cli.py", line 690, in run_download_projects
    cli_util.unzip(vocab_zip_local_cache_path)
  File "/home/xxx/git/Annif/annif/cli_util.py", line 316, in unzip
    zfile.extractall()  # TODO Disallow overwrite
  File "/usr/lib/python3.10/zipfile.py", line 1647, in extractall
    self._extract_member(zipinfo, path, pwd)
  File "/usr/lib/python3.10/zipfile.py", line 1701, in _extract_member
    open(targetpath, "wb") as target:
PermissionError: [Errno 13] Permission denied: '/home/xxx/git/Annif/data/vocabs/yso/subjects.dump.gz'

The downloaded files now remain in cache for huggingface client, e.g. (/home/local/jmminkin/.cache/huggingface/hub/models--juhoinkinen--Annif-models-upload-testing), should the cache be cleared when successfully unzipped the datadirs and moved the configs?

Isn't the caching standard behaviour of HF Hub operations? I think just keeping the cache would be the least surprising thing to do here. Maybe there could be a --no-cache or --no-keep option to control caching? What do other tools with HF Hub integration (e.g. Ludwig) do?

osma commented 4 months ago

About overwriting vocabularies - I think it would be great if Annif would notice, that the downloaded vocabulary differs from what exists locally. It should be no problem to download two different models that use the same vocabulary:

annif download-projects "yake-fi" juhoinkinen/Annif-models-upload-testing
annif download-projects "yso-fasttext-fi" juhoinkinen/Annif-models-upload-testing
annif download-projects "yso-tfidf-fi" juhoinkinen/Annif-models-upload-testing

But when you have a different version of the vocabulary already present, Annif could show a message such as

Your local version of the "yso" vocabulary differs from the version you are downloading. If you are sure you want to overwrite it, run the command again with the `--overwrite` option. This may break projects that use your current vocabulary.

Even better would be to list the projects that may break.

But how to detect that the local vocabulary is different than the downloaded one? Calculate checksums on the files and compare them?

juhoinkinen commented 4 months ago

But how to detect that the local vocabulary is different than the downloaded one? Calculate checksums on the files and compare them?

Actually the ZipFile objects allow getting ZipInfo of each member of the archive, which in turn contain a CRC-32 checksum (and the name and modification timestamp) of the uncomprossed file. And ZipInfo.from_file() class method allows to get the CRC from the existing, uncompressed files, so the checksum comparisons should be quite easy to implement.

Edit: But CRC of the objects constructed by ZipInfo.from_file() throw AttributeError: 'ZipInfo' object has no attribute 'CRC'. In the compressed archive CRC attribute exists. :confused:

This bug?

osma commented 4 months ago

If it's the above bug in zipfile, it's claimed to be

Fixed in 3.9.0

Annif just dropped 3.8 support so maybe we're good? Just need to rebase this branch.

juhoinkinen commented 4 months ago

The Python versions 3.9.18, 3.10.13 and 3.11.8 all behaved the same.

This seems strange: the code

zinfo = zipfile.ZipInfo.from_file("data/projects/yake-fi/yake-index")
print(dir(zinfo))
print(zinfo.file_size)
print(zinfo.CRC)

outputs

['CRC', 'FileHeader', '__class__', '__delattr__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getstate__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__slots__', '__str__', '__subclasshook__', '_compresslevel', '_decodeExtra', '_encodeFilenameFlags', '_end_offset', '_raw_time', 'comment', 'compress_size', 'compress_type', 'create_system', 'create_version', 'date_time', 'external_attr', 'extra', 'extract_version', 'file_size', 'filename', 'flag_bits', 'from_file', 'header_offset', 'internal_attr', 'is_dir', 'orig_filename', 'reserved', 'volume']
2627360
Traceback (most recent call last):
  File "/home/local/myusername/.cache/pypoetry/virtualenvs/annif-ul-EXdhi-py3.11/bin/annif", line 6, in <module>
    sys.exit(cli())
             ^^^^^
  File "/home/myusername/.cache/pypoetry/virtualenvs/annif-ul-EXdhi-py3.11/lib/python3.11/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/myusername/.cache/pypoetry/virtualenvs/annif-ul-EXdhi-py3.11/lib/python3.11/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
         ^^^^^^^^^^^^^^^^
  File "/home/myusername/.cache/pypoetry/virtualenvs/annif-ul-EXdhi-py3.11/lib/python3.11/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/myusername/.cache/pypoetry/virtualenvs/annif-ul-EXdhi-py3.11/lib/python3.11/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/myusername/.cache/pypoetry/virtualenvs/annif-ul-EXdhi-py3.11/lib/python3.11/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/myusername/.cache/pypoetry/virtualenvs/annif-ul-EXdhi-py3.11/lib/python3.11/site-packages/click/decorators.py", line 33, in new_func
    return f(get_current_context(), *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/myusername/.cache/pypoetry/virtualenvs/annif-ul-EXdhi-py3.11/lib/python3.11/site-packages/flask/cli.py", line 357, in decorator
    return __ctx.invoke(f, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/myusername/.cache/pypoetry/virtualenvs/annif-ul-EXdhi-py3.11/lib/python3.11/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/local/myusername/git/Annif/annif/cli.py", line 686, in run_download
    cli_util.unzip(project_zip_local_cache_path, force)
  File "/home/local/myusername/git/Annif/annif/cli_util.py", line 325, in unzip
    _is_existing(member)
  File "/home/local/myusername/git/Annif/annif/cli_util.py", line 351, in _is_existing
    print(zinfo.CRC)
          ^^^^^^^^^
AttributeError: 'ZipInfo' object has no attribute 'CRC'

The CRC attribute is created using ZipInfo.__slots__(), but then it is not initialized when running ZipInfo.from_file()?

Also trying to access zinfo.FileHeader() fails similarly, with Traceback ending

...
File "/usr/lib/python3.11/zipfile.py", line 453, in FileHeader
    CRC = self.CRC
          ^^^^^^^^
AttributeError: 'ZipInfo' object has no attribute 'CRC'

Anyway, the CRC-32 checksums for the existing, uncompressed files could be calculated by binascii.crc32().

juhoinkinen commented 4 months ago

I added --force option to overwrite existing local files (project and vocabulary data files and configs), and modified the unzip/config-copy logic so that when the contents are identical (downloaded vs local), there the unzip/config-copy is skipped.

I did not see a reason to restrict the content comparison to only vocabularies(?) (and it was easier to implement both projects and vocabs :slightly_smiling_face:).

Demo with

#!/bin/bash

rm -r data/projects/{fasttext-fi,tfidf-fi,yake-fi}

echo "# Initial download"
annif download "tfidf-fi" juhoinkinen/Annif-models-upload-testing
echo "# Second download with identical content, no-op"
annif download "tfidf-fi" juhoinkinen/Annif-models-upload-testing

echo $RANDOM > data/projects/tfidf-fi/file.txt
echo $RANDOM >> projects.d/tfidf-fi.cfg
echo "# Third download over changed content, skips file overwrites with complains"
annif download "tfidf-fi" juhoinkinen/Annif-models-upload-testing

echo "# Forced third download over changed content"
annif download "tfidf-fi" juhoinkinen/Annif-models-upload-testing --force
tree data

The output:

# Initial download
Downloading project(s): tfidf-fi
# Second download with identical content, no-op
Downloading project(s): tfidf-fi
# Third download over changed content, skips file overwrites with complains
Downloading project(s): tfidf-fi
Not overwriting data/projects/tfidf-fi/file.txt (use --force to override)
Not overwriting projects.d/tfidf-fi.cfg (use --force to override)
# Forced third download over changed content
Downloading project(s): tfidf-fi
data
├── projects
│   └── tfidf-fi
│       └── file.txt
└── vocabs
    └── yso
        ├── subjects.csv
        ├── subjects.dump.gz
        └── subjects.ttl

4 directories, 4 files
juhoinkinen commented 4 months ago

Force-pushed to squash multiple refactoring and renaming commits.

juhoinkinen commented 3 months ago

I noticed that after

annif download yso-bonsai-en juhoinkinen/Annif-models-upload-testing

the modification time of omikuji projects as detected by Annif is wrong; it is shown to be the extraction time:

annif show-project yso-bonsai-en 
...
Modification time: 2024-03-13 11:36:22

This is because of the timetamp of the data/projects/yso-bonsai-en/omikuji-model which is set to the time when the files it contains get extracted.

This PR implements restoration of the timestamps of the extracted files and directories, but this happens zipmember-by-zipmember, and typically the project subdirectories are extracted first (yso-bonsai-en/omikuji-model/).

I think the modification time detection should be changed to disregard subdirectories, and not to try handle this issue in this PR.

juhoinkinen commented 3 months ago

Now I see that the --revision option of the download command is not so useful: each file gets uploaded separately, and so they are included in separate commits with different commit hashes (e.g. for the project config file /yso-mllm-en.cfg vs. the related data zip file projects/yso-mllm-en.zip).

So this happens if I try to use a revision with the commit hash of projects/yso-mllm-en.zip:

annif download yso-mllm-en NatLibFi/FintoAI-data-YSO --revision 0ff6f970347ff671470f7e33ed4ea1a949e02597
Downloading project(s): yso-mllm-en
Error: Operation failed: 404 Client Error. (Request ID: Root=1-65f19f92-22b20ed73b53cacb547eda8c;1c4d5379-92cf-41b6-baa0-51e2b11c837f)

Entry Not Found for url: https://huggingface.co/NatLibFi/FintoAI-data-YSO/resolve/0ff6f970347ff671470f7e33ed4ea1a949e02597/yso-mllm-en.cfg.

Using a branch name as the revision works (--revision main), but now there is no way to set it in the upload command. It is not possible to create a new branch when using HFApi.upload_file.

So at this time I'm in favor of dropping the --revision option from download command. OFC there is a way to create a branch with HFApi.create_branch, but I don't think it is not so needed at this time.

Edit: I already removed --revision option.

Any thoughts @osma?

juhoinkinen commented 3 months ago

I created a Wiki page about the HFH integration: https://github.com/NatLibFi/Annif/wiki/Hugging-Face-Hub-integration.

juhoinkinen commented 3 months ago

I think this is ready for review.

Some points to note and consider:

juhoinkinen commented 3 months ago

Now I see that the --revision option of the download command is not so useful: each file gets uploaded separately, and so they are included in separate commits with different commit hashes (e.g. for the project config file /yso-mllm-en.cfg vs. the related data zip file projects/yso-mllm-en.zip).

So this happens if I try to use a revision with the commit hash of projects/yso-mllm-en.zip:

annif download yso-mllm-en NatLibFi/FintoAI-data-YSO --revision 0ff6f970347ff671470f7e33ed4ea1a949e02597
Downloading project(s): yso-mllm-en
Error: Operation failed: 404 Client Error. (Request ID: Root=1-65f19f92-22b20ed73b53cacb547eda8c;1c4d5379-92cf-41b6-baa0-51e2b11c837f)

Entry Not Found for url: https://huggingface.co/NatLibFi/FintoAI-data-YSO/resolve/0ff6f970347ff671470f7e33ed4ea1a949e02597/yso-mllm-en.cfg.

Using a branch name as the revision works (--revision main), but now there is no way to set it in the upload command. It is not possible to create a new branch when using HFApi.upload_file.

So at this time I'm in favor of dropping the --revision option from download command. OFC there is a way to create a branch with HFApi.create_branch, but I don't think it is not so needed at this time.

Edit: I already removed --revision option.

Any thoughts @osma?

Actually multiple files could be included in one commit using CommitOperationAdd() and create_commit(), but it seems that for this all the included files should exist until running create_commit(), and the zips can be quite large.

Edit: Now when thinking this, the possibility to download a specific "published version" of the models would be very valuable and it is not possible with the current code. The --revision option should be restored, and maybe it would be enough to instruct to use it to target just branches or tags (which I assume works with the current upload_file() approach) and not to target commits.

Edit2: But there is still the question of to how to tag the files. There are methods for tagging, huggingface_hub.Repository.add_tag() and huggingface_hub.HfApi.create_tag (), but tagging is not possible with the huggingface-cli, which would have been the easiest option. And I don't see a way to create tags in Hugging Face Hub website either. So tagging should be implemented in Annif, if it is wanted to be possible to use tags.

juhoinkinen commented 3 months ago

I commented on huggingface-hub repository about git tags, and they promptly created an issue for creating, listing and deleting tags with the huggingface-cli (also wrote a question on HF forum).

If/when it gets implemented, I think it would be most convenient for the case of Annif models if the tagging would be performed after annif upload <projects-pattern> <repo-id> using huggingface-cli, e.g. like huggingface-cli tag <repo-id> <tag-name>.

juhoinkinen commented 3 months ago

Actually multiple files could be included in one commit using CommitOperationAdd() and create_commit(), but it seems that for this all the included files should exist until running create_commit(), and the zips can be quite large.

The preupload_lfs_files() function helped to put all files in one commit but to upload them separately.

When testing with yso-*fi projects of FintoAI (10.6 GBs uncompressed, 8.8 GBs compressed) /usr/bin/time -v reported Maximum resident set size (kbytes): 49616 (and I did not notice much increase in disk space usage); upload took about 20 min at the office.

juhoinkinen commented 3 months ago

While uploading projects to FintoAI-data-YSO with annif upload "*" NatLibFi/FintoAI-data-YSO, this error occurred on two tries:

...
  File "/home/local/jmminkin/git/Annif/annif/cli_util.py", line 259, in prepare_datadir_commit
    fobj = _archive_dir(data_dir)
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/local/jmminkin/git/Annif/annif/cli_util.py", line 285, in _archive_dir
    with zipfile.ZipFile(fp, mode="w") as zfile:
  File "/usr/lib/python3.11/zipfile.py", line 1355, in __exit__
    self.close()
  File "/usr/lib/python3.11/zipfile.py", line 1911, in close
    self.fp.seek(self.start_dir)
OSError: [Errno 28] No space left on device

However I had about 60 GBs of free disk space. And preuploading should free memory of the uploaded objects, so it should not be memory issue either... Anyway, this could be circumvented by uploading only projects of one language by one command.

sonarcloud[bot] commented 2 months ago

Quality Gate Passed Quality Gate passed

Issues
0 New issues
2 Accepted issues

Measures
0 Security Hotspots
No data about Coverage
0.0% Duplication on New Code

See analysis details on SonarCloud

osma commented 2 months ago

:tada: