Closed: juhoinkinen closed this 2 months ago
The current approach follows option 3 of issue #760:

> 3. One file for projects, one for vocabs, and one for projects configs
> Bundle the selected projects into one zip (yso-fi.zip) and the vocabularies into another (yso.zip), and leave the projects config file uncompressed. Upload the projects zip to the data/projects directory and the vocab zip to data/vocabs.
But the files are uploaded just to the root of the repo with the names `projects.zip`, `vocabs.zip` and `projects.cfg`.
One option for naming the files in the repo would be to keep the current names (above) as the default, but allow an in-repo name id to be given as an optional (3rd) parameter, `-yso-fi` in the following:

annif upload-projects *-fi juhoinkinen/Annif-models-upload-testing -yso-fi

This would create `projects-yso-fi.zip`, `vocabs-yso-fi.zip`, and `projects-yso-fi.cfg` in the repo. This (optionally renaming the uploaded files, and the parameter order) follows the behaviour of `huggingface-cli upload` (although there the repo-id is the 1st argument and local_path the 2nd; should that order be used in the Annif command too...?).
Edit: Or make a directory named `upload` in the repo, optionally with a custom name given as the 3rd parameter, and upload `projects.zip`, `vocabs.zip` and `projects.cfg` there? Or restrict each repo to be used only for one set of projects, i.e. create repos FintoAI-data-YSO-fi, FintoAI-data-YSO-sv, FintoAI-data-YSO-en?
Now the project configs of the selected projects are written to a cfg file in INI format, whatever the source format is (TOML, INI, directory). This is because detecting the original format (or even the path to the projects config file!) did not seem very easy, and also because the standard library has no functionality for writing TOML.
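For illustration, the INI writing with the standard library could look roughly like this (the project ids and options below are made up, not the PR's actual code):

```python
import configparser
import io

# Hypothetical project configurations collected from the source
# (TOML, INI or directory), keyed by project id.
projects = {
    "yake-fi": {"name": "YAKE Finnish", "language": "fi", "backend": "yake"},
    "tfidf-fi": {"name": "TF-IDF Finnish", "language": "fi", "backend": "tfidf"},
}

# configparser can only write INI, which is why the output is always cfg/ini
config = configparser.ConfigParser()
for project_id, options in projects.items():
    config[project_id] = options

buffer = io.StringIO()
config.write(buffer)
print(buffer.getvalue())
```

Reading TOML would still be possible with the stdlib (`tomllib`, Python 3.11+), but there is no stdlib writer, which matches the reasoning above.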
Attention: Patch coverage is 99.69697% with 1 line in your changes missing coverage. Please review.

Project coverage is 99.64%. Comparing base (df20982) to head (6f35fff). Report is 47 commits behind head on main.
Files | Patch % | Lines |
---|---|---|
annif/hfh_util.py | 99.30% | 1 Missing :warning: |
TODO:

- [ ] `--force` option to allow overwrite of existing files when downloading; prevent overwrite otherwise
- [ ] Mock `huggingface_hub` methods to assert their calls, e.g. `@mock.patch("huggingface_hub.commands.upload.HfApi.upload_file")`
- [ ] Move `huggingface_hub` imports to functions
- [x] Add upload and download commands to CLI documentation at readthedocs (or check they get automatically included)
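The mocking item can be done with `unittest.mock` from the standard library. A self-contained sketch (the `push_zip` helper and its keyword arguments are invented for illustration; in Annif the patch target would be the real method, as in the `@mock.patch` decorator above):

```python
from unittest import mock

def push_zip(api, path):
    """Hypothetical helper that delegates the upload to an HfApi-like client."""
    api.upload_file(path_or_fileobj=path, path_in_repo=path)

# In a test, replace the client with a Mock and assert the call was made
# with the expected arguments, without touching the network:
api = mock.Mock()
push_zip(api, "projects.zip")
api.upload_file.assert_called_once_with(
    path_or_fileobj="projects.zip", path_in_repo="projects.zip"
)
```

With `@mock.patch("huggingface_hub...HfApi.upload_file")` the same assertion style works on the patched method object passed into the test function.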
@osma could you take a look at the working principles of the commands before I spend more time with this?
The downloaded files now remain in the huggingface client cache, e.g. `/home/local/jmminkin/.cache/huggingface/hub/models--juhoinkinen--Annif-models-upload-testing`; should the cache be cleared when the datadirs have been successfully unzipped and the configs moved?
Could or should these commands be flagged as experimental for now, to allow changing them?
I took a look at this (sorry for the delay!) and I think this is a really good feature. Some thoughts:

- Could the commands be named just `upload` and `download`?
- There should be an `--overwrite` or `--force` option before blindly overwriting existing vocabularies.

I tested this on my local install by running the command `annif download-projects "yake-fi" juhoinkinen/Annif-models-upload-testing`; I wanted to see what happens when I already have a local `yso` vocabulary. The overwrite failed with an exception:
$ annif download-projects "yake-fi" juhoinkinen/Annif-models-upload-testing
Downloading project(s): yake-fi
yake-fi.zip: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2.63M/2.63M [00:00<00:00, 7.24MB/s]
yake-fi.cfg: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 119/119 [00:00<00:00, 470kB/s]
yso.zip: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 54.6M/54.6M [00:04<00:00, 11.0MB/s]
Traceback (most recent call last):
File "/home/xxx/.cache/pypoetry/virtualenvs/annif-3KgT0m3n-py3.10/bin/annif", line 6, in <module>
sys.exit(cli())
File "/home/xxx/.cache/pypoetry/virtualenvs/annif-3KgT0m3n-py3.10/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
return self.main(*args, **kwargs)
File "/home/xxx/.cache/pypoetry/virtualenvs/annif-3KgT0m3n-py3.10/lib/python3.10/site-packages/click/core.py", line 1078, in main
rv = self.invoke(ctx)
File "/home/xxx/.cache/pypoetry/virtualenvs/annif-3KgT0m3n-py3.10/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/home/xxx/.cache/pypoetry/virtualenvs/annif-3KgT0m3n-py3.10/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home/xxx/.cache/pypoetry/virtualenvs/annif-3KgT0m3n-py3.10/lib/python3.10/site-packages/click/core.py", line 783, in invoke
return __callback(*args, **kwargs)
File "/home/xxx/.cache/pypoetry/virtualenvs/annif-3KgT0m3n-py3.10/lib/python3.10/site-packages/click/decorators.py", line 33, in new_func
return f(get_current_context(), *args, **kwargs)
File "/home/xxx/.cache/pypoetry/virtualenvs/annif-3KgT0m3n-py3.10/lib/python3.10/site-packages/flask/cli.py", line 357, in decorator
return __ctx.invoke(f, *args, **kwargs)
File "/home/xxx/.cache/pypoetry/virtualenvs/annif-3KgT0m3n-py3.10/lib/python3.10/site-packages/click/core.py", line 783, in invoke
return __callback(*args, **kwargs)
File "/home/xxx/git/Annif/annif/cli.py", line 690, in run_download_projects
cli_util.unzip(vocab_zip_local_cache_path)
File "/home/xxx/git/Annif/annif/cli_util.py", line 316, in unzip
zfile.extractall() # TODO Disallow overwrite
File "/usr/lib/python3.10/zipfile.py", line 1647, in extractall
self._extract_member(zipinfo, path, pwd)
File "/usr/lib/python3.10/zipfile.py", line 1701, in _extract_member
open(targetpath, "wb") as target:
PermissionError: [Errno 13] Permission denied: '/home/xxx/git/Annif/data/vocabs/yso/subjects.dump.gz'
> The downloaded files now remain in the huggingface client cache, e.g. (/home/local/jmminkin/.cache/huggingface/hub/models--juhoinkinen--Annif-models-upload-testing), should the cache be cleared when successfully unzipped the datadirs and moved the configs?
Isn't the caching standard behaviour of HF Hub operations? I think just keeping the cache would be the least surprising thing to do here. Maybe there could be a `--no-cache` or `--no-keep` option to control caching? What do other tools with HF Hub integration (e.g. Ludwig) do?
About overwriting vocabularies - I think it would be great if Annif would notice that the downloaded vocabulary differs from what exists locally. It should be no problem to download two different models that use the same vocabulary:
annif download-projects "yake-fi" juhoinkinen/Annif-models-upload-testing
annif download-projects "yso-fasttext-fi" juhoinkinen/Annif-models-upload-testing
annif download-projects "yso-tfidf-fi" juhoinkinen/Annif-models-upload-testing
But when you have a different version of the vocabulary already present, Annif could show a message such as
Your local version of the "yso" vocabulary differs from the version you are downloading. If you are sure you want to overwrite it, run the command again with the `--overwrite` option. This may break projects that use your current vocabulary.
Even better would be to list the projects that may break.
But how to detect that the local vocabulary is different than the downloaded one? Calculate checksums on the files and compare them?
> But how to detect that the local vocabulary is different than the downloaded one? Calculate checksums on the files and compare them?
Actually the `ZipFile` objects allow getting the `ZipInfo` of each member of the archive, which in turn contains a CRC-32 checksum (and the name and modification timestamp) of the uncompressed file. And the `ZipInfo.from_file()` class method allows getting the CRC from the existing, uncompressed files, so the checksum comparisons should be quite easy to implement.
Edit: But accessing the CRC of the objects constructed by `ZipInfo.from_file()` throws `AttributeError: 'ZipInfo' object has no attribute 'CRC'`. In the compressed archive the CRC attribute exists. :confused:
If it's the above bug in zipfile, it's claimed to be "Fixed in 3.9.0". Annif just dropped 3.8 support so maybe we're good? Just need to rebase this branch.
The Python versions 3.9.18, 3.10.13 and 3.11.8 all behaved the same.
This seems strange: the code
zinfo = zipfile.ZipInfo.from_file("data/projects/yake-fi/yake-index")
print(dir(zinfo))
print(zinfo.file_size)
print(zinfo.CRC)
outputs
['CRC', 'FileHeader', '__class__', '__delattr__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getstate__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__slots__', '__str__', '__subclasshook__', '_compresslevel', '_decodeExtra', '_encodeFilenameFlags', '_end_offset', '_raw_time', 'comment', 'compress_size', 'compress_type', 'create_system', 'create_version', 'date_time', 'external_attr', 'extra', 'extract_version', 'file_size', 'filename', 'flag_bits', 'from_file', 'header_offset', 'internal_attr', 'is_dir', 'orig_filename', 'reserved', 'volume']
2627360
Traceback (most recent call last):
File "/home/local/myusername/.cache/pypoetry/virtualenvs/annif-ul-EXdhi-py3.11/bin/annif", line 6, in <module>
sys.exit(cli())
^^^^^
File "/home/myusername/.cache/pypoetry/virtualenvs/annif-ul-EXdhi-py3.11/lib/python3.11/site-packages/click/core.py", line 1157, in __call__
return self.main(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/myusername/.cache/pypoetry/virtualenvs/annif-ul-EXdhi-py3.11/lib/python3.11/site-packages/click/core.py", line 1078, in main
rv = self.invoke(ctx)
^^^^^^^^^^^^^^^^
File "/home/myusername/.cache/pypoetry/virtualenvs/annif-ul-EXdhi-py3.11/lib/python3.11/site-packages/click/core.py", line 1688, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/myusername/.cache/pypoetry/virtualenvs/annif-ul-EXdhi-py3.11/lib/python3.11/site-packages/click/core.py", line 1434, in invoke
return ctx.invoke(self.callback, **ctx.params)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/myusername/.cache/pypoetry/virtualenvs/annif-ul-EXdhi-py3.11/lib/python3.11/site-packages/click/core.py", line 783, in invoke
return __callback(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/myusername/.cache/pypoetry/virtualenvs/annif-ul-EXdhi-py3.11/lib/python3.11/site-packages/click/decorators.py", line 33, in new_func
return f(get_current_context(), *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/myusername/.cache/pypoetry/virtualenvs/annif-ul-EXdhi-py3.11/lib/python3.11/site-packages/flask/cli.py", line 357, in decorator
return __ctx.invoke(f, *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/myusername/.cache/pypoetry/virtualenvs/annif-ul-EXdhi-py3.11/lib/python3.11/site-packages/click/core.py", line 783, in invoke
return __callback(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/local/myusername/git/Annif/annif/cli.py", line 686, in run_download
cli_util.unzip(project_zip_local_cache_path, force)
File "/home/local/myusername/git/Annif/annif/cli_util.py", line 325, in unzip
_is_existing(member)
File "/home/local/myusername/git/Annif/annif/cli_util.py", line 351, in _is_existing
print(zinfo.CRC)
^^^^^^^^^
AttributeError: 'ZipInfo' object has no attribute 'CRC'
The CRC attribute is declared in `ZipInfo.__slots__`, but then it is not initialized when running `ZipInfo.from_file()`?
Also trying to access `zinfo.FileHeader()` fails similarly, with the traceback ending
...
File "/usr/lib/python3.11/zipfile.py", line 453, in FileHeader
CRC = self.CRC
^^^^^^^^
AttributeError: 'ZipInfo' object has no attribute 'CRC'
Anyway, the CRC-32 checksums for the existing, uncompressed files could be calculated with `binascii.crc32()`.
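A sketch of how that comparison might look (helper names and paths are hypothetical, not the PR's actual code): the CRC of each archive member comes from its `ZipInfo`, and the CRC of the corresponding local file is computed with `binascii.crc32()`:

```python
import binascii
import os
import zipfile

def local_crc32(path):
    """Compute the CRC-32 checksum of an existing, uncompressed file."""
    crc = 0
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            crc = binascii.crc32(chunk, crc)
    return crc

def differing_members(zip_path, target_dir="."):
    """Yield archive members whose local counterpart exists but differs."""
    with zipfile.ZipFile(zip_path) as zfile:
        for zinfo in zfile.infolist():
            target = os.path.join(target_dir, zinfo.filename)
            if not zinfo.is_dir() and os.path.exists(target):
                # zinfo.CRC is always populated for members read
                # from an existing archive
                if local_crc32(target) != zinfo.CRC:
                    yield zinfo.filename
```

Members with a differing CRC could then trigger the "use `--force` to override" message, while identical ones can be skipped silently.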
I added a `--force` option to overwrite existing local files (project and vocabulary data files and configs), and modified the unzip/config-copy logic so that when the downloaded and local contents are identical, the unzip/config-copy is skipped.

I did not see a reason to restrict the content comparison to only vocabularies (and it was easier to implement it for both projects and vocabs :slightly_smiling_face:).
Demo with
#!/bin/bash
rm -r data/projects/{fasttext-fi,tfidf-fi,yake-fi}
echo "# Initial download"
annif download "tfidf-fi" juhoinkinen/Annif-models-upload-testing
echo "# Second download with identical content, no-op"
annif download "tfidf-fi" juhoinkinen/Annif-models-upload-testing
echo $RANDOM > data/projects/tfidf-fi/file.txt
echo $RANDOM >> projects.d/tfidf-fi.cfg
echo "# Third download over changed content, skips file overwrites with complaints"
annif download "tfidf-fi" juhoinkinen/Annif-models-upload-testing
echo "# Forced third download over changed content"
annif download "tfidf-fi" juhoinkinen/Annif-models-upload-testing --force
tree data
The output:
# Initial download
Downloading project(s): tfidf-fi
# Second download with identical content, no-op
Downloading project(s): tfidf-fi
# Third download over changed content, skips file overwrites with complaints
Downloading project(s): tfidf-fi
Not overwriting data/projects/tfidf-fi/file.txt (use --force to override)
Not overwriting projects.d/tfidf-fi.cfg (use --force to override)
# Forced third download over changed content
Downloading project(s): tfidf-fi
data
├── projects
│ └── tfidf-fi
│ └── file.txt
└── vocabs
└── yso
├── subjects.csv
├── subjects.dump.gz
└── subjects.ttl
4 directories, 4 files
Force-pushed to squash multiple refactoring and renaming commits.
I noticed that after
annif download yso-bonsai-en juhoinkinen/Annif-models-upload-testing
the modification time of omikuji projects as detected by Annif is wrong; it is shown to be the extraction time:
annif show-project yso-bonsai-en
...
Modification time: 2024-03-13 11:36:22
This is because the timestamp of `data/projects/yso-bonsai-en/omikuji-model` is set to the time when the files it contains are extracted.

This PR implements restoration of the timestamps of the extracted files and directories, but this happens zipmember-by-zipmember, and typically the project subdirectories are extracted first (`yso-bonsai-en/omikuji-model/`).
I think the modification time detection should be changed to disregard subdirectories, and not to try to handle this issue in this PR.
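For reference, a minimal sketch of timestamp restoration that sidesteps the ordering problem by restoring mtimes only after the whole archive has been extracted (the function name is made up; this is not the PR's implementation):

```python
import os
import time
import zipfile

def extract_with_timestamps(zip_path, target_dir):
    """Extract an archive and restore each member's modification time.

    Timestamps are restored only after the whole archive is extracted,
    so extracting a file cannot bump the mtime of an already restored
    parent directory.
    """
    with zipfile.ZipFile(zip_path) as zfile:
        zfile.extractall(target_dir)
        for zinfo in zfile.infolist():
            # ZipInfo.date_time is a 6-tuple (local time, 2 s resolution);
            # extend it to the 9-tuple time.mktime() expects
            mtime = time.mktime(zinfo.date_time + (0, 0, -1))
            target = os.path.join(target_dir, zinfo.filename)
            os.utime(target, (mtime, mtime))
```

Note that zip timestamps have only 2-second resolution and no timezone, so exact round-tripping is inherently approximate.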
Now I see that the `--revision` option of the download command is not so useful: each file gets uploaded separately, so they are included in separate commits with different commit hashes (e.g. for the project config file `/yso-mllm-en.cfg` vs. the related data zip file `projects/yso-mllm-en.zip`).

So this happens if I try to use a revision with the commit hash of `projects/yso-mllm-en.zip`:
annif download yso-mllm-en NatLibFi/FintoAI-data-YSO --revision 0ff6f970347ff671470f7e33ed4ea1a949e02597
Downloading project(s): yso-mllm-en
Error: Operation failed: 404 Client Error. (Request ID: Root=1-65f19f92-22b20ed73b53cacb547eda8c;1c4d5379-92cf-41b6-baa0-51e2b11c837f)
Entry Not Found for url: https://huggingface.co/NatLibFi/FintoAI-data-YSO/resolve/0ff6f970347ff671470f7e33ed4ea1a949e02597/yso-mllm-en.cfg.
Using a branch name as the revision works (`--revision main`), but there is no way to set a branch in the upload command: it is not possible to create a new branch when using `HfApi.upload_file`.

So at this time I'm in favor of dropping the `--revision` option from the download command. OFC there is a way to create a branch with `HfApi.create_branch`, but I don't think it is so needed at this time.
Edit: I already removed the `--revision` option.
Any thoughts @osma?
I created a Wiki page about the HFH integration: https://github.com/NatLibFi/Annif/wiki/Hugging-Face-Hub-integration.
I think this is ready for review.
Some points to note and consider:
- Some functionality could be moved from `cli_util.py` to `util.py` to shorten the file, but what...
- Is the argument order of `annif download <PROJECT_IDS_PATTERN> <REPO_ID>` actually good? `huggingface-cli download` expects first the repo-id and then the file(s). However, Annif commands generally use the pattern `annif <command> <project-id>`, so I chose the current order to be in line with other commands.

> Now I see that the `--revision` option of the download command is not so useful: each file gets uploaded separately, and so they are included in separate commits with different commit hashes (e.g. for the project config file `/yso-mllm-en.cfg` vs. the related data zip file `projects/yso-mllm-en.zip`).
>
> So this happens if I try to use a revision with the commit hash of `projects/yso-mllm-en.zip`:
>
> annif download yso-mllm-en NatLibFi/FintoAI-data-YSO --revision 0ff6f970347ff671470f7e33ed4ea1a949e02597
> Downloading project(s): yso-mllm-en
> Error: Operation failed: 404 Client Error. (Request ID: Root=1-65f19f92-22b20ed73b53cacb547eda8c;1c4d5379-92cf-41b6-baa0-51e2b11c837f)
> Entry Not Found for url: https://huggingface.co/NatLibFi/FintoAI-data-YSO/resolve/0ff6f970347ff671470f7e33ed4ea1a949e02597/yso-mllm-en.cfg.
>
> Using a branch name as the revision works (`--revision main`), but there is no way to set it in the upload command. It is not possible to create a new branch when using `HfApi.upload_file`.
>
> So at this time I'm in favor of dropping the `--revision` option from the download command. OFC there is a way to create a branch with `HfApi.create_branch`, but I don't think it is so needed at this time.
>
> Edit: I already removed the `--revision` option.
>
> Any thoughts @osma?
Actually multiple files could be included in one commit using `CommitOperationAdd()` and `create_commit()`, but it seems that for this all the included files should exist until running `create_commit()`, and the zips can be quite large.
Edit: Now when thinking about this, the possibility to download a specific "published version" of the models would be very valuable, and it is not possible with the current code. The `--revision` option should be restored, and maybe it would be enough to instruct using it to target just branches or tags (which I assume works with the current `upload_file()` approach) and not commits.
Edit2: But there is still the question of how to tag the files. There are methods for tagging, `huggingface_hub.Repository.add_tag()` and `huggingface_hub.HfApi.create_tag()`, but tagging is not possible with the `huggingface-cli`, which would have been the easiest option. And I don't see a way to create tags in the Hugging Face Hub website either. So tagging should be implemented in Annif, if using tags is to be supported.
I commented on the huggingface-hub repository about git tags, and they promptly created an issue for creating, listing and deleting tags with the huggingface-cli (I also wrote a question on the HF forum).

If/when it gets implemented, I think it would be most convenient for the case of Annif models if the tagging were performed after `annif upload <projects-pattern> <repo-id>` using huggingface-cli, e.g. `huggingface-cli tag <repo-id> <tag-name>`.
> Actually multiple files could be included in one commit using `CommitOperationAdd()` and `create_commit()`, but it seems that for this all the included files should exist until running `create_commit()`, and the zips can be quite large.
The `preupload_lfs_files()` function helped to put all files in one commit but still upload them separately.

When testing with the `yso-*fi` projects of FintoAI (10.6 GB uncompressed, 8.8 GB compressed), `/usr/bin/time -v` reported `Maximum resident set size (kbytes): 49616` (and I did not notice much increase in disk space usage); the upload took about 20 min at the office.
While uploading projects to FintoAI-data-YSO with `annif upload "*" NatLibFi/FintoAI-data-YSO`, this error occurred on two tries:
...
File "/home/local/jmminkin/git/Annif/annif/cli_util.py", line 259, in prepare_datadir_commit
fobj = _archive_dir(data_dir)
^^^^^^^^^^^^^^^^^^^^^^
File "/home/local/jmminkin/git/Annif/annif/cli_util.py", line 285, in _archive_dir
with zipfile.ZipFile(fp, mode="w") as zfile:
File "/usr/lib/python3.11/zipfile.py", line 1355, in __exit__
self.close()
File "/usr/lib/python3.11/zipfile.py", line 1911, in close
self.fp.seek(self.start_dir)
OSError: [Errno 28] No space left on device
However I had about 60 GB of free disk space. And preuploading should free the memory of the uploaded objects, so it should not be a memory issue either... Anyway, this could be circumvented by uploading the projects of only one language per command.
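One possible cause (an assumption, not verified here): if `_archive_dir` buffers the zip in a `tempfile`-backed object, the bytes go to the system temp directory, which is often a smallish tmpfs, regardless of how much space is free on the home partition. Letting the caller choose where the temporary zip is buffered would sidestep that; a sketch with a hypothetical `archive_dir` helper:

```python
import os
import tempfile
import zipfile

def archive_dir(data_dir, tmp_dir=None):
    """Zip the contents of data_dir into an anonymous temporary file.

    tmp_dir lets the caller choose where the (possibly huge) temporary
    zip is buffered, instead of the default system temp directory.
    """
    fp = tempfile.TemporaryFile(dir=tmp_dir)
    with zipfile.ZipFile(fp, mode="w") as zfile:
        for root, _dirs, files in os.walk(data_dir):
            for name in files:
                path = os.path.join(root, name)
                # store paths relative to the datadir root
                zfile.write(path, os.path.relpath(path, data_dir))
    fp.seek(0)  # rewind so the caller can stream the zip from the start
    return fp
```

Whether this matches the actual failure would need checking against the real `_archive_dir` and the free space in `/tmp` at the time.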
Issues: 0 new issues, 2 accepted issues
Measures: 0 security hotspots, no data about coverage, 0.0% duplication on new code
:tada:
The organization of the uploaded files follows option 4 of issue #760:

> Each vocabulary zip is placed in `vocabs/<vocab-id>.zip`, and each project configuration is placed in `<project-id>.cfg` in the repo root.

This option is good for caching, by preventing unnecessary uploads/downloads of the projects bundle when only one is changed, and for the visibility of the repo contents.
The unzip after download places the data directories directly in `data/projects/` and `data/vocabs/`, and the project configurations as separate `<project-id>.cfg` files in the `projects.d/` directory, so after download the projects are directly usable (if `projects.{cfg,toml}` does not exist, or by using the `--projects` option).

Upload
Push a set of selected projects and vocabularies to a Hugging Face Hub repository as zip files, and the configurations of the projects as cfg/ini files. For example, the following command uploads all projects with ids matching `*-fi` to juhoinkinen/Annif-models-upload-testing:

annif upload "*-fi" juhoinkinen/Annif-models-upload-testing

Download
Downloads the project and vocabulary archives and the configuration files of the projects that match the given pattern, unzips the archives to the `data/` directory, and places the configuration files in the `projects.d/` directory.

~Note that currently the download will overwrite the project and vocab dirs if they already exist.~ Edit: By default overwrite does not happen; it can be performed by adding the `--force` option to the command.

A git revision (commit hash, branch etc.) can be specified with the `--revision` option.

The downloaded files remain in the huggingface_hub client cache dir.
For all uploads and downloads, in the case of private repos the user needs to have logged in with `huggingface-cli login`; alternatively the HF Hub token can be given with the `--token` option of the Annif command.