datalad / datalad-next

DataLad extension for new functionality and improved user experience
https://datalad.org

archivist special remote: add support for tar archives with `.tgz` extension #517

Closed loj closed 1 year ago

loj commented 1 year ago

I'm working on building a dataset from .tgz archives using the replacement for add-archive-content demonstrated here, in combination with the archivist special remote. The demo below works if the archive has a .tar.gz extension, but not with .tgz. With .tgz, I need to enable datalad.archivist.legacy-mode for datalad get to succeed. Here's a quick demo:

% mkdir project
% touch project/file1.txt project/file2.txt project/file3.txt
% tar -czvf project.tgz project
% datalad create tmp && cd tmp
% cp ../project.tgz ./
% datalad save -m "add archive" project.tgz
% git annex initremote archivist type=external externaltype=archivist encryption=none autoenable=true
% archivekey=$(git annex lookupkey project.tgz)
% datalad -f json ls-file-collection tarfile project.tgz --hash md5 | jq '. | select(.type == "file")' | jq --slurp . | datalad addurls --key 'et:MD5-s{size}--{hash-md5}' - "dl+archive:${archivekey}#path={item}&size={size}" '{item}'
% filekey=$(git annex lookupkey project/file1.txt)
% archivist_uuid=$(git annex info archivist | grep 'uuid' | cut -d ' ' -f 2)
% git annex setpresentkey $filekey $archivist_uuid 1
% datalad get project/file1.txt
get(error): project/file1.txt (file) [Could not obtain 'MD5E-s0--d41d8cd98f00b204e9800998ecf8427e.txt' -caused by- NotImplementedError]
% datalad configuration --scope local set datalad.archivist.legacy-mode=yes
set_configuration(ok): . [datalad.archivist.legacy-mode=yes]
% datalad get project/file1.txt
[INFO   ] datalad-archives special remote is using an extraction cache under /playground/loj/abcd/tmp3/.git/datalad/tmp/archives/8bc4249de3. Remove it with DataLad's 'clean' command to save disk space.
get(ok): project/file1.txt (file) [from archivist...]
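
The key named in the error message follows git-annex's `MD5E-s<size>--<md5><extension>` naming, which appears to be what the `et:MD5-s{size}--{hash-md5}` key template in the `addurls` call resolves to. Below is a minimal Python sketch of that naming for the empty `file1.txt` from the demo. The helper name `md5e_key` is hypothetical, and appending only the last file extension is a simplification of what git-annex does:

```python
import hashlib
from pathlib import PurePosixPath


def md5e_key(content: bytes, filename: str) -> str:
    """Compose an MD5E-style annex key name for the given content.

    Simplified assumption: only the final extension of ``filename``
    is appended to the key.
    """
    digest = hashlib.md5(content).hexdigest()
    ext = PurePosixPath(filename).suffix
    return f"MD5E-s{len(content)}--{digest}{ext}"


# file1.txt was created with `touch`, so its content is empty
print(md5e_key(b"", "project/file1.txt"))
# prints: MD5E-s0--d41d8cd98f00b204e9800998ecf8427e.txt
```

This matches the key in the `get(error)` line, so the file key itself looks fine; the failure concerns how the archive behind it is handled.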
datalad wtf

```
# WTF
## configuration
## credentials
  - keyring:
    - active_backends:
      - PlaintextKeyring with no encyption v.1.0 at /home/loj/.local/share/python_keyring/keyring_pass.cfg
    - config_file: /home/loj/.config/python_keyring/keyringrc.cfg
    - data_root: /home/loj/.local/share/python_keyring
## datalad
  - version: 0.19.3
## dependencies
  - annexremote: 1.6.0
  - boto: 2.49.0
  - cmd:7z: 16.02
  - cmd:annex: 10.20221003
  - cmd:bundled-git: UNKNOWN
  - cmd:git: 2.39.2
  - cmd:ssh: 8.4p1
  - cmd:system-git: 2.39.2
  - cmd:system-ssh: 8.4p1
  - humanize: 4.8.0
  - iso8601: 2.1.0
  - keyring: 24.2.0
  - keyrings.alt: 5.0.0
  - msgpack: 1.0.7
  - platformdirs: 3.11.0
  - requests: 2.31.0
## environment
  - LANG: en_US.UTF-8
  - LANGUAGE: en_US.UTF-8
  - LC_ALL: en_US.UTF-8
  - LC_CTYPE: en_US.UTF-8
  - PATH: /home/loj/.venvs/abcd-long/bin:/home/loj/.dotfiles/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/X11R6/bin:/usr/local/games:/usr/games
## extensions
  - container:
    - description: Containerized environments
    - entrypoints:
      - datalad_container.containers_add.ContainersAdd:
        - class: ContainersAdd
        - module: datalad_container.containers_add
        - names:
          - containers-add
          - containers_add
      - datalad_container.containers_list.ContainersList:
        - class: ContainersList
        - module: datalad_container.containers_list
        - names:
          - containers-list
          - containers_list
      - datalad_container.containers_remove.ContainersRemove:
        - class: ContainersRemove
        - module: datalad_container.containers_remove
        - names:
          - containers-remove
          - containers_remove
      - datalad_container.containers_run.ContainersRun:
        - class: ContainersRun
        - module: datalad_container.containers_run
        - names:
          - containers-run
          - containers_run
    - module: datalad_container
    - version: 1.2.3
  - next:
    - description: What is next in DataLad
    - entrypoints:
      - datalad_next.commands.create_sibling_webdav.CreateSiblingWebDAV:
        - class: CreateSiblingWebDAV
        - module: datalad_next.commands.create_sibling_webdav
        - names:
          - create-sibling-webdav
      - datalad_next.commands.credentials.Credentials:
        - class: Credentials
        - module: datalad_next.commands.credentials
        - names:
      - datalad_next.commands.download.Download:
        - class: Download
        - module: datalad_next.commands.download
        - names:
          - download
      - datalad_next.commands.ls_file_collection.LsFileCollection:
        - class: LsFileCollection
        - module: datalad_next.commands.ls_file_collection
        - names:
          - ls-file-collection
      - datalad_next.commands.tree.TreeCommand:
        - class: TreeCommand
        - module: datalad_next.commands.tree
        - names:
          - tree
    - module: datalad_next
    - version: 1.0.1
## git-annex
  - build flags:
    - Assistant
    - Webapp
    - Pairing
    - Inotify
    - DBus
    - DesktopNotify
    - TorrentParser
    - MagicMime
    - Benchmark
    - Feeds
    - Testsuite
    - S3
    - WebDAV
  - dependency versions:
    - aws-0.22
    - bloomfilter-2.0.1.0
    - cryptonite-0.26
    - DAV-1.3.4
    - feed-1.3.0.1
    - ghc-8.8.4
    - http-client-0.6.4.1
    - persistent-sqlite-2.10.6.2
    - torrent-10000.1.1
    - uuid-1.3.13
    - yesod-1.6.1.0
  - key/value backends:
    - SHA256E
    - SHA256
    - SHA512E
    - SHA512
    - SHA224E
    - SHA224
    - SHA384E
    - SHA384
    - SHA3_256E
    - SHA3_256
    - SHA3_512E
    - SHA3_512
    - SHA3_224E
    - SHA3_224
    - SHA3_384E
    - SHA3_384
    - SKEIN256E
    - SKEIN256
    - SKEIN512E
    - SKEIN512
    - BLAKE2B256E
    - BLAKE2B256
    - BLAKE2B512E
    - BLAKE2B512
    - BLAKE2B160E
    - BLAKE2B160
    - BLAKE2B224E
    - BLAKE2B224
    - BLAKE2B384E
    - BLAKE2B384
    - BLAKE2BP512E
    - BLAKE2BP512
    - BLAKE2S256E
    - BLAKE2S256
    - BLAKE2S160E
    - BLAKE2S160
    - BLAKE2S224E
    - BLAKE2S224
    - BLAKE2SP256E
    - BLAKE2SP256
    - BLAKE2SP224E
    - BLAKE2SP224
    - SHA1E
    - SHA1
    - MD5E
    - MD5
    - WORM
    - URL
    - X*
  - operating system: linux x86_64
  - remote types:
    - git
    - gcrypt
    - p2p
    - S3
    - bup
    - directory
    - rsync
    - web
    - bittorrent
    - webdav
    - adb
    - tahoe
    - glacier
    - ddar
    - git-lfs
    - httpalso
    - borg
    - hook
    - external
  - supported repository versions:
    - 8
    - 9
    - 10
  - upgrade supported from repository versions:
    - 0
    - 1
    - 2
    - 3
    - 4
    - 5
    - 6
    - 7
    - 8
    - 9
    - 10
  - version: 10.20221003
## location
  - path: /playground/loj/abcd
  - type: directory
## metadata.extractors
  - container_inspect:
    - distribution: datalad-container 1.2.3
    - load_error: ModuleNotFoundError(No module named 'datalad_metalad')
    - module: datalad_container.extractors.metalad_container
## metadata.filters
## metadata.indexers
## python
  - implementation: CPython
  - version: 3.9.2
## system
  - distribution: debian/11/bullseye
  - encoding:
    - default: utf-8
    - filesystem: utf-8
    - locale.prefered: UTF-8
  - filesystem:
    - CWD:
      - path: /playground/loj/abcd
    - HOME:
      - path: /home/loj
    - TMP:
      - path: /tmp
  - max_path_length: 276
  - name: Linux
  - release: 5.10.0-23-amd64
  - type: posix
  - version: #1 SMP Debian 5.10.179-1 (2023-05-12)
```

mih commented 1 year ago

Thanks a lot for the excellent report that made it easy to spot the issue. The problem is indeed that the .tgz extension is not recognized when detecting the archive type. There are two things that can be done here.

Fix 1:

You can declare the archive type in the URL. The adjusted addurls call that does this is:

datalad -f json ls-file-collection tarfile project.tgz --hash md5 | jq '. | select(.type == "file")' | jq --slurp . | datalad addurls --key 'et:MD5-s{size}--{hash-md5}' - "dl+archive:${archivekey}#path={item}&size={size}&atype=tar" '{item}'

(look for the atype=tar parameter at the end of the URL template). The docs on this are at https://docs.datalad.org/projects/next/en/latest/generated/generated/datalad_next.types.archivist.html#syntax-of-dl-archives-locators
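
To make the shape of the resulting locator concrete, here is a minimal Python sketch that fills in the same template as the `addurls` call above. The function name `archive_locator` and the example key are made up for illustration. The sketch does plain string substitution and applies no escaping to the path, which is an assumption about what the template does, not a statement about datalad-next internals:

```python
def archive_locator(archive_key: str, path: str, size: int, atype: str = "") -> str:
    """Build a dl+archive: locator string following the addurls template.

    ``atype`` is only appended when given, mirroring the difference
    between the original and the adjusted call.
    """
    url = f"dl+archive:{archive_key}#path={path}&size={size}"
    if atype:
        url += f"&atype={atype}"
    return url


# Placeholder key for illustration only, not the key from the demo
key = "MD5E-s100--0123456789abcdef0123456789abcdef.tgz"
print(archive_locator(key, "project/file1.txt", 0, atype="tar"))
```

With `atype=tar` declared explicitly, the archive type no longer has to be inferred from the extension of the archive key.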

Fix 2:

The following patch would make this unnecessary, and I think it is sensible to recognize .tgz as a TAR archive.

diff --git a/datalad_next/types/archivist.py b/datalad_next/types/archivist.py
index 12e9b2b..3c1ab49 100644
--- a/datalad_next/types/archivist.py
+++ b/datalad_next/types/archivist.py
@@ -134,6 +134,8 @@ class ArchivistLocator:
                 atype = ArchiveType.zip
             elif '.tar' in suf:
                 atype = ArchiveType.tar
+            elif '.tgz' in suf:
+                atype = ArchiveType.tar

         return cls(
             akey=akey,
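
The effect of the patch can be sketched in isolation as follows. This is a standalone approximation, not the actual `ArchivistLocator` code: the `ArchiveType` enum here is a stand-in, `guess_archive_type` is a hypothetical helper, and it assumes that `suf` in the patch holds the list of file extensions of the archive key name (as returned by `PurePosixPath.suffixes`):

```python
from enum import Enum
from pathlib import PurePosixPath
from typing import Optional


class ArchiveType(Enum):
    # Stand-in for the enum of the same name referenced in the patch
    tar = "tar"
    zip = "zip"


def guess_archive_type(key_name: str) -> Optional[ArchiveType]:
    """Guess the archive type from the extensions of a key name."""
    # Assumption: `suf` is the list of all extensions, e.g. ['.tar', '.gz']
    suf = PurePosixPath(key_name).suffixes
    if ".zip" in suf:
        return ArchiveType.zip
    elif ".tar" in suf:
        return ArchiveType.tar
    elif ".tgz" in suf:
        # The branch added by the patch: treat .tgz like .tar.gz
        return ArchiveType.tar
    return None


print(guess_archive_type("MD5E-s204--abc123.tar.gz"))  # ArchiveType.tar
print(guess_archive_type("MD5E-s204--abc123.tgz"))     # ArchiveType.tar
```

Without the `.tgz` branch, the second call would return `None`, which corresponds to the situation in the report where no archive type could be determined.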

I will propose a PR.