datalad / datalad-ria

Adds functionality for RIA stores to DataLad
http://datalad.org

Unclean removal of sibling #44

Open jennydaman opened 1 year ago

jennydaman commented 1 year ago

What is the problem?

I want to be able to completely reset the configuration of siblings for a dataset. However, the datalad siblings remove command does not cleanly remove all configurations for the dataset.

What steps will reproduce the problem?

First, create an example dataset:

datalad create wow
cd wow
datalad run 'echo data > something.dat'
datalad create-sibling-ria -s ria --new-store-ok ria+file:///tmp/ria

Next, I attempt to undo the effects of datalad create-sibling-ria:

rm -rf /tmp/ria
datalad siblings remove -s ria
datalad siblings remove -s ria-storage

At this point, as expected datalad siblings reports that the only sibling is here:

datalad siblings
.: here(+) [git]

However, I am unable to recreate the sibling:

datalad create-sibling-ria -s ria --new-store-ok ria+file:///tmp/ria
create-sibling-ria(error): /home/jenni/wow (sibling) [a sibling 'ria-storage' is already configured in dataset '/home/jenni/wow']

Moreover, a TypeError is encountered if you use --existing reconfigure:

datalad create-sibling-ria -s ria --new-store-ok ria+file:///tmp/ria --existing reconfigure
[INFO   ] create siblings 'ria' and 'ria-storage' ...
[ERROR  ] expected str, bytes or os.PathLike object, not NoneType

DataLad information

# WTF
## configuration <SENSITIVE, report disabled by configuration>
## credentials
  - keyring:
    - active_backends:
      - SecretService Keyring
      - PlaintextKeyring with no encyption v.1.0 at /home/jenni/.local/share/python_keyring/keyring_pass.cfg
    - config_file: /home/jenni/.config/python_keyring/keyringrc.cfg
    - data_root: /home/jenni/.local/share/python_keyring
## datalad
  - version: 0.19.3
## dataset
  - branches:
    - git-annex@98bb2cc
    - master@d2d7244
  - id: 3835c8ea-f3cb-463c-9e1d-ce33adf71516
  - path: /home/jenni/wow
  - repo: AnnexRepo
## dependencies
  - annexremote: 1.6.0
  - boto: 2.49.0
  - cmd:7z: 16.02
  - cmd:annex: 10.20230626-g801c4b7
  - cmd:bundled-git: UNKNOWN
  - cmd:git: 2.42.0
  - cmd:ssh: 9.3p1
  - cmd:system-git: 2.42.0
  - cmd:system-ssh: 9.3p1
  - humanize: 4.8.0
  - iso8601: 2.0.0
  - keyring: 24.2.0
  - keyrings.alt: 4.2.0
  - msgpack: 1.0.5
  - platformdirs: 3.10.0
  - requests: 2.31.0
## environment
  - LANG: en_US.UTF-8
  - LC_MESSAGES:
  - LC_TIME: en_DK.UTF-8
  - PATH: /home/jenni/micromamba/envs/datalad/bin:/home/jenni/.local/share/pnpm:/home/jenni/micromamba/condabin:/home/jenni/opt/bin:/home/jenni/opt/itksnap-4.0.0-20230220-Linux-gcc64/bin:/home/jenni/bin:/home/jenni/.local/bin:/home/jenni/.cargo/bin:/usr/lib/ccache/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/opt/cuda/bin:/opt/cuda/nsight_compute:/opt/cuda/nsight_systems/bin:/var/lib/flatpak/exports/bin:/usr/lib/jvm/default/bin:/usr/bin/site_perl:/usr/bin/vendor_perl:/usr/bin/core_perl:/usr/lib/rustup/bin:/home/jenni/.local/share/gem/ruby/3.0.0/bin:/home/jenni/.yarn/bin:/home/jenni/.conda/envs/datalad/bin
## extensions
## git-annex
  - build flags:
    - Assistant
    - Webapp
    - Pairing
    - Inotify
    - DBus
    - DesktopNotify
    - TorrentParser
    - MagicMime
    - Benchmark
    - Feeds
    - Testsuite
    - S3
    - WebDAV
  - dependency versions:
    - aws-0.22
    - bloomfilter-2.0.1.0
    - cryptonite-0.29
    - DAV-1.3.4
    - feed-1.3.2.0
    - ghc-8.10.7
    - http-client-0.7.9
    - persistent-sqlite-2.13.0.3
    - torrent-10000.1.1
    - uuid-1.3.15
    - yesod-1.6.1.2
  - key/value backends:
    - SHA256E
    - SHA256
    - SHA512E
    - SHA512
    - SHA224E
    - SHA224
    - SHA384E
    - SHA384
    - SHA3_256E
    - SHA3_256
    - SHA3_512E
    - SHA3_512
    - SHA3_224E
    - SHA3_224
    - SHA3_384E
    - SHA3_384
    - SKEIN256E
    - SKEIN256
    - SKEIN512E
    - SKEIN512
    - BLAKE2B256E
    - BLAKE2B256
    - BLAKE2B512E
    - BLAKE2B512
    - BLAKE2B160E
    - BLAKE2B160
    - BLAKE2B224E
    - BLAKE2B224
    - BLAKE2B384E
    - BLAKE2B384
    - BLAKE2BP512E
    - BLAKE2BP512
    - BLAKE2S256E
    - BLAKE2S256
    - BLAKE2S160E
    - BLAKE2S160
    - BLAKE2S224E
    - BLAKE2S224
    - BLAKE2SP256E
    - BLAKE2SP256
    - BLAKE2SP224E
    - BLAKE2SP224
    - SHA1E
    - SHA1
    - MD5E
    - MD5
    - WORM
    - URL
    - X*
  - local repository version: 10
  - operating system: linux x86_64
  - remote types:
    - git
    - gcrypt
    - p2p
    - S3
    - bup
    - directory
    - rsync
    - web
    - bittorrent
    - webdav
    - adb
    - tahoe
    - glacier
    - ddar
    - git-lfs
    - httpalso
    - borg
    - hook
    - external
  - supported repository versions:
    - 8
    - 9
    - 10
  - upgrade supported from repository versions:
    - 0
    - 1
    - 2
    - 3
    - 4
    - 5
    - 6
    - 7
    - 8
    - 9
    - 10
  - version: 10.20230626-g801c4b7
## location
  - path: /home/jenni/wow
  - type: dataset
## metadata.extractors
## metadata.filters
## metadata.indexers
## python
  - implementation: CPython
  - version: 3.11.4
## system
  - distribution: arch
  - encoding:
    - default: utf-8
    - filesystem: utf-8
    - locale.prefered: UTF-8
  - filesystem:
    - CWD:
      - max_pathlength: 4096
      - mount_opts: rw,relatime,compress=zstd:3,ssd,space_cache=v2,subvolid=263,subvol=/archlinux-root
      - path: /home/jenni/wow
      - type: btrfs
    - HOME:
      - max_pathlength: 4096
      - mount_opts: rw,relatime,compress=zstd:3,ssd,space_cache=v2,subvolid=263,subvol=/archlinux-root
      - path: /home/jenni
      - type: btrfs
    - TMP:
      - max_pathlength: 4096
      - mount_opts: rw,relatime,compress=zstd:3,ssd,space_cache=v2,subvolid=263,subvol=/archlinux-root
      - path: /tmp
      - type: btrfs
  - max_path_length: 271
  - name: Linux
  - release: 6.4.10-arch1-1
  - type: posix
  - version: #1 SMP PREEMPT_DYNAMIC Fri, 11 Aug 2023 11:03:36 +0000

Additional context

The above is a minimal reproduction of the error I am encountering while trying to use datalad with my real data. I've encountered other problems related to removed siblings which are harder to reproduce. Some errors went away after rerunning datalad create-sibling-ria. Also, when I run datalad clone ... I get errors related to removed siblings:

datalad clone 'ria+file:///neuro/labs/grantlab/research/Jennings/var/datalad_ria#51a40309-d98a-40aa-9891-298d42215e7f' wowe
[INFO   ] RIA store unavailable. -caused by- ssh://jennings.zhang@centurion.tch.harvard.edu:/neuro/labs/grantlab/research/Jennings/datalad_ria/innersp_fitting_data_analysis/ria-layout-version not found, self.ria_store_url: ria+ssh://jennings.zhang@centurion.tch.harvard.edu:/neuro/labs/grantlab/research/Jennings/datalad_ria/innersp_fitting_data_analysis, self.store_base_pass: /neuro/labs/grantlab/research/Jennings/datalad_ria/innersp_fitting_data_analysis, self.store_base_pass_push: None, path: <class 'pathlib.PosixPath'> /neuro/labs/grantlab/research/Jennings/datalad_ria/innersp_fitting_data_analysis/ria-layout-version -caused by- /neuro/labs/grantlab/research/Jennings/datalad_ria/innersp_fitting_data_analysis/ria-layout-version not found. -caused by- cat  /neuro/labs/grantlab/research/Jennings/datalad_ria/innersp_fitting_data_analysis/ria-layout-version failed:
[INFO   ] RIA store unavailable. -caused by- file:///neuro/labs/grantlab/research/Jennings/datalad_ria/innersp_fitting_data_analysis/ria-layout-version not found, self.ria_store_url: ria+file:///neuro/labs/grantlab/research/Jennings/datalad_ria/innersp_fitting_data_analysis, self.store_base_pass: /neuro/labs/grantlab/research/Jennings/datalad_ria/innersp_fitting_data_analysis, self.store_base_pass_push: None, path: <class 'pathlib.PosixPath'> /neuro/labs/grantlab/research/Jennings/datalad_ria/innersp_fitting_data_analysis/ria-layout-version -caused by- [Errno 2] No such file or directory: '/neuro/labs/grantlab/research/Jennings/datalad_ria/innersp_fitting_data_analysis/ria-layout-version'
[INFO   ] Configure additional publication dependency on "ria-storage"
configure-sibling(ok): . (sibling)
install(ok): /tmp/wowe (dataset)
action summary:
  configure-sibling (ok: 1)
  install (ok: 1)

Here, /neuro/labs/grantlab/research/Jennings/var/datalad_ria is a valid DataLad RIA store, whereas /neuro/labs/grantlab/research/Jennings/datalad_ria/innersp_fitting_data_analysis was deleted.

Have you had any success using DataLad before?

No response

jennydaman commented 1 year ago

For my actual data, I tried deleting and then recreating the sibling as shown above, then pushing again. However, DataLad reports that the push is not needed even though the data is not present in the RIA store.

 datalad siblings
.: here(+) [git]
.: ria(-) [/neuro/labs/grantlab/research/Jennings/var/datalad_ria/51a/40309-d98a-40aa-9891-298d42215e7f (git)]
.: ria-storage(+) [ora]

du -hs /neuro/labs/grantlab/research/Jennings/var/datalad_ria
31M     /neuro/labs/grantlab/research/Jennings/var/datalad_ria

du -hs .
13G     .

datalad push --to ria -r
action summary:
  copy (notneeded: 638)
  publish (notneeded: 4)

adswa commented 1 year ago

Hey @jennydaman, thanks for the detailed issue! Most of what you describe are sadly known deficiencies of the current implementation, and we have plans to completely redo the RIA functionality to iron them out. Other urgent projects have delayed this so far, however, and I hope we can get to it in the coming weeks.

The first issue you describe, the RIA sibling lingering around in the background, reproduces. It is fixable, but we don't have ready-made datalad commands to do that work. Generally you would either reconfigure the existing special remote (with the git-annex command git annex enableremote) or use a different special remote name in your new RIA store. Unlike Git remotes, git-annex special remotes are not easy to remove (a strong safeguard against ending up with two special remotes of the same name). Even when datalad siblings or git remote -v does not list them, you will still find the supposedly removed special remote when running git annex info. If you really want to remove it, you can declare the ria-storage special remote dead (git annex dead ria-storage), which hides it almost completely. To reuse the same special remote name from scratch, you would also need to forcefully purge it (git annex forget --drop-dead --force) (see here). But as that link highlights, this isn't really recommended from the git-annex side, and reconfiguration with git annex enableremote is the preferred way. I'll try to find some examples and documentation on this.

EDIT: This is a useful example for reconfiguring a special remote. And I also just found that there is a very recent git-annex command that lets you rename the old special remote, which would be much easier than what I outlined above: https://git-annex.branchable.com/git-annex-renameremote/
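As a rough sketch of the two routes described above (not an official DataLad workflow), the git-annex commands could be wrapped as below. The helper names run, free_name_by_rename, purge_special_remote and show_known_remotes, as well as the APPLY variable, are made up for illustration; only the git annex subcommands come from the explanation above. By default nothing is executed and the commands are only printed.

```shell
# Dry-run wrapper (hypothetical): prints each command and executes it
# only when APPLY=1 is set in the environment.
run() {
    echo "+ $*"
    if [ "${APPLY:-0}" = "1" ]; then
        "$@"
    fi
}

# List everything git-annex still tracks, including special remotes
# that `datalad siblings` and `git remote -v` no longer show.
show_known_remotes() {
    run git annex info
}

# Route 1 (less invasive): rename the stale special remote so that its
# old name becomes available again.
# $1: current special remote name, $2: new name for the stale record
free_name_by_rename() {
    run git annex renameremote "$1" "$2"
}

# Route 2 (not recommended by git-annex): mark the special remote as
# dead, then forcefully purge dead remotes from the git-annex branch.
# $1: special remote name
purge_special_remote() {
    run git annex dead "$1"
    run git annex forget --drop-dead --force
}
```

For example, free_name_by_rename ria-storage ria-storage-old prints the rename command for review; running it with APPLY=1 inside the dataset would actually execute it. Whether create-sibling-ria then accepts the freed name is an assumption worth verifying on a throwaway dataset first.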

The TypeError is bad, thanks much for the report, I will look into it.

The errors you mention during cloning also deserve an apology: they're on our list of known annoyances and we plan to remove them, but haven't gotten to it yet. Sorry about them.

The failure to push is curious, because the create-sibling-ria command would configure a publication dependency between the ria remote that you are pushing to and the ria-storage special remote. This is what it looks like in the dataset's .git/config file (last line):

[remote "ria-storage"]
    annex-externaltype = ora
    annex-uuid = 3c97b679-851e-4f3c-8891-80ca56e9bb2b
    skipFetchAll = true
    annex-cost = 100.0
    annex-availability = GloballyAvailable
[remote "ria"]
    annex-ignore = true
    url = /tmp/ria/c0a/51fd6-981c-4170-852f-7b69be4f1867
    fetch = +refs/heads/*:refs/remotes/ria/*
    datalad-publish-depends = ria-storage

Can you check whether this configuration exists for your sibling as well? My suspicion is that this configuration is missing, so it only pushes --to ria what is in Git, and ignores all annexed contents. If the configuration is in place, have you tried an explicit push --to ria-storage?
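One way to run this check from inside the dataset, as a minimal sketch: check_publish_depends is a hypothetical helper that only reads the remote.<name>.datalad-publish-depends key shown in the .git/config excerpt above.

```shell
# Hypothetical helper: report the publication dependency configured
# for a sibling. Run it from inside the dataset.
# $1: sibling name, e.g. ria
check_publish_depends() {
    # `git config` exits non-zero when the key is missing, hence `|| true`
    dep="$(git config --get-all "remote.$1.datalad-publish-depends" || true)"
    if [ -n "$dep" ]; then
        echo "$1 depends on: $dep"
    else
        echo "$1 has no datalad-publish-depends configured"
    fi
}
```

Calling check_publish_depends ria should report ria-storage if the dependency is in place. If it is missing, datalad siblings configure -s ria --publish-depends ria-storage should restore it, and datalad push --to ria-storage pushes the annexed content explicitly.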

adswa commented 1 year ago

The reason for the TypeError lies here: https://github.com/datalad/datalad/blob/40332b5ad25bf8744f7399f6c3575f7d28f71384/datalad/distributed/create_sibling_ria.py#L622-L653

When create-sibling-ria fails because the storage sibling already exists, it attempts a reconfiguration on its own. But then further down, it attempts to get the special remote's UUID from .git/config, which had been removed from there, and thus returns the None that later sends the configuration command into a TypeError. If the special remote sibling had not been removed with datalad siblings remove -s ria-storage, I believe your code would actually work.

jennydaman commented 1 year ago

@adswa thank you for the thorough explanation.

git annex is still a mystery to me and I haven't felt this confused since learning git for the first time. I wish there was an easy way to cleanly reset the datalad siblings and git annex remotes (but keep my data and datalad history). I've tried deleting and then repushing the 10GB RIA location 3 times but it seems to just make things worse. Well, everything works, but I keep accumulating more and more orphaned "git annex things."