datalad / datalad-osf

DataLad extension to interface with the Open Science Framework

Publication dependency is lost on cloning, pushing from clone fails with mode `export` #201

Open jsheunis opened 3 months ago

jsheunis commented 3 months ago

I am not sure whether this is a bug or the intended behaviour when cloning a dataset from OSF. It was encountered on Linux during a DataLad workshop by @charlottemock (thanks!).

Environment

Create dataset, add file, save to git

> datalad create osfbla
create(ok): /Users/jsheunis/Documents/psyinf/Data/osfbla (dataset)

> cd osfbla
> echo 'kaas' > k.txt
> datalad save --to-git
add(ok): k.txt (file)
save(ok): . (dataset)
action summary:
  add (ok: 1)
  save (ok: 1)

Create osf sibling, check .git/config, push to sibling

> datalad create-sibling-osf --title YODA --mode export -s osf --public
create-sibling-osf(ok): https://osf.io/gfp9r/
[INFO   ] Configure additional publication dependency on "osf-storage"
configure-sibling(ok): . (sibling)

> cat .git/config
[core]
    repositoryformatversion = 0
    filemode = true
    bare = false
    logallrefupdates = true
    ignorecase = true
    precomposeunicode = true
[annex]
    uuid = 9d551f14-e086-452b-9ec9-1fb96923836d
    version = 10
[filter "annex"]
    smudge = git-annex smudge -- %f
    clean = git-annex smudge --clean -- %f
    process = git-annex filter-process
[remote "osf-storage"]
    annex-externaltype = osf
    annex-uuid = b1ae33ec-5144-4625-af3e-36dfc4174b1c
    skipFetchAll = true
    annex-cost = 200.0
    annex-availability = GloballyAvailable
[remote "osf"]
    annex-ignore = true
    url = osf://gfp9r
    fetch = +refs/heads/*:refs/remotes/osf/*
    datalad-publish-depends = osf-storage

> datalad push --to osf
copy(ok): .datalad/.gitattributes (dataset)
copy(ok): .datalad/config (dataset)
copy(ok): .gitattributes (dataset)
copy(ok): k.txt (dataset)
publish(ok): . (dataset) [refs/heads/main->osf:refs/heads/main [new branch]]
publish(ok): . (dataset) [refs/heads/git-annex->osf:refs/heads/git-annex [new branch]]

Check online if the result is as expected

https://osf.io/gfp9r/

Yes it is

Clone from published OSF url into different location of the same system

> datalad clone osf://gfp9r cloned_osfbla
[INFO   ] Remote origin uses a protocol not supported by git-annex; setting annex-ignore
install(ok): /Users/jsheunis/Documents/psyinf/Data/cloned_osfbla (dataset)

> cd cloned_osfbla

> cat .git/config
[core]
    repositoryformatversion = 0
    filemode = true
    bare = false
    logallrefupdates = true
    ignorecase = true
    precomposeunicode = true
[remote "origin"]
    url = osf://gfp9r
    fetch = +refs/heads/*:refs/remotes/origin/*
    annex-ignore = true
[branch "main"]
    remote = origin
    merge = refs/heads/main
[annex]
    uuid = d4c036e4-6fad-46e9-8066-cb13f25bcfc5
    version = 10
[filter "annex"]
    smudge = git-annex smudge -- %f
    clean = git-annex smudge --clean -- %f
    process = git-annex filter-process
[remote "osf-storage"]
    annex-externaltype = osf
    annex-uuid = b1ae33ec-5144-4625-af3e-36dfc4174b1c

Here we can see that the publication dependency is missing in the clone. Also, the sibling is named origin rather than osf (see the datalad siblings call below). I don't know whether either of these (the missing publication dependency and the changed sibling name) is intentional or a problem. I couldn't find informative docs about this.

> datalad siblings
.: here(+) [git]
.: osf-storage(+) [osf]
.: origin(-) [osf://gfp9r (git)]
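
To check for the missing dependency without reading the whole file, a small helper along these lines could be used. It is only a sketch: `publish_depends` is a hypothetical name, and it assumes the simple one-pair-per-line `key = value` layout shown in the config listings above. Inside a repository, `git config --get remote.origin.datalad-publish-depends` should report the same setting.

```shell
# Sketch: print the datalad-publish-depends value(s) of one remote by
# reading a git config file directly. publish_depends is a hypothetical
# helper name; it assumes one "key = value" pair per line.
publish_depends() {
    config_file=$1
    remote=$2
    awk -v want="[remote \"$remote\"]" '
        /^\[/ { in_section = ($0 == want) }   # a new section header starts
        in_section && $1 == "datalad-publish-depends" { print $3 }
    ' "$config_file"
}
```

With the two configs shown above, `publish_depends .git/config osf` in the original dataset would print `osf-storage`, while `publish_depends .git/config origin` in the fresh clone would print nothing.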

Add the publication dependency explicitly

> datalad siblings -s origin --publish-depends osf-storage configure
[INFO   ] Configure additional publication dependency on "osf-storage"
.: origin(-) [osf://gfp9r (git)]

This works fine, as confirmed by inspecting .git/config.

Add changes in the clone, push to origin

> echo 'kaaskoek' > kk.txt

> datalad save --to-git
add(ok): kk.txt (file)
save(ok): . (dataset)
action summary:
  add (ok: 1)
  save (ok: 1)

> datalad push --to origin
publish(error): . (dataset) [refuse to export to osf-storage, because the last known export came from another repo (9d551f14-e086-452b-9ec9-1fb96923836d). Use --force=export to enforce the export anyway.]
publish(ok): . (dataset) [refs/heads/git-annex->origin:refs/heads/git-annex b5a85aa..8250143]
publish(ok): . (dataset) [refs/heads/main->origin:refs/heads/main 3142c1a..1f70404]
action summary:
  publish (error: 1, ok: 2)

This publish(error) is the second part of the issue. If I pass --force=export to the push, the additional change is pushed successfully.
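
The UUID in the error message (9d551f14-e086-452b-9ec9-1fb96923836d) is the annex UUID of the original dataset, while the clone has its own (d4c036e4-6fad-46e9-8066-cb13f25bcfc5), which seems to be why the export is refused. The guard can be modelled roughly as below. This is an illustration of the behaviour described by the error message, not DataLad's or git-annex's actual implementation; `export_allowed` and its arguments are hypothetical.

```shell
# Hypothetical model of the export guard; NOT DataLad's actual code.
# Usage: export_allowed <uuid of last exporting repo> <uuid of this repo> [force]
export_allowed() {
    last_export_uuid=$1
    here_uuid=$2
    force=${3:-}
    # No export recorded yet, or this repo made the last export: allowed.
    if [ -z "$last_export_uuid" ] || [ "$last_export_uuid" = "$here_uuid" ]; then
        return 0
    fi
    # The last export came from another repo: allowed only when forced.
    [ "$force" = "export" ]
}
```

Under this model the first push from the original dataset passes, the push from the clone is refused, and the same push with the force value `export` goes through, which matches what was observed.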

Another note: if I don't configure the publication dependency in the clone, then save a change and push it to origin, the git refs and history are pushed, but not the actual file. Neither this behaviour nor the need for the --force=export flag (and the reason for it) is documented anywhere that I could find.
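
Putting the two workarounds together, the clone-side steps might be scripted as follows. This is a sketch: `run`, `push_from_clone` and `DRY_RUN` are hypothetical names, the sibling names are the ones from this report, and the two `datalad` calls are the ones shown above, with `--force=export` added as suggested by the error message.

```shell
# Sketch of the clone-side workaround from this report.
# With DRY_RUN set, commands are printed instead of executed.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}

push_from_clone() {
    sibling=${1:-origin}        # git sibling created by the clone
    storage=${2:-osf-storage}   # special remote holding the exported files
    # Restore the publication dependency that the clone did not inherit.
    run datalad siblings -s "$sibling" --publish-depends "$storage" configure || return
    # Push, overriding the "last known export came from another repo" check.
    run datalad push --to "$sibling" --force=export
}
```

Running `DRY_RUN=1 push_from_clone` first shows what would be executed. Since `--force=export` overrides the check quoted in the error, it is probably best used only when the clone is known to contain everything the earlier export did.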

jsheunis commented 3 months ago

@datalad/developers is this all expected behaviour? If so, I think it makes sense to improve documentation (docs and docstrings) to make the use of --publish-depends and --force=export after cloning clearer for users. If not, where should we be looking to improve this?

adswa commented 2 months ago

Just leaving a few quick notes from the office hour:

mslw commented 2 months ago

And one more:

jsheunis commented 2 days ago

Thanks for the investigation and info, @adswa and @mslw. So it seems like everything is expected behaviour technically, but not necessarily intuitive for a new user. I agree that updating the export mode use case docs would be useful.