jcohenadad opened this issue 4 years ago
For public datasets using DataLad to download this is a good option. You can use DataLad to get newer snapshots as they are published. With the CLI tool, you're correct, it isn't possible to update the local copy like this yet.
Thanks! I hadn't thought of the datalad CLI.
Can datalad be used on a dataset that was downloaded with openneuro download? I see a .datalad folder there, so I assume the answer is yes?
EDIT 2020-06-23 15:37:27: Actually, I tried a datalad command inside the local repo and it doesn't seem to be recognized as a datalad dataset, or am I missing something?
julien-macbook:~/data/ds002900-download $ datalad status
[ERROR ] No dataset found at '/Users/julien/data/ds002900-download'. Specify a dataset to work with by providing its path via the `dataset` option, or change the current working directory to be in a dataset. [dataset.py:require_dataset:568] (NoDatasetFound)
usage: datalad status [-h] [-d DATASET] [--annex [MODE]] [--untracked MODE]
[-r] [-R LEVELS] [-e {no|commit|full}] [-t {raw|eval}]
[PATH [PATH ...]]
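The error above is expected for a plain `openneuro download` copy: datalad (and git-annex underneath it) require a .git directory, which the downloader does not create. A quick sanity check, sketched with an illustrative helper (the directory name is just an example):

```shell
# A datalad dataset is a git repository; a plain `openneuro download` copy is not.
# Simulate the downloaded layout and check for .git before running datalad commands.
mkdir -p plain-download                     # what openneuro download produces: no .git

is_datalad_ready() {
    # prints whether datalad/git-annex commands can work in the given directory
    if [ -d "$1/.git" ]; then
        echo "git repo"
    else
        echo "plain copy (no .git)"
    fi
}

is_datalad_ready plain-download             # prints "plain copy (no .git)"
```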
I'm skeptical, @jcohenadad. The datalad repos on GitHub are exports, and the copy downloaded from openneuro.org doesn't have a .git folder.
However, you can achieve what you want with some trickery. Try setting up a directory-based import remote:
$ git clone https://github.com/OpenNeuroDatasets/ds00XYZW
$ cd ds00XYZW
$ git annex initremote ds00XYZW-local type=directory directory=~/data/ds00XYZW encryption=none importremote=yes exportremote=yes
$ git annex import --from=ds00XYZW-local
This will save you the bandwidth of having to download the actual content all over again.
The docs suggest you can do the syncing step with git annex merge ds00XYZW-local/master.
@kousu trying your trick, I am not able to retrieve the physical files locally:
git clone https://github.com/OpenNeuroDatasets/ds002900
cd ds002900
git annex initremote ds002900-local type=directory directory=~/data/ds002900 encryption=none importremote=yes exportremote=yes
git annex import --from=ds002900-local
# checking if the physical files are there (not their links)
du -sh sub-unf
36K sub-unf
# I would have expected a few MB instead
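The small `du` number makes sense if the working tree contains only annex pointers: in a git-annex repo the tracked files are symlinks into .git/annex/objects, and a symlink whose content was never fetched is broken. A plain `find` can spot those (a toy layout below; in the real repo you would also ask `git annex whereis`):

```shell
# Toy layout mimicking an annexed dataset: one symlink with content, one without.
mkdir -p demo/.git/annex/objects demo/sub-unf
echo "nifti bytes" > demo/.git/annex/objects/KEY1
ln -s ../.git/annex/objects/KEY1 demo/sub-unf/present.nii.gz
ln -s ../.git/annex/objects/KEY2 demo/sub-unf/missing.nii.gz   # content never fetched

# Broken symlinks = annexed files whose content is not present locally.
find demo/sub-unf -type l ! -exec test -e {} \; -print   # prints demo/sub-unf/missing.nii.gz
```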
@jcohenadad what does git annex whereis say?
What about git annex import --from=ds002900-local master?
julien-macbook:~/data/ds002900 $ git annex import --from=ds002900-local master
git-annex: That remote does not support imports.
The heat got to me again. Delete the cloned repo, get it again, and then do:
$ git annex initremote ds002900-local type=directory directory=~/data/ds002900 encryption=none importtree=yes exporttree=yes
(I misspelled the options earlier: they are importtree/exporttree, not importremote/exportremote.)
hum...
The missing ds002900 snapshot should be available from GitHub now.
So, assuming this is not possible, I went ahead and created a new datalad repo:
datalad install https://github.com/OpenNeuroDatasets/ds002900.git
cd ds002900
# download all data locally
datalad get .
Then I tried to make a change and push it back to openneuro, but I ran into the following issue:
# create dummy change
echo test >> CHANGES
# try to push
datalad publish
[INFO ] Will publish updated git-annex
[INFO ] Publishing Dataset(/Users/julien/data/ds002900) to origin
CommandError: 'git push --progress --porcelain origin master git-annex' failed with exitcode 128 under /Users/julien/data/ds002900
remote: Permission to OpenNeuroDatasets/ds002900.git denied to jcohenadad.
fatal: unable to access 'https://github.com/OpenNeuroDatasets/ds002900.git/': The requested URL returned error: 403
Should I switch back to the openneuro-cli command for uploads? If so, will the symlinks under each sub- folder be a problem? (As opposed to having the "real" files under the sub- folders when using openneuro download.)
Symlinks are uploaded as though they are the target (dereferenced) so it shouldn't be an issue. One other thing to be aware of though, the CLI uploader strips git metadata out, so this will generate a new commit on the OpenNeuro side regardless of your local repository state. Pulling changes down again will result in conflicting history. We would like to support a native git push or merge request model to get around that limitation but it's not on the development roadmap currently.
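The dereferencing behaviour is easy to convince yourself of: reading a symlink (as any uploader does when it opens the path) yields the target's bytes, not the link itself. A minimal sketch with made-up filenames:

```shell
# Symlinks read through to their target, so an uploader that simply opens
# each path sends the real content, not the pointer.
echo "real image bytes" > target.nii.gz
ln -s target.nii.gz sub-01_T1w.nii.gz   # hypothetical annex-style link
cat sub-01_T1w.nii.gz                   # prints "real image bytes"
```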
Following on from the previous comment, I went ahead with openneuro-cli and ran into the problem of the physical files under .git/annex being uploaded:
openneuro upload -i --dataset ds002900 ds002900/ > log_20200623165320.txt
Here is the log file (partial, because I had to stop it after realizing the whole dataset was being uploaded): log_20200623165320.txt
So, does it mean that the datalad approach suggested here does not work for syncing an openneuro dataset in both directions, and that the only way to maintain a dataset (download, sync, modify, upload) is to do a fresh download into a new local directory each time (i.e. download, modify, upload)?
I figured out a method for you to use your local non-datalad content with the remote datalad changelog, @jcohenadad
I made a test dataset:
mkdir ds1
cd ds1
echo "Test data" > README.md
dd if=/dev/urandom of=brain1.nii.gz count=1M bs=1
cd ..
Then I made a git-annex copy of it (to simulate the https://openneuro.org -> git://github.com/OpenNeuroDatasets link):
cp -rp ds1 ds1-github
cd ds1-github
git init
git annex init
git add README.md
git annex add *.nii.gz
git commit -m "version 1.0.0"
cd ..
Then I simulated changes made by a collaborator (in reality they would be making these changes on their own computers and then using openneuro upload, but it amounts to the same thing):
cd ds1-github
git annex unlock brain1.nii.gz
dd if=/dev/urandom of=brain1.nii.gz count=1M bs=1
git annex add brain1.nii.gz
git annex lock brain1.nii.gz
git annex unlock brain2.nii.gz
dd if=/dev/urandom of=brain2.nii.gz count=1M bs=1
git annex add brain2.nii.gz
git annex lock brain2.nii.gz
git add -u
git commit -m "version 1.0.1"
cd ..
Then I cloned that (to simulate the github -> local link)
git clone ds1-github ds1-clone
Now, I make a change to the non-git copy to simulate the edits you've made that you're unsure about:
cd ds1
dd if=/dev/urandom of=brain2.nii.gz count=1M bs=1
dd if=/dev/urandom of=brain3.nii.gz count=1M bs=1
cd ..
Now, merging these copies.
I use rsync here. It should be possible to use https://git-annex.branchable.com/special_remotes/directory/ in remotetree=yes mode to make merging easier.
cd ds1-clone
git annex unlock .
rsync -av ../ds1/ . # the trailing slash is key!
Now you can see the changes:
$ git diff
diff --git a/brain1.nii.gz b/brain1.nii.gz
index 75c6ebf..d8378e3 100644
--- a/brain1.nii.gz
+++ b/brain1.nii.gz
@@ -1 +1 @@
-/annex/objects/SHA256E-s1048576--3bf87d648c02937ba358a6699b329116b21b0adf7759b4ff3f4088e5cac73014.nii.gz
+/annex/objects/SHA256E-s1048576--145a9ced442f37717baebf56347449698f6b97173eea449da35afdc9db6a0fbe.nii.gz
diff --git a/brain2.nii.gz b/brain2.nii.gz
index a84c243..6321911 100644
--- a/brain2.nii.gz
+++ b/brain2.nii.gz
@@ -1 +1 @@
-/annex/objects/SHA256E-s1048576--5e29e15b94912e0a5b5c7afc2c7d247bca4d1dc20547b8cf3289dd69b487d217.nii.gz
+/annex/objects/SHA256E-s1048576--c279d83daf61febb748749808f84c160a5c82dfaf7a9090455db4d90bd422de8.nii.gz
$ git status
On branch master
Your branch is up to date with 'origin/master'.
Changes to be committed:
(use "git restore --staged <file>..." to unstage)
typechange: brain1.nii.gz
typechange: brain2.nii.gz
Changes not staged for commit:
(use "git add <file>..." to update what will be committed)
(use "git restore <file>..." to discard changes in working directory)
modified: brain1.nii.gz
modified: brain2.nii.gz
Untracked files:
(use "git add <file>..." to include in what will be committed)
brain3.nii.gz
You must carefully investigate the merge at this point. Any file that is marked as changed might be one you are updating, but it might also be one you are downgrading. brain1.nii.gz is a downgrade, because you weren't synced with what your collaborator did before you made your own edits; brain2.nii.gz is an upgrade, because you made that edit on purpose. You can use git log (or better: git log --stat) to see which files have been touched lately to figure this out. You can use git reset -- brain1.nii.gz && git checkout -- brain1.nii.gz to undo the downgrade.
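The undo works because at this point the downgrade exists only in the working tree (and possibly the index), while the collaborator's newer version is still in HEAD. A toy repo illustrates the mechanics (plain git; in the real dataset the file would be an annex pointer, but reset/checkout behave the same):

```shell
# Toy repo: commit the "new" content, clobber it with a stale copy, then undo.
git init undo-demo && cd undo-demo
git config user.email you@example.com && git config user.name you
echo "collaborator v1.0.1" > brain1.nii.gz
git add brain1.nii.gz && git commit -m "collaborator update"
echo "my stale copy" > brain1.nii.gz         # the accidental downgrade (rsync'd in)
git add brain1.nii.gz                        # it may even be staged already
git reset -- brain1.nii.gz                   # unstage the downgrade
git checkout -- brain1.nii.gz                # restore the committed version
cat brain1.nii.gz                            # prints "collaborator v1.0.1"
```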
Then, to make sure you have a complete copy of the dataset:
$ git annex whereis brain1.nii.gz
whereis brain1.nii.gz (1 copy)
7ace0d5d-d595-4fc3-acda-83e2add5df04 -- kousu@laptop:~/ds1-github [origin]
ok
git annex get .
git annex unlock .
After this you should have a copy of the dataset that looks like the non-datalad copy, but with the merges applied:
$ ls -l
total 3088
-rw-r--r-- 1 kousu 1048576 Jun 25 13:22 brain1.nii.gz
-rw-r--r-- 1 kousu 1048576 Jun 25 13:16 brain2.nii.gz
-rw-r--r-- 1 kousu 1048576 Jun 25 13:16 brain3.nii.gz
-rw-r--r-- 1 kousu 10 Jun 25 12:45 README.md
Now, to reupload the changes from here, you have to use openneuro-cli; you can't use git (yet?).
First make sure you have the tool installed:
npm install -g openneuro-cli && openneuro --help
In order to merge any changes that have happened to your dataset remotely, continue working from ds1-clone, but keep it "unlocked" (which means the files contain their content instead of pointers into .git/annex), and get rid of the git parts:
mv {.datalad,.git,.gitattributes} ..
openneuro upload -d ds1 ./
mv ../{.datalad,.git,.gitattributes} .
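Since forgetting the second mv (or losing the metadata to a crash mid-upload) would be painful, the move/upload/restore dance can be wrapped so the restore always runs. This is a hypothetical helper, not part of openneuro-cli; the UPLOADER variable is overridable so you can dry-run it:

```shell
# Hypothetical wrapper around the mv / openneuro upload / mv dance.
# Restores .datalad, .git and .gitattributes even if the upload fails.
upload_without_git_metadata() {
    ds="$1"                                   # OpenNeuro dataset id, e.g. ds002939
    : "${UPLOADER:=openneuro upload -i -d}"   # override for testing/dry runs
    mv .datalad .git .gitattributes .. 2>/dev/null || true
    $UPLOADER "$ds" ./
    status=$?
    mv ../.datalad ../.git ../.gitattributes . 2>/dev/null || true
    return $status
}
```

Run it from inside the unlocked dataset directory; for a dry run, something like `UPLOADER="echo would-upload"; upload_without_git_metadata ds002939` just prints the arguments instead of uploading.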
Here is the same process using git annex import
:
I made a test dataset:
mkdir ds1
cd ds1
echo "Test data" > README.md
dd if=/dev/urandom of=brain1.nii.gz count=1M bs=1
cd ..
Then I made a git-annex copy of it (to simulate the https://openneuro.org -> git://github.com/OpenNeuroDatasets link):
cp -rp ds1 ds1-github
cd ds1-github
git init
git annex init
git add README.md
git annex add *.nii.gz
git commit -m "version 1.0.0"
cd ..
Then I simulated changes made by a collaborator (in reality they would be making these changes on their own computers and then using openneuro upload, but it amounts to the same thing):
cd ds1-github
git annex unlock brain1.nii.gz
dd if=/dev/urandom of=brain1.nii.gz count=1M bs=1
git annex add brain1.nii.gz
git annex lock brain1.nii.gz
git annex unlock brain2.nii.gz
dd if=/dev/urandom of=brain2.nii.gz count=1M bs=1
git annex add brain2.nii.gz
git annex lock brain2.nii.gz
git add -u
git commit -m "version 1.0.1"
cd ..
Then I cloned that (to simulate the github -> local link)
git clone ds1-github ds1-clone
Now, I make a change to the non-git copy to simulate the edits you've made that you're unsure about:
cd ds1
dd if=/dev/urandom of=brain2.nii.gz count=1M bs=1
dd if=/dev/urandom of=brain3.nii.gz count=1M bs=1
cd ..
Now, merging these copies.
There's a catch here: git-annex is mostly designed to be in charge of everything. Because it never managed our original repo, we have to use --allow-unrelated-histories, and it decides there are conflicts on brain1.nii.gz (which the collaborator changed, so it is newer and should win) and README.md (which didn't change at all!):
cd ds1-clone
git annex initremote ds1-local type=directory directory=../ds1 encryption=none exporttree=yes importtree=yes
git annex import --from=ds1-local master
git merge --allow-unrelated-histories ds1-local/master
$ git merge --allow-unrelated-histories ds1-local/master
CONFLICT (add/add): Merge conflict in brain2.nii.gz
Auto-merging brain2.nii.gz
CONFLICT (add/add): Merge conflict in brain1.nii.gz
Auto-merging brain1.nii.gz
CONFLICT (add/add): Merge conflict in README.md
Automatic merge failed; fix conflicts and then commit the result.
[kousu@requiem ds1-clone]$ git status
On branch master
Your branch is up to date with 'origin/master'.
You have unmerged paths.
(fix conflicts and run "git commit")
(use "git merge --abort" to abort the merge)
Changes to be committed:
new file: brain3.nii.gz
Unmerged paths:
(use "git add <file>..." to mark resolution)
both added: README.md
both added: brain1.nii.gz
both added: brain2.nii.gz
Again, you have to fix the merge by choosing one version of each file or the other:
git reset README.md
git reset brain1.nii.gz
git annex unlock brain2.nii.gz
cp ../ds1/brain2.nii.gz .
git annex add -u
But I only knew how to do this because I created the conflict myself. At this point, it seems easier to just stick with the rsync method. git annex import is mostly helpful when the remote you're importing from was initially created by git-annex. But maybe for a larger dataset with fewer modifications per file this method is still helpful? I'm unsure at this point; I would need to download a larger dataset to experiment on.
Thank you so much for your help @kousu !
So, I went ahead with a datalad/openneuro-cli approach:
# get the dataset
datalad install https://github.com/OpenNeuroDatasets/ds002939.git
cd ds002939/
# update it with latest version from openneuro
datalad get .
datalad unlock .
# modified participants.tsv
# upload to openneuro using openneuro-cli
mv {.datalad,.git,.gitattributes} ..
openneuro upload -i -d ds002939
P.S. The mv is important; otherwise the .git objects are uploaded, as seen in this log file: log_20200626-164854.txt
On openneuro:
Back to my laptop:
# update datalad
datalad update
# checking if changes from openneuro are propagated:
cat CHANGES
# output: 1.0.0 2020-06-18
# expected (from openneuro website):
# 1.0.2 2020-06-26
# - participants.tsv
# 1.0.0 2020-06-18
#
# - Initial snapshot
I assume this issue is caused by the git-annex database not being updated by openneuro?
In any case, this little experiment demonstrates that the datalad/openneuro-cli workflow is unreliable (at least, the way I did it).
Related to #1659
(sorry if this is a duplicate, I did a quick search but haven't found this problem addressed)
Here is a scenario I am facing:
Problem: I don't see any openneuro CLI command to do that. According to the docs, openneuro download only downloads, and skips files if they already exist (which is not what I want here). Is the only solution to download a fresh 1.0.2, do the modification, and then upload?