Closed NickleDave closed 4 months ago
I think I was able to get this to work, writing down what I did in case it helps with this section / helps someone else (or in case someone wants to tell me I did it wrong)
I ran datalad create --force in the directory with the data, and datalad save, as in https://handbook.datalad.org/en/latest/basics/101-139-s3.html#your-datalad-dataset
Then I set up the special remote without public=True:
git annex initremote public-s3 type=S3 encryption=none \
bucket=name-of-the-bucket-you-want-git-annex-to-make-for-you datacenter=US autoenable=true
git annex enableremote public-s3 \
publicurl="https://name-of-the-bucket-you-want-git-annex-to-make-for-you.s3.amazonaws.com"
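Both commands above depend on the same bucket name, so a small wrapper can keep them consistent. The sketch below is only a dry run: it builds and prints the two commands rather than executing them. The variable names and the print_s3_remote_commands function are hypothetical, and the URL pattern is simply the one used in the enableremote call above.

```shell
#!/bin/sh
# Dry-run sketch: print the two git-annex commands from the report for a given bucket.
# BUCKET and REMOTE are placeholders; nothing is executed against AWS.
BUCKET="name-of-the-bucket-you-want-git-annex-to-make-for-you"
REMOTE="public-s3"

# Public URL in the same form as in the enableremote call above.
PUBLIC_URL="https://${BUCKET}.s3.amazonaws.com"

print_s3_remote_commands() {
    # Step 1: create the special remote (no public= parameter).
    echo "git annex initremote ${REMOTE} type=S3 encryption=none bucket=${BUCKET} datacenter=US autoenable=true"
    # Step 2: record the public URL on the existing remote.
    echo "git annex enableremote ${REMOTE} publicurl=\"${PUBLIC_URL}\""
}

print_s3_remote_commands
```

Printing first makes it easy to check the bucket name and URL before anything is created on the S3 side.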
git remote add upstream git@github.com:me/dataset.git
(I also tried to use the alternative datalad siblings add --dataset . --name upstream --url git@github.com:me/dataset.git but got a cryptic error)
Instead of setting --publish-depends on the S3 store in one step as in the walkthrough (datalad create-sibling-github -d . neuro-data-s3 --publish-depends public-s3), I needed to set the --publish-depends option separately. It was not clear to me that I needed to use a sub-command for this: datalad siblings configure. It might help to add some examples of using these sub-commands to the man page and docs. But since the docstring for the --publish-depends option says that it is equivalent to setting a config, I finally figured out that this was what I needed to do:
datalad siblings configure -s upstream --publish-depends public-s3
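Since the docstring describes --publish-depends as equivalent to setting a config item, one way to check the result is to look at the git config directly. The sketch below assumes the key is remote.<name>.datalad-publish-depends (my reading of the datalad siblings documentation; please verify against your DataLad version). It sets the key by hand in a throwaway repository instead of calling DataLad, so it only illustrates where the setting lives.

```shell
#!/bin/sh
# Scratch repository so that no real dataset is touched.
SCRATCH="$(mktemp -d)"
git init -q "$SCRATCH"
cd "$SCRATCH" || exit 1

# Same remote as in the report; adding it does not contact GitHub.
git remote add upstream git@github.com:me/dataset.git

# Assumed equivalent of:
#   datalad siblings configure -s upstream --publish-depends public-s3
git config --add remote.upstream.datalad-publish-depends public-s3

# Read the dependency back from the repository config.
DEPENDS="$(git config --get-all remote.upstream.datalad-publish-depends)"
echo "upstream publish-depends: ${DEPENDS}"
```

In a real dataset, only the final git config query would be needed to confirm that the dependency was recorded.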
Having done that, I was able to datalad push --to upstream and have datalad/git-annex push the annexed contents to the special remote.
Then I took off all "block public access" settings in S3 for the bucket. I did not add a bucket policy that allows access (as described in this comment on the git-annex wiki)
mkdir tmp
cd tmp
datalad clone git@github.com:me/dataset.git
cd dataset
datalad get . -r
Anecdotally I found that it was much faster in the terminal where I had AWS credentials as env variables compared to a second test in a new terminal. It's not clear to me if this is because of the credentials or a connection issue or AWS is throttling me for some other reason.
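To make the two terminals comparable, it may help to record whether AWS credentials are exported before each timing run. This is a minimal sketch, assuming the credentials are provided through the standard AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables; the function name is hypothetical.

```shell
#!/bin/sh
# Succeeds only if both credential variables are exported (visible to child processes).
aws_creds_exported() {
    [ -n "$(printenv AWS_ACCESS_KEY_ID)" ] && [ -n "$(printenv AWS_SECRET_ACCESS_KEY)" ]
}

if aws_creds_exported; then
    echo "AWS credentials are exported in this shell"
else
    echo "no AWS credentials exported in this shell"
fi

# For an anonymous test without opening a new terminal, the variables can be
# dropped for a single command (not run here), e.g.:
#   env -u AWS_ACCESS_KEY_ID -u AWS_SECRET_ACCESS_KEY datalad get . -r
```

This does not explain the speed difference by itself, but it removes one unknown when comparing runs.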
Also noting I got this cryptic message when I datalad clone the dataset, not sure what it means:
[INFO ] Unable to parse git config from origin
[INFO ] Remote origin does not have git-annex installed; setting annex-ignore
[INFO ] This could be a problem with the git-annex installation on the remote. Please make sure that git-annex-shell is available in PATH when you ssh into the remote. Once you have fixed the git-annex installation, run: git annex enableremote origin
:thinking: I added a bucket policy allowing GET as in this comment, and the download seems to be faster even without my AWS credentials exported to env variables. It is still not clear whether that's an AWS thing or I just had a bad connection.
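For readers who have not written one before, a bucket policy that allows anonymous GET typically looks like the one generated below. This is a generic public-read policy following the usual AWS pattern, not a copy of the policy in the linked comment; the bucket name, the Sid, and the file location are placeholders.

```shell
#!/bin/sh
BUCKET="name-of-the-bucket-you-want-git-annex-to-make-for-you"
POLICY_FILE="${TMPDIR:-/tmp}/public-read-policy.json"

# Write a policy that lets anyone download (GET) objects, and nothing else.
cat > "$POLICY_FILE" <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowPublicGet",
      "Effect": "Allow",
      "Principal": "*",
      "Action": "s3:GetObject",
      "Resource": "arn:aws:s3:::${BUCKET}/*"
    }
  ]
}
EOF

echo "wrote ${POLICY_FILE}"
# The policy could then be applied with the AWS CLI (not run here):
#   aws s3api put-bucket-policy --bucket "$BUCKET" --policy "file://$POLICY_FILE"
```

Note that S3 rejects a public policy like this while the "block public access" settings are still enabled for the bucket.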
Hi @NickleDave, thanks a lot for this thorough report! This will come in handy when the chapter gets updated, which definitely seems to be necessary. And sorry for the long wait for a response btw.
Thank you @NickleDave, this is a very thorough report and I think you ended up making all the right choices! I agree that the walkthrough needs an update, though it likely won't be a complete overhaul - mostly to address publicurl being required and public being deprecated. Other rough edges that you point out might be worth addressing, too.
Making a few additional notes:
The publicurl behaviour was introduced in this git-annex commit on 21 Jul 2023. This means that the git-annex currently in Debian stable (10.20230126-3), which I use by default, does not handle this case properly (get does not recognize publicurl when public=no; git annex info remote-name does not show the public url). I tried with the latest snapshot and everything works properly.
Regarding the questions:
:question: Clone INFO messages:
Also noting I got this cryptic message when I datalad clone the dataset, not sure what it means
[INFO ] Unable to parse git config from origin
[INFO ] Remote origin does not have git-annex installed; setting annex-ignore
[INFO ] This could be a problem with the git-annex installation on the remote. Please make sure that git-annex-shell is available in PATH when you ssh into the remote. Once you have fixed the git-annex installation, run: git annex enableremote origin
I agree that this message can be confusing, especially the third line. It shows up because origin is only a git remote (as opposed to a git-annex special remote) and cannot be used to retrieve annexed contents; annex-ignore=True is written to the respective section of the repository's .git/config so that git-annex won't even attempt to check it for annexed content. In this case, this is all to be expected, so there is no problem and the third line can be ignored. Note that there are some remotes which can handle git + annex.
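To see the setting in a clone, the repository's git config can be queried directly. The sketch below uses a scratch repository and sets the value by hand, as a stand-in for what git-annex does during the clone. The key name remote.origin.annex-ignore follows the message above, and the URL is the placeholder used in this thread.

```shell
#!/bin/sh
# Scratch repository standing in for a fresh clone.
SCRATCH="$(mktemp -d)"
git init -q "$SCRATCH"
cd "$SCRATCH" || exit 1
git remote add origin git@github.com:me/dataset.git

# Stand-in for what git-annex records when the remote cannot serve annexed content.
git config remote.origin.annex-ignore true

# In a real clone, only this query is needed:
IGNORE="$(git config --get remote.origin.annex-ignore)"
echo "remote.origin.annex-ignore = ${IGNORE}"

# Show which config file holds the value.
git config --show-origin --get remote.origin.annex-ignore
```

If the value is true for origin, annexed content is expected to come from another sibling, here the S3 special remote.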
:question: Siblings add:
(I also tried to use the alternative datalad siblings add --dataset . --name upstream --url git@github.com:me/dataset.git but got a cryptic error)
All I can offer is that it works for me (in that it creates a correct remote configuration) - but boy oh boy is there a lot of messaging... May be worth a separate DataLad issue if it shows in the latest version (I'm on 0.19.6). Here's what I saw:
Thanks a lot for the detailed issue and proposed solutions! I have put up a PR which should fix it, and recognized the contribution in https://github.com/datalad-handbook/book/pull/1232. Please feel free to add yourself to this repository's .zenodo.json file (second to last position, before Michael Hanke), or leave your details (name, orcid, affiliation) as a comment and I'll add you, @NickleDave :)
Hi, thank you for Datalad and the handbook.
I am working through the walkthrough on setting up S3 as a special remote, using my own small dataset.
When I run the git annex initremote command as shown there, I get an error. I think this is the same issue as described here on the git annex wiki that is linked to from that chapter of the handbook: https://git-annex.branchable.com/special_remotes/S3/#comment-fcfba0021592de4c1425d3bf3c9563d3
I haven't wrapped my head around the AWS policy-ese they outline in the suggested fix but I thought I should go ahead and report the issue here