datalad-handbook / book

Sources for the DataLad handbook
http://handbook.datalad.org

Issues with walkthrough for S3 special remote #1224

Closed NickleDave closed 1 week ago

NickleDave commented 1 month ago

Hi, thank you for DataLad and the handbook.

I am working through the walkthrough on setting up S3 as a special remote, using my own small dataset.

When I run the git annex initremote command as shown there, I get an error.

$ git annex initremote public-s3 type=S3 encryption=none \
> bucket=$BUCKET public=yes datacenter=EU autoenable=true 
initremote public-s3 (checking bucket...) (creating bucket in EU...) 
git-annex: S3Error {s3StatusCode = Status {statusCode = 400, statusMessage = "Bad Request"}, s3ErrorCode = "InvalidBucketAclWithObjectOwnership", s3ErrorMessage = "Bucket cannot have ACLs set with ObjectOwnership's BucketOwnerEnforced setting", s3ErrorResource = Nothing, s3ErrorHostId = Just "[possibly sensitive info redacted]", s3ErrorAccessKeyId = Nothing, s3ErrorStringToSign = Nothing, s3ErrorBucket = Nothing, s3ErrorEndpointRaw = Nothing, s3ErrorEndpoint = Nothing}
failed
initremote: 1 failed

I think this is the same issue as described here on the git annex wiki that is linked to from that chapter of the handbook: https://git-annex.branchable.com/special_remotes/S3/#comment-fcfba0021592de4c1425d3bf3c9563d3

ACL deprecation vs public=yes

Amazon has deprecated ACLs

A majority of modern use cases in Amazon S3 no longer require the use of ACLs, and we recommend that you disable ACLs except in unusual circumstances where you need to control access for each object individually. With Object Ownership, you can disable ACLs and rely on policies for access control. When you disable ACLs, you can easily maintain a bucket with objects uploaded by different AWS accounts. You, as the bucket owner, own all the objects in the bucket and can manage access to them using policies.

They are encouraging everyone to migrate to bucket policies instead.

I haven't wrapped my head around the AWS policy-ese they outline in the suggested fix, but I thought I should go ahead and report the issue here.

NickleDave commented 1 month ago

I think I was able to get this to work. Writing down what I did in case it helps with this section or helps someone else (or in case someone wants to tell me I did it wrong):

git annex initremote public-s3 type=S3 encryption=none \
bucket=name-of-the-bucket-you-want-git-annex-to-make-for-you datacenter=US autoenable=true
git remote add upstream git@github.com:me/dataset.git

(I also tried to use the alternative datalad siblings add --dataset . --name upstream --url git@github.com:me/dataset.git but got a cryptic error)

datalad siblings configure -s upstream --publish-depends public-s3

Anecdotally I found that it was much faster in the terminal where I had AWS credentials as env variables compared to a second test in a new terminal. It's not clear to me if this is because of the credentials or a connection issue or AWS is throttling me for some other reason.

NickleDave commented 1 month ago

Also noting I got this cryptic message when I datalad clone the dataset, not sure what it means

[INFO   ] Unable to parse git config from origin                                                                                                  
[INFO   ] Remote origin does not have git-annex installed; setting annex-ignore                                                                   
[INFO   ] This could be a problem with the git-annex installation on the remote. Please make sure that git-annex-shell is available in PATH when you ssh into the remote. Once you have fixed the git-annex installation, run: git annex enableremote origin 

NickleDave commented 1 month ago

:thinking: I added a bucket policy allowing GET as in this comment and the download seems to be faster even without my AWS credentials exported to env variables ... not clear still if that's an AWS thing or I just had a bad connection
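For readers following along, a bucket policy that allows anonymous GET requests might look like the sketch below. It is not copied from the linked comment: the bucket name `my-example-bucket`, the `Sid` value, and the temporary file are placeholders chosen for illustration. The script only writes and prints the policy; the AWS CLI call that would apply it is left as a comment because it needs credentials.

```shell
# Hypothetical bucket name; replace with your own.
BUCKET="${BUCKET:-my-example-bucket}"
policy_file="$(mktemp)"

# Bucket policy granting anonymous read (GET) access to every object.
cat > "$policy_file" <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowPublicRead",
      "Effect": "Allow",
      "Principal": "*",
      "Action": "s3:GetObject",
      "Resource": "arn:aws:s3:::${BUCKET}/*"
    }
  ]
}
EOF

# Show the generated policy for review.
cat "$policy_file"

# To apply it (requires the AWS CLI and credentials that may change the
# bucket policy; not run here):
#   aws s3api put-bucket-policy --bucket "$BUCKET" --policy "file://$policy_file"
```

Depending on the account defaults, the bucket's Block Public Access settings may also need to be relaxed before S3 accepts a public policy like this one.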

jsheunis commented 1 month ago

Hi @NickleDave, thanks a lot for this thorough report! This will come in handy when the chapter gets updated, which definitely seems to be necessary. And sorry for the long wait for a response btw.

mslw commented 1 month ago

Thank you @NickleDave, this is a very thorough report and I think you ended up making all the right choices! I agree that the walkthrough needs an update, though it likely won't be a complete overhaul - mostly to address publicurl being required and public being deprecated. Other rough edges that you point out might be worth addressing, too.
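As a rough sketch of what an `initremote` call with `publicurl` instead of `public=yes` could look like: the bucket name and the URL pattern `https://<bucket>.s3.amazonaws.com` below are assumptions and not taken from this thread, so check them against your bucket's region and the git-annex S3 documentation. The snippet only prints the command as a dry run.

```shell
# Hypothetical bucket name; replace with your own.
BUCKET="${BUCKET:-my-example-bucket}"

# Assumed public URL of the bucket (virtual-hosted style); adjust for
# your region or endpoint if needed.
PUBLIC_URL="https://${BUCKET}.s3.amazonaws.com"

# Same parameters as in the report above, with publicurl added and
# public=yes left out.
initremote_cmd="git annex initremote public-s3 type=S3 encryption=none bucket=${BUCKET} datacenter=US publicurl=${PUBLIC_URL} autoenable=true"

# Dry run: print the command instead of executing it.
echo "$initremote_cmd"

# To run it for real inside a dataset, with AWS credentials exported:
#   eval "$initremote_cmd"
```

Anonymous downloads through `publicurl` only work once the bucket itself permits public reads, for example through a bucket policy.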

Making a few additional notes:

mslw commented 1 month ago

Regarding the questions:

:question: Clone INFO messages:

Also noting I got this cryptic message when I datalad clone the dataset, not sure what it means

[INFO   ] Unable to parse git config from origin                                                                                                  
[INFO   ] Remote origin does not have git-annex installed; setting annex-ignore                                                                   
[INFO   ] This could be a problem with the git-annex installation on the remote. Please make sure that git-annex-shell is available in PATH when you ssh into the remote. Once you have fixed the git-annex installation, run: git annex enableremote origin 

I agree that this message can be confusing, especially the third line. It shows up because origin is only a git remote (as opposed to a git-annex special remote) and it cannot be used to retrieve annexed contents; annex-ignore=true is written to the respective remote section of the repository's .git/config so that git-annex won't even attempt to check it for annexed content. In this case, this is all to be expected, so there is no problem and the third line can be ignored. Note that there are some remotes which can handle git + annex.
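To see the setting described above, the remote's configuration can be queried with `git config`. The sketch below uses a throwaway repository and sets the value by hand to imitate what git-annex records; in a real clone git-annex writes it itself. The remote URL is the placeholder from earlier in this thread.

```shell
# Throwaway repository so the example does not touch a real dataset.
repo="$(mktemp -d)"
git init -q "$repo"
git -C "$repo" remote add origin git@github.com:me/dataset.git

# Imitate what git-annex records for a plain Git remote (assumption:
# in a real clone git-annex sets this on its own).
git -C "$repo" config remote.origin.annex-ignore true

# Inspect the setting; prints "true" here.
value="$(git -C "$repo" config --get remote.origin.annex-ignore)"
echo "$value"
```

In an actual dataset, only the `git config --get remote.origin.annex-ignore` call is needed.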

:question: Siblings add:

(I also tried to use the alternative datalad siblings add --dataset . --name upstream --url git@github.com:me/dataset.git but got a cryptic error)

All I can offer is that it works for me (in that it creates a correct remote configuration) - but boy oh boy is there a lot of messaging... May be worth a separate DataLad issue if it shows in the latest version (I'm on 0.19.6). Here's what I saw:

overly verbose siblings add:

```
❱ datalad siblings add --dataset . --name upstream --url git@github.com:mslw/shiny-invention.git
[INFO   ] Could not annex-enable upstream: Unable to parse git config from upstream
Remote upstream does not have git-annex installed; setting annex-ignore
This could be a problem with the git-annex installation on the remote. Please make sure that git-annex-shell is available in PATH when you ssh into the remote. Once you have fixed the git-annex installation, run: git annex enableremote upstream
Unable to parse git config from upstream
Remote upstream does not have git-annex installed; setting annex-ignore
This could be a problem with the git-annex installation on the remote. Please make sure that git-annex-shell is available in PATH when you ssh into the remote. Once you have fixed the git-annex installation, run: git annex enableremote upstream
enableremote: 1 failed
.: upstream(-) [git@github.com:mslw/shiny-invention.git (git)]

mszczepanik@bnbnb64 in /tmp/neuro-data-s3 on git:main
❱ datalad siblings
.: here(+) [git]
.: public-s3(+) [git]
.: upstream(-) [git@github.com:mslw/shiny-invention.git (git)]
```

adswa commented 1 week ago

Thanks a lot for the detailed issue and proposed solutions! I have put up a PR which should fix it, and recognized the contribution in https://github.com/datalad-handbook/book/pull/1232. Please feel free to add yourself to this repository's .zenodo.json file (second to last position, before Michael Hanke), or leave your details (name, orcid, affiliation) as a comment and I'll add you, @NickleDave :)