cloudyr / aws.s3

Amazon Simple Storage Service (S3) API Client
https://cloud.r-project.org/package=aws.s3

Bug: s3sync won't sync to local path #345

Closed kenahoo closed 4 years ago

kenahoo commented 4 years ago

I'm trying to implement the equivalent of a command like aws s3 sync s3://landsat-pds/test test using the aws.s3 package.

First issue

The first problem is some kind of protocol issue:

> library(aws.s3)
> s3sync(files='test', bucket='s3://landsat-pds/test/', direction='download')
Redirection: (301) Moved Permanently
List of 5
 $ Code              : chr "InvalidLocationConstraint"
 $ Message           : chr "The specified location-constraint is not valid"
 $ LocationConstraint: list()
 $ RequestId         : chr "2C646AED80DEDDB2"
 $ HostId            : chr "m//lFmeoXBaonE+w9189Oq4jMfAeyn+Y+fLlscO7SVNhY0YBIVC/XtiR9PRPmfzJ3PV9Ia2wdqI="
 - attr(*, "headers")=List of 7
  ..$ x-amz-request-id : chr "2C646AED80DEDDB2"
  ..$ x-amz-id-2       : chr "m//lFmeoXBaonE+w9189Oq4jMfAeyn+Y+fLlscO7SVNhY0YBIVC/XtiR9PRPmfzJ3PV9Ia2wdqI="
  ..$ content-type     : chr "application/xml"
  ..$ transfer-encoding: chr "chunked"
  ..$ date             : chr "Tue, 17 Mar 2020 21:33:48 GMT"
  ..$ connection       : chr "close"
  ..$ server           : chr "AmazonS3"
  ..- attr(*, "class")= chr [1:2] "insensitive" "list"
 - attr(*, "class")= chr "aws_error"
 - attr(*, "request_canonical")= chr "PUT\n/landsat-pds/\n\nhost:s3.amazonaws.com\nx-amz-acl:private\nx-amz-date:20200317T213348Z\n\nhost;x-amz-acl;x"| __truncated__
 - attr(*, "request_string_to_sign")= chr "AWS4-HMAC-SHA256\n20200317T213348Z\n20200317/us-east-1/s3/aws4_request\n5c33df0f53e69098a29791e2da3a6b974f6c530"| __truncated__
 - attr(*, "request_signature")= chr "AWS4-HMAC-SHA256 Credential=AKIASBDO3N3K55ENOAK7/20200317/us-east-1/s3/aws4_request, SignedHeaders=host;x-amz-a"| __truncated__
NULL
Error in parse_aws_s3_response(r, Sig, verbose = verbose) : 
  Bad Request (HTTP 400).

Notice that it's issuing a PUT request, which seems wrong when I'm only trying to download.

This is a publicly available dataset (which I did not create; I found it at https://registry.opendata.aws/landsat-8/), so I think it should work without credentials.
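For what it's worth, I'd have expected the working call shape for a public bucket to use the bare bucket name plus a key prefix, something like the sketch below (the prefix, the region, and whether aws.s3 sends unsigned requests when no credentials are set are all my assumptions):

library(aws.s3)

# Sketch: list objects under the test/ prefix of the public bucket, using
# the bare bucket name and an explicit region rather than an s3:// URL.
objs <- get_bucket(bucket = "landsat-pds", prefix = "test/",
                   region = "us-east-1", max = 20)

# Then fetch one object at a time; save_object() writes a key to a local file.
# key <- objs[[1]][["Key"]]
# save_object(object = key, bucket = "landsat-pds", region = "us-east-1",
#             file = file.path("test", basename(key)))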

Second issue

Secondly, when I solve the above issue (by using a private bucket and specifying credentials explicitly), there seem to be path problems:

> aws.signature::use_credentials('dev-creds')
> aws.s3::s3sync(files='localdir/', bucket='s3://my-dev-bucket/test/2020-03-08_Test', direction='download')
1 local file to sync
Getting bucket 's3://my-dev-bucket/test/2020-03-08_Test'
7390 objects retrieved from bucket 's3://my-dev-bucket/test/2020-03-08_Test'
7390 bucket objects not found in local directory
Error in curl::curl_fetch_disk(url, x$path, handle = handle) : 
  Failed to open file ForecastDataSets/.
Calls: <Anonymous> ... request_fetch -> request_fetch.write_disk -> <Anonymous>
Execution halted

Notice that it did connect to the bucket and list its contents successfully, but it's trying to write into a local directory called ForecastDataSets/ rather than the localdir/ directory I asked it to sync into.
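For comparison, here's roughly the download loop I expected s3sync to perform internally, done by hand (a sketch on top of get_bucket_df() and save_object(); the prefix stripping and directory creation are my own additions):

library(aws.s3)

bucket <- "my-dev-bucket"
prefix <- "test/2020-03-08_Test/"

# List every object under the prefix; get_bucket_df() returns a data frame
# with one row per object and a Key column.
objs <- get_bucket_df(bucket = bucket, prefix = prefix, max = Inf)

for (key in objs$Key) {
  # Strip the prefix so keys map into localdir/, and create intermediate
  # directories before curl tries to open the file. A missing local
  # directory looks like exactly what s3sync trips over above.
  dest <- file.path("localdir", sub(prefix, "", key, fixed = TRUE))
  dir.create(dirname(dest), recursive = TRUE, showWarnings = FALSE)
  save_object(object = key, bucket = bucket, file = dest)
}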

Workaround

As a workaround, I can run system("aws --profile dev-creds s3 sync s3://my-dev-bucket/test/2020-03-08_Test localdir") and avoid using the aws.s3 package for syncing.
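Equivalently with system2(), which skips shell parsing of the arguments (assuming the aws CLI is on the PATH):

system2("aws", args = c("--profile", "dev-creds", "s3", "sync",
                        "s3://my-dev-bucket/test/2020-03-08_Test", "localdir"))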

Session info:

% Rscript -e "library(aws.s3); sessionInfo()"
R version 3.6.1 (2019-07-05)
Platform: x86_64-apple-darwin17.7.0 (64-bit)
Running under: macOS Mojave 10.14.6

Matrix products: default
BLAS/LAPACK: /usr/local/Cellar/openblas/0.3.7/lib/libopenblasp-r0.3.7.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] aws.s3_0.3.20

loaded via a namespace (and not attached):
 [1] httr_1.4.1          compiler_3.6.1      R6_2.4.0            tools_3.6.1        
 [5] base64enc_0.1-3     curl_4.1            Rcpp_1.0.2          aws.signature_0.5.2
 [9] xml2_1.2.2          digest_0.6.20      
s-u commented 4 years ago

Please don't file two issues in one. Note that files is the list of files you want to synchronize, but you specified a directory; that is likely the cause of your second issue (you probably intended dir("localdir", recursive=TRUE)). Also, bucket is the name of the bucket, not a URL, so the bucket you passed doesn't exist; s3sync therefore tries to create a bucket named "s3://landsat-pds/test", which fails due to a region mismatch (for which s3sync should pass through ..., which it doesn't). s3sync currently doesn't support syncing into a different subdirectory of the bucket.
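Concretely, the documented call shape for your second example would be along these lines (bucket name and local paths taken from your report):

# files: a vector of local file paths; bucket: the bare bucket name.
s3sync(files = dir("localdir", recursive = TRUE),
       bucket = "my-dev-bucket",
       direction = "download")

Note that since subdirectory prefixes aren't supported, this syncs against the whole bucket, not just test/2020-03-08_Test.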

kenahoo commented 4 years ago

I think one of us might be misunderstanding. I'm doing a download, not an upload, so it should never be creating any buckets. Most of your response seems to assume I'm doing an upload or a two-way sync?

Also, it did read all the right data from S3, so I assumed the bucket argument was supplied correctly, even though the URL is of course not a bucket name.

Taking a step back, though - to sync two directories, I have to enumerate all the files in the directories myself? Doesn't this defeat the point of syncing? How would I know what all the files are before I sync?

Or maybe this function isn't doing a similar thing to aws s3 sync? If not, I wonder whether a different name might be better.

s-u commented 4 years ago

It doesn't matter: the old code created a bucket if it didn't exist, regardless of whether you used download or upload. And as I said, the docs clearly state that you supply a list of files and a bucket name, not a URL. That doesn't mean it made sense; it's just how it was written. I rewrote it yesterday, so check the new docs and code and feel free to file new issues, but please make sure you file issues against documented behavior, not your expectations. You can also file enhancement requests if you think there is a better way.