cloudyr / aws.s3

Amazon Simple Storage Service (S3) API Client
https://cloud.r-project.org/package=aws.s3

Error in UseMethod("xmlSApply") #40

Closed milosgajdos closed 8 years ago

milosgajdos commented 8 years ago

I've installed this wonderful-looking package and tried to retrieve a list of files in one of my S3 buckets, but I seem to be getting the following error:

> aws.s3::getbucket(bucket = "kaggle.ml.data")
No encoding supplied: defaulting to UTF-8.
Error in UseMethod("xmlSApply") :
  no applicable method for 'xmlSApply' applied to an object of class "c('xml_document', 'xml_node')"

Platform: Mac OS X 10.11.3 (El Capitan)

R version 3.2.3 (2015-12-10) -- "Wooden Christmas-Tree"
Copyright (C) 2015 The R Foundation for Statistical Computing
Platform: x86_64-apple-darwin13.4.0 (64-bit)

leeper commented 8 years ago

Thanks for this. The code was not set up to accommodate httr's switch from XML to xml2 as the default XML parser. I believe this is now fixed. Can you confirm that it is now working for you, @milosgajdos83?

milosgajdos commented 8 years ago

Excellent! The request now goes through without any error and returns an s3_bucket object.

It seems this has done the trick :+1:

Now, a quick unrelated R-n00b question (if you don't mind) - I'm only just beginning to learn R :-) How do I access each bucket item individually?

From what I can see in the code, aws.s3::getbucket(bucket = "bucket_name") returns an s3_bucket with the following attributes:

> attributes(my_bucket)
$names
[1] "Name"        "Prefix"      "Marker"      "MaxKeys"     "IsTruncated"
[6] "Contents"    "Contents"    "Contents"    "Contents"

$class
[1] "s3_bucket"

Now I can see that the Contents attributes contain the actual objects stored in the S3 bucket as list elements:

> summary(my_bucket)
            Length Class     Mode
Name        1      -none-    character
Prefix      0      -none-    NULL
Marker      0      -none-    NULL
MaxKeys     1      -none-    character
IsTruncated 1      -none-    character
Contents    6      s3_object list
Contents    6      s3_object list
Contents    6      s3_object list
Contents    6      s3_object list

Is there any way I can easily iterate through each of the objects (even better, through every file object)? I thought I'd be able to access them via list indexes, but it seems ALL of the S3 objects end up at index 1, i.e. my_bucket$Contents returns just one object, as if each were overridden by the last S3 object retrieved, and thus they can't be iterated over (see my guess after the output below):

> summary(my_bucket$Contents)
             Length Class  Mode
Key          1      -none- character
LastModified 1      -none- character
ETag         1      -none- character
Size         1      -none- character
Owner        2      -none- list
StorageClass 1      -none- character

> attributes(my_bucket$Contents)
$names
[1] "Key"          "LastModified" "ETag"         "Size"         "Owner"
[6] "StorageClass"

$class
[1] "s3_object"
>
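Is the intended approach something like base R's names-based subsetting? This is just a guess on my part (the element names come from the summary output above):

# guess: match every element named "Contents" by name
contents <- my_bucket[names(my_bucket) == "Contents"]

# then iterate, e.g. pull out each object's Key
keys <- sapply(contents, function(x) x$Key)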

I'm sure I'm missing something, so I'd really appreciate a bit of help :-) Lastly, huge thanks for the awesome cloudyr work!

markdanese commented 8 years ago

I just got the same error, having just installed from GitHub.

> library(aws.s3)
> Sys.setenv("AWS_ACCESS_KEY_ID" = "xxx",
+            "AWS_SECRET_ACCESS_KEY" = "yyy",
+            "AWS_DEFAULT_REGION" = "us-east-1")
> bucketlist()

No encoding supplied: defaulting to UTF-8.
Error in UseMethod("xmlSApply") :
  no applicable method for 'xmlSApply' applied to an object of class "c('xml_document', 'xml_node')"

httr version 1.1.0
xml2 version 0.1.2
R version 3.2.4

It is accessing the server. Seems to be a problem parsing the list.

UPDATE: apparently I was using the "latest stable version" and not the most current version from GitHub (which requires ghit). So the problem is solved. You just have to use the right development version!
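For anyone else hitting this, the install step I mean is roughly the following (a sketch; the repo path is taken from the project page above):

# install the current development version from GitHub via ghit
ghit::install_github("cloudyr/aws.s3")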

cboettig commented 8 years ago

@milosgajdos83 I think you want getobject()? Here's an example of iterating over a bucket to get all objects in the bucket: https://github.com/ropensci/drat/blob/gh-pages/parse_s3_logs.R#L19-L26, and then iterating over that object list to download each object: https://github.com/ropensci/drat/blob/gh-pages/parse_s3_logs.R#L30-L34. @leeper can probably comment if there's a better way, but I believe the S3 API is pretty low-level. Perhaps we can abstract these common tasks into some helper functions in the R package.
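The rough pattern in those linked lines is something like this (a sketch rather than the verbatim script; "mybucket" is a placeholder and the getobject() argument names are my assumption):

library("aws.s3")

# list the bucket, then keep only the per-object "Contents" entries
b <- getbucket(bucket = "mybucket")
contents <- b[names(b) == "Contents"]
keys <- vapply(contents, function(x) x$Key, character(1))

# fetch each object individually by key
objects <- lapply(keys, function(k) getobject(bucket = "mybucket", object = k))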

cboettig commented 8 years ago

@leeper From @markdanese's error it looks like we must be using an httr::content() call to parse XML without specifying the parser. I'd try a PR, but I'm just not spotting the content() call.

As you probably know, httr recently dropped XML in favor of xml2 as the default XML parser, so when relying on httr::content()'s automatic type detection, the function returns the xml_document mentioned in Mark's error message (the xml2 object class), rather than the XML-package document that the old XML::xmlSApply needs.

Replacing the current call to content() with XML::xmlParse(httr::content(response, as = "text")) would do the trick, if only I could find it. Alternately, we might just want to move over to xml2...
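Concretely, the change I have in mind is something like this (a sketch; response stands in for whatever httr response object the package holds at that point):

library("httr")
library("XML")

# httr >= 1.1 auto-parses XML responses with xml2, yielding an xml_document:
# doc <- content(response)

# workaround: take the body as text, then parse it with the old XML package
# so that downstream XML::xmlSApply() calls keep working
doc <- XML::xmlParse(httr::content(response, as = "text"))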

markdanese commented 8 years ago

@cboettig -- I just updated the issue. I think it is solved, but it is only in the development version.

leeper commented 8 years ago

I am pushing this through today with a number of breaking changes. We are abandoning XML in favor of xml2, and get_bucket() (a new function replacing getbucket()) is going to return a list of objects of class "s3_object". The bucket metadata that used to be part of an "s3_bucket" object is now stored in that object's attributes. As such, if you wanted to, for example, get every object out of a bucket, you should be able to do:

library("aws.s3")
b <- get_bucket(bucket = "mybucket")

# load objects as raw vectors in memory
lapply(b, get_object) 

# save all objects locally to specified vector of file names
mapply(save_object, object = b, file = paste0("file", seq_along(b), ".txt"))
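
The old bucket-level metadata can then be read off the attributes, something like this (field names as shown in the listing earlier in this thread):

# bucket metadata now lives in the object's attributes
attr(b, "Name")
attr(b, "IsTruncated")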

Hope this solves everyone's issues.