Open matschaffer opened 5 years ago
If possible, create the bucket using https://github.com/Safecast/infrastructure/blob/master/terraform/common/data_s3.tf#L32, but if you're not comfortable with Terraform yet, a manually created bucket is probably fine for testing the theory.
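For anyone doing this by hand first, a public-read bucket resource might look roughly like the sketch below. This is a hypothetical fragment, not the actual Safecast Terraform; the resource name, bucket name, and ACL are assumptions.

```hcl
# Hypothetical sketch of a public-read data bucket, loosely modeled on the
# data_s3.tf file linked above. Names and settings are assumptions, not
# the real Safecast configuration.
resource "aws_s3_bucket" "safecastdata_public" {
  bucket = "safecastdata-public-us-west-2" # assumed name
  acl    = "public-read"                   # fully open read access
}
```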
Code so far
# Install required packages: aws.s3 (S3 access) plus a JSON parser.
# Note: rjson and RJSONIO both provide fromJSON(); only one is strictly needed.
install.packages("devtools")
library(devtools)
install_github("duncantl/RAmazonS3") # alternative S3 client (unused below)
library(RAmazonS3)
install.packages("rjson") # package used to read JSON arrays in R
library(rjson)
install.packages("aws.s3", repos = c("cloudyr" = "http://cloudyr.github.io/drat")) # to retrieve data from S3 buckets
library("aws.s3")
install.packages("RJSONIO")
library(RJSONIO)
######################getting the object (data from 2017-10-01 h12:00)######################
#checking if object with specified path name exists
object_exists("s3://safecastdata-us-west-2/ingest/prd/s3raw/2017-10-01/00/12", region="us-west-2") #object exists: TRUE
# region="us-west-2" must be passed explicitly; aws.s3 defaults to us-east-1
safc <- get_object("s3://safecastdata-us-west-2/ingest/prd/s3raw/2017-10-01/00/12", region="us-west-2")
mode(safc) # type of data = "raw"
safc
# get_object() returns raw bytes, so convert them to a character string:
safecast <- rawToChar(safc)
safecast # the data now appear as JSON text
# Load a JSON parser (rjson and RJSONIO both provide fromJSON(); one is enough).
library(rjson)
safecastdata2017_10_01_00_12 <- fromJSON(safecast) #read JSON lines
safecastdata2017_10_01_00_12
summary(safecastdata2017_10_01_00_12) # data for 2017/10/01 h12 successfully saved
#################################################################
###################### getting other objects ####################
object_exists("s3://safecastdata-us-west-2/ingest/prd/s3raw/", region="us-west-2") # error 404: not found
s3raw <- get_object("s3://safecastdata-us-west-2/ingest/prd/s3raw/", region="us-west-2")
s3raw
mode(s3raw)
s3raw <- rawToChar(s3raw) # convert raw bytes to a character string
s3raw # error - "the specified key does not exist": the prefix is not itself an object
Looks like `get_bucket("safecastdata-us-west-2", prefix="ingest/prd/s3raw/", region="us-west-2")` does the trick.
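For anyone following along, listing under the prefix and then fetching each listed key might look like the sketch below. The prefix and the assumption that each key holds a JSON document come from the thread above and are not verified against the actual bucket layout.

```r
library(aws.s3)
library(rjson)

# List keys under the prefix (bucket name and layout assumed from this thread).
# get_bucket() returns a list of objects, each with a $Key element.
objs <- get_bucket("safecastdata-us-west-2",
                   prefix = "ingest/prd/s3raw/2017-10-01/",
                   region = "us-west-2")

# Fetch each listed key and parse its bytes as JSON.
datasets <- lapply(objs, function(o) {
  raw <- get_object(o$Key, bucket = "safecastdata-us-west-2", region = "us-west-2")
  fromJSON(rawToChar(raw))
})
```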
Still, I think we should try a fully public read-only bucket. It's pretty likely that (1) the R code will get less fiddly and (2) other S3 libraries will work more cleanly when the bucket is listable w/o any additional parameters.
I'm tempted to even put that bucket in us-east-1. We hit early trouble with the R library defaulting to us-east-1 and failing on the 302 redirect over to us-west-2. I'm sure other libraries make similar assumptions about bucket region.
This will incur some transfer cost since the data will have to be exported from us-west-2 to us-east-1, but I suspect it will be far less than our compute costs (possibly even in the <$1/mo range).
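As a rough sanity check on that estimate (both figures below are assumptions, not measurements): S3 inter-region transfer has historically been priced around $0.02/GB, so staying under $1/mo means moving somewhere under ~50 GB/mo.

```r
# Back-of-envelope transfer cost. Pricing is an assumption; check current
# AWS rates before relying on this.
price_per_gb <- 0.02  # USD/GB, assumed us-west-2 -> us-east-1 transfer rate
monthly_gb   <- 40    # hypothetical monthly export volume
monthly_cost <- price_per_gb * monthly_gb
monthly_cost          # 0.8 USD/mo, i.e. under $1 at this volume
```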
Retitling to address the plan of a fully-open us-east-1 bucket to hopefully make it as easy as possible to access our exported data from any runtime w/ minimal parameters specified.
This is partly done: we made a one-time push of some data to our safecast-opendata account.
So the bucket is there and basically ready. Leaving this open to track continual publishing of the data to the new bucket referenced in https://github.com/awslabs/open-data-registry/blob/master/datasets/safecast.yaml#L18
I've been helping one of our analysts get access to ingest data via R.
It seems the R S3 library doesn't work well with our model where the data is readable only under a prefix (e.g., s3://safecastdata-us-west-2/ingest/prd/s3raw/ is open, but s3://safecastdata-us-west-2/ is not).
For starters I think we should set up a s3://safecastdata-public-us-west-2 bucket with totally open read access and copy a few days of data from s3://safecastdata-us-west-2/ingest/prd/s3raw/. If that works we can switch the ingest worker to publish there and hopefully make it easier for others to ingest the data.
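For reference, the "readable only under a prefix" model described above roughly corresponds to a bucket policy like the sketch below. This is a hedged illustration, not the actual Safecast policy: it grants public s3:GetObject under the prefix but no s3:ListBucket, which is consistent with the behavior seen in the R session (individual keys fetchable, listing denied without extra grants).

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "PublicReadUnderPrefixOnly",
      "Effect": "Allow",
      "Principal": "*",
      "Action": "s3:GetObject",
      "Resource": "arn:aws:s3:::safecastdata-us-west-2/ingest/prd/s3raw/*"
    }
  ]
}
```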