Safecast / safecastapi

The app that powers api.safecast.org

Publish data to open us-east-1 S3 bucket #505

Open matschaffer opened 5 years ago

matschaffer commented 5 years ago

I've been helping one of our analysts get access to the ingest data via R.

It seems the R S3 library doesn't handle our access model well, where the data is readable only under a prefix (e.g., s3://safecastdata-us-west-2/ingest/prd/s3raw/ is open, but s3://safecastdata-us-west-2/ is not).
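That access model can be sketched as a toy check (Python, illustrative only; the prefix comes from the paths above, the helper name is made up):

```python
ALLOWED_PREFIX = "ingest/prd/s3raw/"  # the only prefix that is openly readable

def is_readable(key: str) -> bool:
    # Toy model of the current bucket policy: reads succeed only under the
    # open prefix, so a listing at the bucket root is denied.
    return key.startswith(ALLOWED_PREFIX)
```

This is why a plain `GET` on a known key works but root-level listing fails for libraries that expect the whole bucket to be open.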

For starters I think we should set up an s3://safecastdata-public-us-west-2 bucket with fully open read access and copy a few days of data from s3://safecastdata-us-west-2/ingest/prd/s3raw/. If that works, we can switch the ingest worker to publish there and hopefully make it easier for others to consume the data.
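Enumerating the per-day prefixes to copy could look something like this (Python sketch; the `YYYY-MM-DD/` key layout under `ingest/prd/s3raw/` is inferred from the paths in this thread, and the function name is hypothetical):

```python
from datetime import date, timedelta

def daily_prefixes(end: date, days: int, base: str = "ingest/prd/s3raw/"):
    """Build the per-day S3 key prefixes for the last `days` days ending at `end`."""
    return [f"{base}{(end - timedelta(days=n)).isoformat()}/" for n in range(days)]
```

Each returned prefix could then be fed to an `aws s3 cp --recursive` (or equivalent) from the source to the public bucket.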

matschaffer commented 5 years ago

If possible, create the bucket using https://github.com/Safecast/infrastructure/blob/master/terraform/common/data_s3.tf#L32, but if you're not comfortable with Terraform yet, a manually created bucket is probably fine for testing the theory.

matschaffer commented 5 years ago

Code so far

#downloading required packages: aws.s3, RAmazonS3, rjson and RJSONIO
install.packages("devtools")
library(devtools)
install_github("duncantl/RAmazonS3")
library(RAmazonS3)
install.packages("rjson") #package used to read JSON arrays in R
library(rjson)
install.packages("aws.s3", repos = c("cloudyr" = "http://cloudyr.github.io/drat")) #to retrieve data from S3 buckets
library("aws.s3")
install.packages("RJSONIO")
library(RJSONIO)

######################getting the object (data from 2017-10-01 h12:00)######################

#checking if object with specified path name exists
object_exists("s3://safecastdata-us-west-2/ingest/prd/s3raw/2017-10-01/00/12", region="us-west-2") #object exists: TRUE
# region="us-west-2" is needed because the aws.s3 default region is us-east-1

safc <- get_object("s3://safecastdata-us-west-2/ingest/prd/s3raw/2017-10-01/00/12", region="us-west-2")
mode(safc) # type of data = "raw"
safc

#R automatically reads the object using "unknown" (ASCII) encoding, so:
library(stringr)
# convert the raw vector to a character string
safecast <- rawToChar(safc)
safecast # now data appear in JSON format

# Load the packages required to read JSON files.
library(RJSONIO)
library("rjson")
safecastdata2017_10_01_00_12 <- fromJSON(safecast) #read JSON lines
safecastdata2017_10_01_00_12
summary(safecastdata2017_10_01_00_12) # data for 2017/10/01 h12 successfully saved

#################################################################

###################### getting other objects ####################

object_exists("s3://safecastdata-us-west-2/ingest/prd/s3raw/", region="us-west-2") #error 404: not found
s3raw <- get_object("s3://safecastdata-us-west-2/ingest/prd/s3raw/", region="us-west-2")
s3raw
mode(s3raw)
library(stringr)
s3raw <- rawToChar(s3raw) # convert to UTF8
s3raw #error - "the specified key does not exist"
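For reference, the rawToChar + fromJSON step above is just a bytes-to-JSON decode; a minimal Python equivalent on a synthetic payload (the field names here are invented, and whether the objects are newline-delimited is an assumption):

```python
import json

# Synthetic stand-in for the raw bytes get_object() returns; the real
# objects appear to hold JSON records, parsed in the R code with
# fromJSON() after rawToChar().
raw = b'{"device_id": 1, "value": 36.0}\n{"device_id": 2, "value": 41.5}\n'
records = [json.loads(line) for line in raw.decode("utf-8").splitlines() if line]
```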
matschaffer commented 5 years ago

Looks like get_bucket("safecastdata-us-west-2", prefix="ingest/prd/s3raw/", region="us-west-2") does the trick.

Still, I think we should try a fully public read-only bucket. It's pretty likely that (1) the R code will get less fiddly and (2) other S3 libraries will work more cleanly when the bucket is listable without any additional parameters.

matschaffer commented 5 years ago

I'm tempted to even put that bucket in us-east-1. We hit early trouble with the R library defaulting to us-east-1 and failing on the 302 over to us-west-2. I'm sure there are other libraries that assume bucket region similarly.

This will incur some transfer cost since data will have to be exported from us-west-2 to us-east-1, but I suspect it will be far less than our compute costs (possibly even in the <$1/mo range).
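Rough arithmetic behind that guess (both figures are assumptions, not taken from billing):

```python
# Assumed figures: roughly USD 0.02/GB for us-west-2 -> us-east-1 transfer,
# and an illustrative 10 GB/month of exported sensor data.
price_per_gb_usd = 0.02
gb_per_month = 10
monthly_cost_usd = price_per_gb_usd * gb_per_month  # 0.2
```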

matschaffer commented 5 years ago

Retitling to reflect the plan for a fully open us-east-1 bucket, which should make it as easy as possible to access our exported data from any runtime with minimal parameters.

matschaffer commented 5 years ago

This is partially done via a one-time push of some data to our safecast-opendata account.

So the bucket is there and basically ready. Leaving this open to cover continuous publishing of the data to the new bucket referenced in https://github.com/awslabs/open-data-registry/blob/master/datasets/safecast.yaml#L18