DyfanJones / s3fs

Access Amazon Web Service 'S3' as if it were a file system. File system 'API' design around R package 'fs'
https://dyfanjones.github.io/s3fs/
Other
41 stars 1 forks source link
aws aws-s3 fs minio r r-package

s3fs

s3fs status
badge R-CMD-check Codecov test
coverage CRAN
status

s3fs provides a file-system like interface into Amazon Web Services for R. It utilizes paws SDKand R6 for it’s core design. This repo has been inspired by Python’s s3fs, however it’s API and implementation has been developed to follow R’s fs.

Installation

You can install the released version of s3fs from CRAN with:

install.packages('s3fs')

r-universe installation:

# Enable repository from dyfanjones
options(repos = c(
  dyfanjones = 'https://dyfanjones.r-universe.dev',
  CRAN = 'https://cloud.r-project.org')
)

# Download and install s3fs in R
install.packages('s3fs')

Github installation

remotes::install_github("dyfanjones/s3fs")

Dependencies

Comparison with fs

s3fs attempts to give the same interface as fs when handling files on AWS S3 from R.

Extra features:

For example: copy a large file from one location to the next.

library(s3fs)
library(future)

plan("multisession")

s3_file_copy("s3://mybucket/multipart/large_file.csv", "s3://mybucket/new_location/large_file.csv")

s3fs to copy a large file (> 5GB) using multiparts, future allows each multipart to run in parallel to speed up the process.

For example: Copying a large file from AWS S3 to R.

library(s3fs)
library(future)

plan("multisession")

s3_file_copy_async("s3://mybucket/multipart/large_file.csv", "large_file.csv")

Usage

fs has a straight forward API with 4 core themes:

s3fs follows theses themes with the following:

NOTE: link_ is currently not supported.

library(s3fs)

# Construct a path to a file with `path()`
s3_path("foo", "bar", letters[1:3], ext = "txt")
#> [1] "s3://foo/bar/a.txt" "s3://foo/bar/b.txt" "s3://foo/bar/c.txt"

# list buckets
s3_dir_ls()
#> [1] "s3://MyBucket1"
#> [2] "s3://MyBucket2"                                        
#> [3] "s3://MyBucket3"               
#> [4] "s3://MyBucket4"                            
#> [5] "s3://MyBucket5"

# list files in bucket
s3_dir_ls("s3://MyBucket5")
#> [1] "s3://MyBucket5/iris.json"     "s3://MyBucket5/athena-query/"
#> [3] "s3://MyBucket5/data/"         "s3://MyBucket5/default/"     
#> [5] "s3://MyBucket5/iris/"         "s3://MyBucket5/made-up/"     
#> [7] "s3://MyBucket5/test_df/"

# create a new directory
tmp <- s3_dir_create(s3_file_temp(tmp_dir = "MyBucket5"))
tmp
#> [1] "s3://MyBucket5/filezwkcxx9q5562"

# create new files in that directory
s3_file_create(s3_path(tmp, "my-file.txt"))
#> [1] "s3://MyBucket5/filezwkcxx9q5562/my-file.txt"
s3_dir_ls(tmp)
#> [1] "s3://MyBucket5/filezwkcxx9q5562/my-file.txt"

# remove files from the directory
s3_file_delete(s3_path(tmp, "my-file.txt"))
s3_dir_ls(tmp)
#> character(0)

# remove the directory
s3_dir_delete(tmp)

Created on 2022-06-21 by the reprex package (v2.0.1)

Similar to fs, s3fs is designed to work well with the pipe.

library(s3fs)
paths <- s3_file_temp(tmp_dir = "MyBucket") |>
 s3_dir_create() |>
 s3_path(letters[1:5]) |>
 s3_file_create()
paths
#> [1] "s3://MyBucket/fileazqpwujaydqg/a"
#> [2] "s3://MyBucket/fileazqpwujaydqg/b"
#> [3] "s3://MyBucket/fileazqpwujaydqg/c"
#> [4] "s3://MyBucket/fileazqpwujaydqg/d"
#> [5] "s3://MyBucket/fileazqpwujaydqg/e"

paths |> s3_file_delete()
#> [1] "s3://MyBucket/fileazqpwujaydqg/a"
#> [2] "s3://MyBucket/fileazqpwujaydqg/b"
#> [3] "s3://MyBucket/fileazqpwujaydqg/c"
#> [4] "s3://MyBucket/fileazqpwujaydqg/d"
#> [5] "s3://MyBucket/fileazqpwujaydqg/e"

Created on 2022-06-22 by the reprex package (v2.0.1)

NOTE: all examples have be developed from fs.

File systems that emulate S3

s3fs allows you to connect to file systems that provides an S3-compatible interface. For example, MinIO offers high-performance, S3 compatible object storage. You will be able to connect to your MinIO server using s3fs::s3_file_system:

library(s3fs)

s3_file_system(
  aws_access_key_id = "minioadmin",  
  aws_secret_access_key = "minioadmin",
  endpoint = "http://localhost:9000"
)

s3_dir_ls()
#> [1] ""

s3_bucket_create("s3://testbucket")
#> [1] "s3://testbucket"

# refresh cache
s3_dir_ls(refresh = T)
#> [1] "s3://testbucket"

s3_bucket_delete("s3://testbucket")
#> [1] "s3://testbucket"

# refresh cache
s3_dir_ls(refresh = T)
#> [1] ""

Created on 2022-12-14 with reprex v2.0.2

NOTE: if you to want change from AWS S3 to Minio in the same R session, you will need to set the parameter refresh = TRUE when calling s3_file_system again. You can use multiple sessions by using the R6 class S3FileSystem directly.

Feedback wanted

Please open a Github ticket raising any issues or feature requests.