DyfanJones / noctua

Connect R to Athena using paws SDK (DBI Interface)
https://dyfanjones.github.io/noctua/

Fetching Data from Athena Speed test #25

Closed (DyfanJones closed this issue 5 years ago)

DyfanJones commented 5 years ago
library(noctua)

con <- dbConnect(noctua::athena())

# run the query and poll Athena until the query execution finishes
res <- dbExecute(con, "select * from sampledb.elb_logs")
result <- noctua:::poll(res)

# split the S3 output location into bucket and key
result_info <- noctua:::split_s3_uri(result$QueryExecution$ResultConfiguration$OutputLocation)

# download the query result csv from S3 as a raw vector
obj <- res@connection@ptr$S3$get_object(Bucket = result_info$bucket, Key = result_info$key)

# method_1: convert the raw body to a character string and parse in memory
data.table::fread(readBin(obj$Body, "character"))

# method_2: write the raw body to disk, then read the file
writeBin(obj$Body, con = "test.csv")
data.table::fread("test.csv")

Data return size: 19 columns x 1,356,206 rows
Data memory size: 335.3 Mb

DyfanJones commented 5 years ago

Unit: seconds

     expr      min       lq     mean   median       uq      max neval
 method_1 2.812445 2.878789 3.045556 2.947542 3.168486 3.932581   100
 method_2 2.261809 2.355071 2.540992 2.421663 2.673163 3.606258   100
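
For reference, a hedged reconstruction of the benchmark call that could produce the timings above (the method names and times = 100 are inferred from the neval column; obj is reused from the snippet in the previous comment):

microbenchmark::microbenchmark(
  method_1 = data.table::fread(readBin(obj$Body, "character")),
  method_2 = {
    writeBin(obj$Body, con = "test.csv")
    data.table::fread("test.csv")
  },
  times = 100)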
DyfanJones commented 5 years ago
# data.frame test
x <- 1e8
df <- data.frame(var1 = sample(LETTERS, x, replace = T),
                 var2 = runif(x))

# get df column class
type <- sapply(df, class)

# write df out as csv
write.csv(df, "test.csv", row.names = F)

# read the csv back in as a raw vector (stands in for the S3 object's Body)
bin_df <- readBin("test.csv", "raw", n = file.size("test.csv"))

# benchmark reading from the in-memory binary vector vs from disk,
# with and without pre-specified column types
microbenchmark::microbenchmark(
  m1 = read.csv(text = readBin(bin_df, "character")),
  m2 = read.csv("test.csv"),
  m3 = data.table::fread(readBin(bin_df, "character"), showProgress = F),
  m4 = data.table::fread("test.csv", showProgress = F),
  m5 = readr::read_csv(bin_df, progress = F),
  m6 = readr::read_csv(readBin(bin_df, "character"), progress = F),
  m7 = readr::read_csv("test.csv", progress = F),
  m8 = read.csv(text = readBin(bin_df, "character"), col.names = names(type), colClasses = unname(type)),
  m9 = read.csv("test.csv", col.names = names(type), colClasses = unname(type)),
  m10 = data.table::fread(readBin(bin_df, "character"), col.names = names(type), colClasses = unname(type)),
  m11 = data.table::fread("test.csv", col.names = names(type), colClasses = unname(type)),
  m12 = readr::read_csv(bin_df, col_names = names(type), skip = 1, progress = F),
  m13 = readr::read_csv(readBin(bin_df, "character"), col_names = names(type), skip = 1, progress = F),
  m14 = readr::read_csv("test.csv", col_names = names(type), skip = 1, progress = F),
  times = 10)

Data memory size: 1.7 Gb

DyfanJones commented 5 years ago

When converting the binary object to one large character vector there is a chance of hitting "Error: vector memory exhausted (limit reached?)". Because of this, writing the binary data out to a tempfile() seems the safer method.

R character strings are limited to 2^31 - 1 bytes.

For any data frame over 2 GB, R would throw a warning message when using the large character method.
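
A minimal sketch of the tempfile() approach, reusing obj$Body (the raw vector returned by get_object in the first comment):

tmp <- tempfile(fileext = ".csv")                   # temporary file on disk
writeBin(obj$Body, con = tmp)                       # write raw bytes straight to file
dt <- data.table::fread(tmp, showProgress = FALSE)  # parse from disk
unlink(tmp)                                         # remove the temporary file

This avoids materialising the whole result as a single R string, so the 2^31 - 1 byte limit never comes into play.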

DyfanJones commented 5 years ago

For speed, data.table will be adopted as the default read-in method going forward.
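
For illustration, a hedged sketch of the standard DBI flow this decision affects (dbGetQuery is the generic DBI interface; that the fetch is backed by data.table::fread is an assumption based on this thread):

library(DBI)
library(noctua)

con <- dbConnect(noctua::athena())
# fetching the result set should now go through data.table::fread
df <- dbGetQuery(con, "select * from sampledb.elb_logs")
dbDisconnect(con)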