DyfanJones / noctua

Connect R to Athena using paws SDK (DBI Interface)
https://dyfanjones.github.io/noctua/

should other file types be split by default: benchmark #112

Open DyfanJones opened 3 years ago

DyfanJones commented 3 years ago

When uploading data frames to AWS Athena, noctua only splits compressed delimited files (due to https://github.com/DyfanJones/RAthena/issues/36). Would other file.type formats benefit from similar file splitting?
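For context, the splitting idea can be sketched roughly as below. `split_batches` is a hypothetical helper, not noctua's actual implementation: each batch of at most `max.batch` rows would be written out as its own file in S3.

```r
# Hypothetical sketch (not noctua's internal code): split a data frame
# into batches of at most `max.batch` rows, one batch per uploaded file.
split_batches <- function(dt, max.batch) {
  n <- nrow(dt)
  idx <- ceiling(seq_len(n) / max.batch)  # batch id for each row
  split(dt, idx)                          # list of data frames, one per file
}

# e.g. 100 rows with max.batch = 30 -> 4 files (30, 30, 30 and 10 rows)
batches <- split_batches(data.frame(x = 1:100), 30)
length(batches)  # 4
```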

DyfanJones commented 3 years ago

These initial benchmarks were carried out using the dev branch "append-static-s3-dir"; however, previous versions of noctua should give similar results, as this is benchmarking AWS Athena itself.

# install the noctua branch: append-static-s3-dir
remotes::install_github("dyfanjones/noctua", ref = "append-static-s3-dir")

Benchmark code

library(DBI)
library(data.table)

X <- 1e8

value <- data.table(x = 1:X,
                    y = sample(letters, X, replace = TRUE),
                    z = sample(c(TRUE, FALSE), X, replace = TRUE))

con <- dbConnect(noctua::athena())

dbWriteTable(con, "test_split1", value, file.type = "parquet", compress = TRUE, overwrite = TRUE)
dbWriteTable(con, "test_split2", value, file.type = "parquet", compress = TRUE, overwrite = TRUE, max.batch = .5 * X)
dbWriteTable(con, "test_split3", value, file.type = "parquet", compress = TRUE, overwrite = TRUE, max.batch = .1 * X)
dbWriteTable(con, "test_split4", value, file.type = "parquet", compress = TRUE, overwrite = TRUE, max.batch = .05 * X)
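For reference, assuming one file is written per batch of at most `max.batch` rows (i.e. `ceiling(X / max.batch)` files), the four tables correspond to 1, 2, 10 and 20 files, matching the benchmark labels:

```r
# File counts implied by each max.batch setting (assumption:
# one file per batch of ceiling(X / max.batch) rows).
X <- 1e8
max_batches <- c(X, .5 * X, .1 * X, .05 * X)  # test_split1..test_split4
ceiling(X / max_batches)
# 1  2 10 20
```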

library(microbenchmark)

res <- microbenchmark(
  files_1 = dbGetQuery(con, "select * from test_split1 limit 10"),
  files_2 = dbGetQuery(con, "select * from test_split2 limit 10"),
  files_10 = dbGetQuery(con, "select * from test_split3 limit 10"),
  files_20 = dbGetQuery(con, "select * from test_split4 limit 10"),
  times = 10
)

library(ggplot2)
autoplot(res) +
  labs(title = "AWS Athena benchmark with compressed parquet file",
       subtitle = "compression type: snappy") +
  theme_bw() + 
  theme(text = element_text(size=15))

[plot: parquet_benchmarks]

From this initial benchmark, it looks like splitting parquet files does have its advantages.

DyfanJones commented 3 years ago

Next steps for benchmarking:

DyfanJones commented 3 years ago

Compressed Parquet group by:

Note: Data was taken from: https://github.com/DyfanJones/noctua/issues/112#issuecomment-703637550

res <- microbenchmark(
  files_1 = dbGetQuery(con, "select y, sum(x) as tot_x, avg(x) as avg_x from test_split1 group by y"),
  files_2 = dbGetQuery(con, "select y, sum(x) as tot_x, avg(x) as avg_x from test_split2 group by y"),
  files_10 = dbGetQuery(con, "select y, sum(x) as tot_x, avg(x) as avg_x from test_split3 group by y"),
  files_20 = dbGetQuery(con, "select y, sum(x) as tot_x, avg(x) as avg_x from test_split4 group by y"),
  times = 10
)

library(ggplot2)
autoplot(res) +
  labs(title = "AWS Athena benchmark with compressed parquet file",
       subtitle = "compression type: snappy") +
  theme_bw() + 
  theme(text = element_text(size=15))

[plot: parquet_group_benchmarks]

DyfanJones commented 3 years ago

Compressed Parquet group by #2:

Note: Data was taken from: #112 (comment)

res <- microbenchmark(
  files_1 = dbGetQuery(con, "select y, min(x) - max(x) as range_x from test_split1 group by y"),
  files_2 = dbGetQuery(con, "select y, min(x) - max(x) as range_x from test_split2 group by y"),
  files_10 = dbGetQuery(con, "select y, min(x) - max(x) as range_x from test_split3 group by y"),
  files_20 = dbGetQuery(con, "select y, min(x) - max(x) as range_x from test_split4 group by y"),
  times = 10
)

library(ggplot2)
autoplot(res) +
  labs(title = "AWS Athena benchmark with compressed parquet file",
       subtitle = "compression type: snappy") +
  theme_bw() + 
  theme(text = element_text(size=15))

[plot: parquet_range_group_benchmark]