Closed DyfanJones closed 5 years ago
Unit: seconds
expr | min | lq | mean | median | uq | max | neval |
---|---|---|---|---|---|---|---|
method_1 | 2.812445 | 2.878789 | 3.045556 | 2.947542 | 3.168486 | 3.932581 | 100 |
method_2 | 2.261809 | 2.355071 | 2.540992 | 2.421663 | 2.673163 | 3.606258 | 100 |
# data.frame test
x <- 1e8
df <- data.frame(var1 = sample(LETTERS, x, replace = T),
var2 = runif(x))
# get df column class
type <- sapply(df, class)
# write df out as csv
write.csv(df, "test.csv", row.names = F)
# create binary format
bin_df <- readBin("test.csv", "raw", n = file.size("test.csv"))
# time to read: method 1
microbenchmark::microbenchmark(
m1 = read.csv(text = readBin(bin_df, "character")),
m2 = read.csv("test.csv"),
m3 = data.table::fread(readBin(bin_df, "character"),showProgress = F),
m4 = data.table::fread("test.csv", showProgress = F),
m5 = readr::read_csv(bin_df, progress = F),
m6 = readr::read_csv(readBin(bin_df, "character", progress = F)),
m7 = readr::read_csv("test.csv", progress = F),
m8 = read.csv(text = readBin(bin_df, "character"), col.names = names(type), colClasses = unname(type)),
m9 = read.csv("test.csv", col.names = names(type), colClasses = unname(type)),
m10 = data.table::fread(readBin(bin_df, "character"), col.names = names(type), colClasses = unname(type)),
m11 = data.table::fread("test.csv", col.names = names(type), colClasses = unname(type)),
m12 = readr::read_csv(bin_df, col_names = names(type),, skip =1, progress = F),
m13 = readr::read_csv(readBin(bin_df, "character"), col_names = names(type), skip =1, progress = F),
m14 = readr::read_csv("test.csv", col_names = names(type), skip =1, progress = F),
times = 10)
data memory size: 1.7 Gb
When converting Binary to large vector there is a chance of Error: vector memory exhausted (limit reached?)
Due to this using a tempfile()
seems a more safe method.
R character strings are limited to 2^31-1 bytes
Any dataframe over 2 GB R would throw a warning message using the large character method.
For speed data.table will be adopted for the default read in method going forward
Data return size: 19 X 1,356,206 Data memory size: 335.3 Mb