Thank you very much.
@agstudy The current master will not function correctly because the dll file does not reflect the new C# source. (I don't believe the Makefiles or install.libs.R are being run on installation, as opposed to merely copying inst/libs?) I have just pushed the correct binaries and Lumenworks references to #28.
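For reference, a src/install.libs.R hook is what would copy freshly built assemblies into the installed package's libs folder at install time. The following is only a minimal sketch using the standard variables R exposes to that script, not the file currently in the repo, and it assumes the C# build leaves the compiled DLLs in src/:

```r
# src/install.libs.R -- minimal sketch; assumes the built rsqlserver and
# Lumenworks DLLs are sitting in src/ after compilation
dest <- file.path(R_PACKAGE_DIR, paste0("libs", R_ARCH))
dir.create(dest, recursive = TRUE, showWarnings = FALSE)
file.copy(Sys.glob("*.dll"), dest, overwrite = TRUE)
```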
Perhaps a new wiki page is better suited for this, but here are some initial performance comparisons.

Using a test dataset with ~560K rows and a mix of character and numeric columns:
```r
large_dataset <- data.table::fread("data.csv")
dim(large_dataset)
# [1] 560918      8
table(sapply(large_dataset, class))
# character   numeric
#         6         2
```
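The profvis calls below assume a `conn` and `table.name` set up beforehand, roughly as follows. The connection string and table name here are placeholders following the rsqlserver README pattern, not the exact values used for these runs:

```r
library(rsqlserver)

# Placeholder connection details -- adjust to your own SQL Server instance
url  <- "Server=localhost;Database=TEST_RSQLSERVER;Trusted_Connection=True;"
conn <- dbConnect('SqlServer', url = url)

table.name <- "large_dataset"  # hypothetical target table
```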
Version (old method): `devtools::install_github('agstudy/rsqlserver@ecc48f0')`

data.table object:

```r
profvis({dbBulkCopy(conn = conn, name = table.name, value = large_dataset)})
```

Peak memory: 74MB. Total time: 22s.

raw csv:

```r
profvis({dbBulkCopy(conn = conn, name = table.name, value = "data.csv")})
```

Peak memory: 73MB. Total time: 10s.
Version (new method): `devtools::install_github('ruaridhw/rsqlserver@06c35e1')`

data.table object:

```r
profvis({dbBulkCopy(conn = conn, name = table.name, value = large_dataset)})
```

Peak memory: 41MB. Total time: 18s.

raw csv:

```r
profvis({dbBulkCopy(conn = conn, name = table.name, value = "data.csv")})
```

Peak memory: 39MB. Total time: 7s.
Sidenote -- large raw csv:

```r
profvis({dbBulkCopy(conn = conn, name = table.name, value = "data_2.24milrows.csv")})
```

Total time: 69s. Interestingly, 4x the row count results in roughly 10x the time taken. The old method crashes and cannot even attempt to load files this large, both because of the enforced timeout and because the entire file was read into memory through a C# DataTable prior to upload. I lifted the timeout just for this test but haven't committed any change for this yet.
Overall, the new method gives a speed reduction of 20-30% and a peak memory reduction of nearly 50%.
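Those headline percentages come straight from the profvis figures above:

```r
# Old vs new timings (s) and peak memory (MB) for the data.table and raw csv runs
old <- list(time = c(dt = 22, csv = 10), mem = c(dt = 74, csv = 73))
new <- list(time = c(dt = 18, csv = 7),  mem = c(dt = 41, csv = 39))

round(100 * (1 - new$time / old$time))  # speed reduction:  dt ~18%, csv 30%
round(100 * (1 - new$mem  / old$mem))   # memory reduction: dt ~45%, csv ~47%
```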
For completeness, "New method -- raw csv" should also be benchmarked against the old workaround for bulk copying raw csv files that contain commas. I achieved this by reading in the dataset, replacing every comma with an arbitrary identifier, writing the dataset back to disk (dbBulkCopy takes care of this) and then post-processing in-database to put the commas back into the data.
```r
profvis({
  # Read as a plain data.frame so the base-R column replacement below works as written
  data <- data.table::fread("data_withcommas.csv", data.table = FALSE)

  # Any unique pattern that can be uploaded in place of a comma
  comma.replacement <- "<!COMMA>"

  # Columns containing at least one comma
  comma.cols <- names(data)[apply(data, 2, function(x)
    any(stringr::str_detect(x, stringr::fixed(",")), na.rm = TRUE))]
  # length(comma.cols)
  # 3
  # half of the 6 character columns contain commas

  # Replace all instances of commas in the data
  data[comma.cols] <- lapply(data[comma.cols], function(x)
    gsub(",", comma.replacement, x, fixed = TRUE))

  dbBulkCopy(conn = conn, name = table.name, value = data)

  # Post-process in-database to restore the commas
  sql.comma.replacement <- paste0(
    "UPDATE ", table.name, "
    SET
    ", paste0("[", comma.cols, "] = REPLACE([", comma.cols, "], '",
              comma.replacement, "', ',')", collapse = ",\n")
  )
  dbSendQuery(conn, sql.comma.replacement, timeout = 9999)
})
```
Peak memory: 120MB. Total time: 63s.
Some overhead could have been reduced by using data.table functions; however, the majority of the crunching is still done in dbBulkCopy. Relative to this workaround (63s, 120MB peak), the new method equates to a speed reduction of roughly 90% and a peak memory reduction of about 66%.
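As a rough illustration of the data.table route mentioned above (not benchmarked; the column detection and replacement token mirror the snippet above), the in-R replacement step could be done by reference instead of copying columns:

```r
library(data.table)

data <- fread("data_withcommas.csv")  # keep as a data.table this time
comma.replacement <- "<!COMMA>"

# Character columns containing at least one comma
comma.cols <- names(data)[vapply(data, function(x)
  is.character(x) && any(grepl(",", x, fixed = TRUE), na.rm = TRUE), logical(1))]

# Replace commas by reference, avoiding the copies made by data[cols] <- lapply(...)
for (col in comma.cols) {
  set(data, j = col, value = gsub(",", comma.replacement, data[[col]], fixed = TRUE))
}
```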
Pull Request Changes