agstudy / rsqlserver

Sql Server database interface (DBI) driver for R. This is a DBI-compliant Sql Server driver based on System.Data.SqlClient.

dbBulkCopy feature enhancement #27

Closed ruaridhw closed 7 years ago

ruaridhw commented 7 years ago

Pull Request Changes

agstudy commented 7 years ago

Thank you very much.

ruaridhw commented 7 years ago

@agstudy The current master will not function correctly, as the dll file is not representative of the new C# source. (I don't believe the Makefiles or install.libs.R are being called on installation; the install seems to merely replicate inst/libs?) I have just pushed the correct binaries and Lumenworks references to #28
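For context, this is a minimal sketch of what an install.libs.R (placed in src/) would typically do if it were being run: R CMD INSTALL sources the file when present, exposing R_PACKAGE_SOURCE, R_PACKAGE_DIR and R_ARCH. The file locations here are assumptions, not the package's actual script.

```r
# Hypothetical install.libs.R sketch: copy pre-built .NET assemblies from src/
# into the installed package's libs directory during R CMD INSTALL.
dest <- file.path(R_PACKAGE_DIR, paste0("libs", R_ARCH))
dir.create(dest, recursive = TRUE, showWarnings = FALSE)
assemblies <- Sys.glob(file.path(R_PACKAGE_SOURCE, "src", "*.dll"))
file.copy(assemblies, dest, overwrite = TRUE)
```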

ruaridhw commented 7 years ago

Performance Changes

Perhaps a new wiki page would be better suited for this, but here are some initial performance comparisons...

Using a test dataset with ~560K rows and a mix of character and numeric columns:

large_dataset <- data.table::fread("data.csv")
dim(large_dataset)
#[1] 560918     8
table(sapply(large_dataset,class))
#character   numeric 
#6           2
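
All benchmarks below assume an open rsqlserver connection `conn` and a target table `table.name`. A sketch of that setup, following the dbConnect pattern from the package README; the connection string and table name are placeholders:

```r
library(rsqlserver)
library(profvis)

# Placeholder connection string and target table -- adjust to your environment.
url <- "Server=localhost;Database=TEST_RSQLSERVER;Trusted_Connection=True;"
conn <- dbConnect('SqlServer', url = url)
table.name <- "benchmark_import"
```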

Old method

Version: devtools::install_github('agstudy/rsqlserver@ecc48f0')

data.table object

profvis({dbBulkCopy(conn=conn, name=table.name, value=large_dataset)})

Peak memory: 74MB; total time: 22s

raw csv

profvis({dbBulkCopy(conn=conn, name=table.name, value="data.csv")})

Peak memory: 73MB; total time: 10s

New method

Version: devtools::install_github('ruaridhw/rsqlserver@06c35e1')

data.table object

profvis({dbBulkCopy(conn=conn, name=table.name, value=large_dataset)})

Peak memory: 41MB; total time: 18s

raw csv

profvis({dbBulkCopy(conn=conn, name=table.name, value="data.csv")})

Peak memory: 39MB; total time: 7s

Sidenote -- large raw csv

profvis({dbBulkCopy(conn=conn, name=table.name, value="data_2.24milrows.csv")})

Total time: 69s. Interestingly, a 4x increase in row count results in roughly 10x the load time. The old method crashes and cannot attempt to load files this large, both because of the enforced timeout and because the entire file was read into memory through a C# DataTable prior to upload. I lifted the timeout just for this test but haven't committed that change as yet.

Overall Results

Speed reduction of 20-30%; peak memory reduction of nearly 50%.
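
For reference, these figures follow from the four runs reported above (times in seconds, memory in MB):

```r
# Percentage reductions implied by the profvis figures above
old <- c(dt_time = 22, csv_time = 10, dt_mem = 74, csv_mem = 73)
new <- c(dt_time = 18, csv_time =  7, dt_mem = 41, csv_mem = 39)
round(100 * (old - new) / old)
# dt_time csv_time   dt_mem  csv_mem
#      18       30       45       47
```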

For completeness, "New method -- raw csv" should also be benchmarked against the old methodology for bulk copying raw csv files whose fields contain commas. I achieved this by reading in the dataset, replacing every comma with an arbitrary placeholder, writing the dataset back to disk (dbBulkCopy takes care of this), and then post-processing in-database to restore the commas.

profvis({
  data <- data.table::fread("data_withcommas.csv")

  comma.replacement <- "<!COMMA>" # any unique pattern that can be uploaded in place of a comma
  comma.cols <- names(data)[apply(data, 2, function(x) any(stringr::str_detect(x, stringr::fixed(",")), na.rm = TRUE))]
  #length(comma.cols)
  #3
  #half of the 6 character columns contain commas

  # Replace all instances of commas in data
  data[comma.cols] <- lapply(data[comma.cols], function(x) gsub(",", comma.replacement, x, fixed = TRUE))

  dbBulkCopy(conn=conn, name=table.name, value=data)

  sql.comma.replacement <- paste0(
    "UPDATE ", table.name, "
       SET
      ", paste0("[", comma.cols, "] = REPLACE([", comma.cols, "], '", comma.replacement, "', ',')",
                collapse = ",\n")
  )
  dbSendQuery(conn, sql.comma.replacement, timeout = 9999)
})

Peak memory: 120MB; total time: 63s

Some overhead could have been reduced by using data.table functions; however, the majority of the crunching is still done in dbBulkCopy. Compared with this workaround, the "New method -- raw csv" approach equates to a speed reduction of roughly 90% and a peak memory reduction of 66%.
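
For anyone wanting to reproduce that comparison directly, the corresponding new-method call on the same comma-containing file would simply be the following (not benchmarked here; the file name is reused from the workaround above):

```r
# Direct load of the comma-containing csv under the new method;
# no placeholder substitution or in-database post-processing required.
profvis({dbBulkCopy(conn = conn, name = table.name, value = "data_withcommas.csv")})
```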