agstudy / rsqlserver

Sql Server database interface (DBI) driver for R. This is a DBI-compliant Sql Server driver based on System.Data.SqlClient.

Have you ever tried using BulkCopy to read the data #20

Closed shrektan closed 9 years ago

shrektan commented 9 years ago

Hi, I just found your package and personally I think it's very useful and interesting.

After reading your introduction, I realized that you've used the .NET framework to read and write, and you provide a function that uses BCP to write tables.

So, I'm just wondering: why not use BCP for both reading and writing, and drop the SqlClient part altogether?

I have to admit that I'm not very familiar with the SQL Server internals; however, I've built two small functions in my working environment that use BCP to both read and write. For queries, the function first uses BCP to write a CSV file to my temp folder, then uses data.table::fread to read the CSV file as fast as possible. In usual circumstances it seems to work fine and is very fast, except that when there is a lot of text the CSV file may fail to be read (for example, when the data itself contains the separator).

So, I just want to understand the reason for not using BCP for everything.

Thanks.

agstudy commented 9 years ago

Hi, I have never used bcp for reading, but I admit I should add it to the TODO list. I will add a new function (any help is welcome):

dbBulckRead

I will add it once I finish integrating the mono version (in the coming days).

shrektan commented 9 years ago

@agstudy Thanks for your reply. Below is my implementation of the BCP read. It's just a rough version, but I hope it can be helpful.

MONO

By the way, I'm also interested in your Mono plan. It will enable people to connect to MSSQL from a Linux machine, right? But how about the speed? Will it be as fast as on Windows? I have to confess I know nothing about Mono.

The reason I ask is that RStudio's Shiny Server can only run on a Linux machine, and going through JDBC is rather slow. So, if your method is fast, it will be great for all of us using MSSQL.

Other ideas

I think the dplyr package is cool, especially the part that translates R syntax to SQL. It makes analysis in R feel almost seamless, and I'm really interested in it. However, dplyr doesn't include a backend for MS SQL Server, because Hadley only wants to support open-source SQL databases.

So, maybe we can consider implementing the generic functions for SQL Server, so that people can use the dplyr package directly for analysis on MS SQL Server? See the sketch below for the kind of usage I have in mind.
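This is only an illustration: src_sqlserver() is a hypothetical constructor that such a backend would have to provide; tbl(), filter(), group_by() and summarise() are existing dplyr verbs.

library(dplyr)

# hypothetical: src_sqlserver() does not exist yet; it stands for the backend
# constructor that a dplyr/SQL Server integration would need to provide
src <- src_sqlserver(server = "myhost", database = "mydb")

# with such a backend, ordinary dplyr verbs would be translated to T-SQL
tbl(src, "sales") %>%
  filter(year == 2015) %>%
  group_by(region) %>%
  summarise(total = sum(amount))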

My code

# NOTE: f_msg(), f_conn_info(), activate_conn() and str_right() are my own
# helpers; sqlQuery()/odbcClose() come from RODBC, fread()/setnames() from
# data.table. Windows only, since it shells out to bcp.exe.
f_bcp_query <- function(conn, sql, if_str = FALSE, if_keep_tmp = FALSE) {

  # if `conn` is not already a string, capture the name of the connection object
  if (!if_str) {
    conn <- as.character(substitute(conn))
  }

  # prepare temporary file and folder
  tempdir <- tempdir()
  tempfile <- tempfile(tmpdir = tempdir, fileext = ".csv")

  if (if_keep_tmp) f_msg("Will store the tmp file in ", tempfile)

  # download data and header via BCP function
  f_bcp <- function(sql, csv_name = tempfile, if_echo = FALSE,
                    hdr = TRUE, if_debug = FALSE) {

    stopifnot(str_right(csv_name, 4) == ".csv")

    tmp <- f_conn_info(conn)

    user <- tmp$UID
    pwd  <- tmp$PWD
    db   <- tmp$SERVER  # server name, used for bcp's -S switch

    if (hdr) {
      # fetch only the column names (TOP 0) via RODBC and save them to a
      # companion "_hdr.csv" file
      conn <- activate_conn(conn, if_str = TRUE)
      tmp <- sqlQuery(
        conn,
        paste0("select top 0 * from (", sql, ") a")
      )
      tmp2 <- gsub(".csv", "_hdr.csv", csv_name, fixed = TRUE)
      write.csv(tmp, tmp2, row.names = FALSE)
      odbcClose(conn)
    }

    # build the bcp command: export the query result to a pipe-delimited file
    str_cmd <- paste0(
      'bcp "%s" queryout "%s" -U "%s" -P "%s" -S "%s" ',
      '-c -t "|" -b 10000 -a 10000'
    )
    str_cmd <- sprintf(str_cmd, sql, csv_name, user, pwd, db)
    # f_msg(str_cmd)
    # run bcp through the Windows shell; capture its output unless if_echo = TRUE
    tmp <- shell(str_cmd, intern = !if_echo)

    if(if_debug) {
      return(str_cmd)
    } else {
      return(tmp)
    }

  }

  # use fread to read data
  f_fread <- function(csv_name = tempfile) {
    csv_hdr <- gsub(".csv", "_hdr.csv", csv_name, fixed = TRUE)
    if (!if_keep_tmp) {
      on.exit({
        unlink(csv_name)
        unlink(csv_hdr)
      })
    }
    tmp <- data.table::fread(
      csv_name, sep = "|", header = FALSE, integer64 = "double"
    )
    tmp_hdr <- read.csv(csv_hdr)
    data.table::setnames(tmp, colnames(tmp), colnames(tmp_hdr))
    tmp
  }

  # export the query result with bcp, then read the CSV back in
  f_bcp(sql)

  return(f_fread())
}
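
For reference, a typical call looks like the lines below (just an illustration: my_conn and the query are placeholders, and the connection name has to be one that my f_conn_info()/activate_conn() helpers know about).

# hypothetical usage: pass the connection and a query, get a data.table back
dt <- f_bcp_query(my_conn, "select * from dbo.sales where year = 2015")

# or pass the connection name as a string and keep the temp CSV for inspection
dt <- f_bcp_query("my_conn", "select * from dbo.sales where year = 2015",
                  if_str = TRUE, if_keep_tmp = TRUE)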

Many thanks!

agstudy commented 9 years ago

1- Thank you for your suggestion.

2- Of course the mono integration will keep the same speed (at least it will be faster than other providers), and you will be able to access SQL Server from a Linux machine.

3- dplyr is a higher-level interface built on top of DBI to avoid writing SQL-specific queries. rsqlserver is fully DBI compliant, so I believe you can already build this interface on top of it. Personally I don't use dplyr; I have no problem using the DBI interface directly (maybe because of my SQL background). See the sketch after this list.

4- Thanks for your f_bcp_query sample. Can you please create a pull request with a test case for it? I will integrate it as soon as possible.
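
To give an idea of what I mean by using DBI directly, the snippet below is a rough sketch: the connection string and table name are illustrative, and the exact dbConnect() arguments are the ones documented in the rsqlserver README.

library(rsqlserver)  # the package exposes the standard DBI interface

# illustrative connection string; adapt it to your own server
url <- "Server=localhost;Database=TEST;Trusted_Connection=True;"
conn <- dbConnect('SqlServer', url = url)

dbWriteTable(conn, 'T_IRIS', iris)                       # write a data.frame
res <- dbGetQuery(conn, "select top 10 * from T_IRIS")   # run an ad hoc query
dbDisconnect(conn)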

shrektan commented 9 years ago

Thanks.

  1. I will create a PR once I figure out how to fix some known issues, e.g. the data can't be read back correctly if a string column contains empty strings.
  2. OK. I'll try to figure out how to implement the dplyr backend. If it matures and you're interested, I hope to create a PR for it then.