Stephen-McDaniel / rpostgresql

This repository is an export of the final version from the retired Google Code system (code.google.com/p/rpostgresql).
0 stars 0 forks source link

Ideas: more data types for dbWriteTable and is there any way to adapt pipe-to-psql approach in attached? #33

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
Some data type mapping thoughts:
logical -> boolean
Data -> date

Also, sometimes text doesn't seem to work well (e.g., storing firm identifiers 
like CUSIPs as text seems to make indexing difficult). The attached has some 
ideas here.

Note that the attached is a kluge I'm using to deal with the fact that 
dbWriteTable doesn't handle backslashes well. But performance of the 
pipe-to-psql approach seems pretty good; do you think could some aspects of 
this be replicated in the library/package?

 load("~/Desktop/test.data.Rdata")
> source('~/Desktop/dbWriteTable.fast.R')
> library(RPostgreSQL) # using r181
> drv <- dbDriver("PostgreSQL")
> pg <- dbConnect(drv, dbname = "crsp", host="localhost")
> system.time(rs <- dbWriteTable(pg,"testdata",test.data, overwrite=TRUE, 
row.names=FALSE))
Error in postgresqlgetResult(new.con) : 
  RS-DBI driver: (could not Retrieve the result : ERROR:  invalid input syntax for type double precision: "Chairman and Chief Executive Officer"
CONTEXT:  COPY testdata, line 707, column Bonus: "Chairman and Chief Executive 
Officer"
)
Timing stopped at: 0.047 0.002 0.075 
> # Problem is backslashes (show here as "\\")
> test.data[707,]
         fyend gvkey   Executive_ID fyear Company_ID       Executive               Company  Bonus
707 2000-06-30  <NA> 3061.7.N.23191  2000   3061.7.N KIGHT, PETER J. CHECKFREE 
CORP \\GA\\ 551850
                                   Title salary
707 Chairman and Chief Executive Officer 420000
> # Note that following produces "CHECKFREE CORP \GA\" on PostgreSQL side, as 
expected
> system.time(rs <- dbWriteTable.fast(pg,"testdata",test.data, overwrite=TRUE, 
row.names=FALSE))
   user  system elapsed 
  0.026   0.004   0.061 
> dbGetQuery(pg,"SELECT * FROM testdata LIMIT 2")
       fyend  gvkey   executive_id fyear company_id           executive       company bonus
1 2000-06-30 010777 7265.7.N.72074  2000   7265.7.N PELLIGRINO, PAUL A. TWIN 
DISC INC 24400
2 2000-06-30 009023     3309.3.N.6  2000   3309.3.N        OSTEN, JOZEF       
DVI INC 40000
                       title salary
1 Vice President Engineering 134846
2             Vice President 279796
> dbDisconnect(pg)
[1] TRUE
> # Now try a remote connection
> pg <- dbConnect(drv, db="crsp", host=my.office.computer)
> system.time(rs <- dbWriteTable(pg,"testdata",test.data, overwrite=TRUE, 
row.names=FALSE))
Error in postgresqlgetResult(new.con) : 
  RS-DBI driver: (could not Retrieve the result : ERROR:  invalid input syntax for type double precision: "Chairman and Chief Executive Officer"
CONTEXT:  COPY testdata, line 707, column Bonus: "Chairman and Chief Executive 
Officer"
)
Timing stopped at: 0.059 0.004 0.626 
> system.time(rs <- dbWriteTable.fast(pg,"testdata",test.data,  overwrite=TRUE, 
row.names=FALSE))
   user  system elapsed 
  0.028   0.007   0.523 
> dbDisconnect(pg)
[1] TRUE
> # Now try a clean data set
> pg <- dbConnect(drv, dbname = "crsp", host="localhost")
> test.data.2 <- rnorm(1e6)
> dim(test.data.2)[1] <- 100000
> dim(test.data.2) <- c(100000,10)
> test.data.2 <- as.data.frame(test.data.2)
> pg <- dbConnect(drv, db="crsp", host="localhost")
> system.time(rs <- dbWriteTable(pg,"testdata2",test.data.2, overwrite=TRUE, 
row.names=FALSE))
   user  system elapsed 
  1.996   0.110   3.509 
> system.time(rs <- dbWriteTable.fast(pg,"testdata2",test.data.2, 
overwrite=TRUE, row.names=FALSE))
   user  system elapsed 
  2.080   0.072   2.245 
> 

Original issue reported on code.google.com by iand...@gmail.com on 5 May 2011 at 10:21

Attachments:

GoogleCodeExporter commented 9 years ago
Meant to attach test.data.Rdata, not test.data.R. Sorry.

Original comment by iand...@gmail.com on 5 May 2011 at 10:24

Attachments:

GoogleCodeExporter commented 9 years ago
More datatype support is considered as enhancement.

pipe-to-psql does use as the same CPU time, and require to tell the right path 
to psql, which may not really exist in the system.  In terms of elapsed time, 
perhaps the write to temporary file is apparently a waste of time.  That might 
be removed during constructing proper escaping mechanism and rewrite of the 
structure.  If the time gain comes from parallel processing it might not be as 
fast as a pipe, though.

Original comment by tomoa...@kenroku.kanazawa-u.ac.jp on 24 Aug 2011 at 12:40

GoogleCodeExporter commented 9 years ago
logical <-> boolean

should work as is.

Original comment by tomoa...@kenroku.kanazawa-u.ac.jp on 25 Sep 2011 at 9:34

GoogleCodeExporter commented 9 years ago
This issue will be closed as the escape problem should have been 
fixed by r182 and r183.

For others (data conversion and performance), if you feel 
unsatisfied with 0.2, please raise an independent new issue.

Original comment by tomoa...@kenroku.kanazawa-u.ac.jp on 10 Oct 2011 at 7:47