beanumber / etl

R package to facilitate ETL operations

write etl_load_side() methods for MySQL and PostgreSQL #12

Closed beanumber closed 8 years ago

beanumber commented 8 years ago

Using DBI::dbWriteTable() requires that the data that you push to the RDBMS be in memory in R. This is fine for most things, since you already have the data in memory. But here, we might have the data in CSVs that we just want to pump directly into the RDBMS, and R really needn't be involved. The right way to do this is with LOAD DATA INFILE ..., which should be way faster.

So write a method called etl_load_side.src_mysql() that will replace dbWriteTable() with dbSendQuery("LOAD DATA INFILE ...").
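For reference, here is a rough sketch of what such a method could do (this is not actual etl code; con, csv, and tablename are just placeholders for whatever the method receives):

load_csv_mysql <- function(con, csv, tablename) {
  # build the LOAD DATA statement; the MySQL server parses the file, not R
  sql <- sprintf(
    "LOAD DATA LOCAL INFILE '%s' INTO TABLE %s
     FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\"'
     LINES TERMINATED BY '\\n'
     IGNORE 1 LINES;",
    csv, tablename
  )
  res <- DBI::dbSendQuery(con, sql)
  DBI::dbClearResult(res)
  invisible(TRUE)
}

Note that the LOCAL variant requires local_infile to be enabled on both the client and the server.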

beanumber commented 8 years ago

Upon further review, it looks like this functionality already exists for MySQL
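For example, RMySQL's dbWriteTable() appears to accept a file path as the value argument, in which case it issues LOAD DATA LOCAL INFILE itself and the CSV never passes through R (the table name and path below are placeholders):

con <- DBI::dbConnect(RMySQL::MySQL(), dbname = "etl_test")
# value is a file path, not a data frame, so the server reads the CSV directly
DBI::dbWriteTable(con, name = "flights", value = "flights.csv", overwrite = TRUE)
DBI::dbDisconnect(con)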

beanumber commented 8 years ago

and RPostgreSQL
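The analogous idea on PostgreSQL can also be expressed as a plain COPY statement sent through DBI (a hedged sketch: COPY ... FROM reads the file on the server, so the path must be visible to the server process and usually needs elevated privileges; table name and path are placeholders):

con <- DBI::dbConnect(RPostgreSQL::PostgreSQL(), dbname = "etl_test")
# the server reads and parses the CSV; R never loads it
res <- DBI::dbSendQuery(con,
  "COPY flights FROM '/tmp/flights.csv' WITH (FORMAT csv, HEADER true);")
DBI::dbClearResult(res)
DBI::dbDisconnect(con)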

beanumber commented 8 years ago

but not RSQLite

beanumber commented 8 years ago

I think the way to do this is, instead of

# read the whole CSV into R, then push the data frame to the database
table <- readr::read_csv(csv)
DBI::dbWriteTable(con, name, table)

roll your own

load_sql <- sprintf("LOAD DATA INFILE '%s' INTO TABLE %s (column1, @dummy, column2, @dummy, column3);",
                    csv, name)
dbSendQuery(con, load_sql)

My guess is that this will be significantly faster.

beanumber commented 8 years ago

This is not really an etl issue, but maintainers of derivative packages should be aware of it. Also, the file-based writing method for SQLite exists now.
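For anyone reading this later, a hedged sketch of that file-based path in RSQLite (the dbWriteTable() method where value is a file path; argument names may differ across RSQLite versions, and the table name and path are placeholders):

con <- DBI::dbConnect(RSQLite::SQLite(), "etl_test.sqlite3")
# value is a file path, so the CSV is imported without a data frame round trip
DBI::dbWriteTable(con, name = "flights", value = "flights.csv",
                  sep = ",", header = TRUE, overwrite = TRUE)
DBI::dbDisconnect(con)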