beanumber / etl

R package to facilitate ETL operations

Should etl objects always extend src objects? #15

Closed cpsievert closed 8 years ago

cpsievert commented 8 years ago

In #3, the central etl object doesn't extend a src object unless a database is specified in etl_load(). If you don't provide a database, that could mean 'load data into R':

etl('foo') %>%
  etl_extract() %>%
  etl_transform() %>% 
  etl_load()
# returns a list of data frames

But if you do provide a database, that means 'load data into this database':

db <- dplyr::src_sqlite(tempfile(), create = TRUE)
etl('foo') %>%
  etl_extract() %>%
  etl_transform() %>% 
  etl_load(db)
# returns a src object pointing to the database

Although consistency in the central object's data structure is important, I think the added flexibility is worth it, since potential package authors probably won't have DBI/SQL chops. What do you think @beanumber? If we go this route, I'll have to make some backwards-incompatible changes.

beanumber commented 8 years ago

Hmm...I think we are talking about two things here:

  1. Should you be forced to use an SQL backend, or should you be allowed to etl_load the data into memory?
  2. Should the db be associated only with etl_load?

For 1, I could see this being fine, but it seems to me that it would be a lot cleaner if the result of etl_load always inherited from dplyr::src. I guess it wouldn't have to be src_sql. But if it were just a list of data frames, that seems like it would get confusing.

The counter-argument is: why would you ever want a list of data frames instead of a src_sqlite? With dplyr, you don't need SQL chops to work with SQL data. My understanding is that SQLite is a more efficient storage format than CSVs if you are working out of memory, and you can put the SQLite database in memory if you want. If we give people the option of not using SQL because they don't know it, they won't use it, and then they'll miss out on all the benefits they could have had.

What happens if someone tries to load the whole airlines data set into memory? I envision this package as providing functionality for "medium" data, so defaulting to in-memory data frames could be problematic. Would you be satisfied if the default location for the SQLite database were :memory: instead of a temp directory?
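
To make that concrete, here is a minimal sketch of the suggested default, still using dplyr::src_sqlite (":memory:" is the standard RSQLite convention for a transient in-memory database):

# in-memory SQLite database instead of a temp file; the result of
# etl_load() would still inherit from dplyr::src
db <- dplyr::src_sqlite(":memory:", create = TRUE)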

Another option would be to write the SQLite database and return a list of tbls. But here again, I'd worry that it could be very confusing to have the result be either a src_sql or a list of tables.
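
As a rough illustration of that option (just a sketch, assuming my_src is the src_sqlite produced by etl_load()), the "list of tbls" return value might amount to:

# build one tbl per table in the database and return them as a named list;
# each element is still backed by SQLite, not pulled into memory
nms <- dplyr::src_tbls(my_src)
tbls <- setNames(lapply(nms, function(nm) dplyr::tbl(my_src, nm)), nms)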

What if we wrote another function, say use(), that would read all of the tables in an etl object into the environment?

my_src <- etl('foo') %>%
  etl_extract() %>%
  etl_transform() %>% 
  etl_load()
# my_src extends dplyr::src_sqlite
use(my_src)
ls() # the tables are now in the environment
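
A rough sketch of what such a use() function could look like, assuming the etl object inherits from dplyr::src_sqlite (the name and signature are placeholders, not working package code):

# copy every table in the src into the calling environment as a data frame
use <- function(src, envir = parent.frame()) {
  for (nm in dplyr::src_tbls(src)) {
    assign(nm, dplyr::collect(dplyr::tbl(src, nm)), envir = envir)
  }
  invisible(src)
}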

For 2, I think I tried this and didn't see how to make it work, or what the savings would be. I see what you are saying about the db connection only being necessary for the load phase. The problem is that for etl_extract you need a place to put the raw data; for etl_transform you need to know where the raw data is and where to put the cleaned data; and for etl_load you need to know where to find the cleaned data and where to put it. So instead of having to do something like:

etl('foo') %>%
  etl_extract(raw = "/tmp/foo") %>%
  etl_transform(raw = "/tmp/foo", clean = "~/data/foo") %>%
  etl_load(clean = "~/data/foo", db = my_db)

it just seemed easier and more logical to have all of those properties associated with the etl object at instantiation:

etl('foo', dir = "~/data/foo", db = my_db) %>%
  etl_extract() %>%
  etl_transform() %>%
  etl_load()
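
Ignoring question 1 for the moment, a bare-bones sketch of that constructor would just stash those locations on the object (the pkg field name is illustrative, not what's in the package):

# store the data directory and database connection on the etl object itself,
# so etl_extract/etl_transform/etl_load can find them without extra arguments
etl <- function(x, db = NULL, dir = tempdir()) {
  obj <- list(pkg = x, db = db, dir = dir)
  class(obj) <- c(paste0("etl_", x), "etl")
  obj
}

Each verb then reads dir and db off the object, which is what makes the short pipeline above possible.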

But if you disagree, I'm happy to reconsider.