Wow, thanks for the input! I figured I was all alone here... I would love to have you as a contributor on this!
I like some of what you've done, but I need more convincing on other parts. My thoughts:
- I like the idea of keeping the `etl` object simple and getting rid of that aspect, but I'm not sure how best to get around it. More on that later...
- On dropping the database connection from the `etl` object: I was trying to avoid having to reference the `DBIConnection` object separately repeatedly, but maybe your way is better.
- Maybe `extract` is just `scrape`, `transform` is just `process`, and `load` is `push`. But I think you still want `init` and `cleanup`.
- Why did you remove all of the piped operations from `etl_update`?
- The `db*` functions don't belong here, but I'm just stashing them here for now. They should be pull requests to `DBI`. I need `dbRunScript` because I want the ability to execute hard-coded SQL table creation scripts (see the sketch after this list).
- For `airlines`, partitioning makes sense, but there is (presently) no functionality to implement partitions in `DBI`.
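To give a rough idea of what I mean by `dbRunScript`, something along these lines might work (just an illustrative sketch, not the actual implementation; the `run_sql_script()` name and the naive splitting on semicolons are assumptions):

```r
# Illustrative sketch only (not the real dbRunScript()): execute a hard-coded
# SQL script by splitting it on semicolons and running each statement via DBI.
# Assumes the statements themselves contain no embedded semicolons.
library(DBI)

run_sql_script <- function(conn, path) {
  sql <- paste(readLines(path), collapse = "\n")
  statements <- trimws(strsplit(sql, ";", fixed = TRUE)[[1]])
  statements <- statements[nchar(statements) > 0]
  for (stmt in statements) {
    dbExecute(conn, stmt)
  }
  invisible(conn)
}
```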
. Please let me know your thoughts! I can tell from your code that you know more about R than I do! :)
I think I will not accept this pull request, but I will incorporate some of your code and add you as a contributor.
Cool, thanks for the quick feedback. I intentionally oversimplified and removed stuff I wasn't sure about so we could talk through them. I hope we can keep this going because I think we can learn lots from each other ;)
An important concept in this design is that the central `etl` object doesn't require a database connection to get started, which is important for a couple of reasons. Responding to your points:
> maybe `extract` is just `scrape`, `transform` is just `process` and `load` is `push`.
Yep! I just changed the wording so it would match ETL ;)
> I think you still want `init` and `cleanup`.
I agree, `cleanup` is certainly useful, but I don't know a ton about how this works in different database implementations. I did export an `etl_cleanup()` generic, but didn't create a default method (I'm hoping you'll take the reins on that ;)

I'm not convinced we need an explicit `init` function. Have a look at the new implementation of `etl_load.default()`. Here are the important bits:
- We let `dbWriteTable` determine that for you. If the relevant tables already exist, then we append to them; otherwise we create new table(s) (unless `dbWriteTable` isn't smart enough to handle this on its own).
- Package authors don't necessarily need to write an `etl_load()` method at all (just `etl_extract()` and `etl_transform()`).

I suppose a weakness with this approach is that it'd be easy to duplicate records. I have some half-baked ideas on how we might help prevent that, but I'm not sure if it will work generically. And, if worst comes to worst, we could provide a (probably inefficient) function to remove duplicates.
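To make the create-or-append idea concrete, here's a rough sketch (illustrative only; the `load_tables()` helper and the one-CSV-per-table layout are assumptions, not the actual `etl_load.default()` code):

```r
# Sketch of the create-or-append idea (illustrative; not the actual
# implementation). Assumes each transformed table lives in its own CSV file
# named after the table.
library(DBI)

load_tables <- function(conn, dir = ".") {
  files <- list.files(dir, pattern = "\\.csv$", full.names = TRUE)
  for (f in files) {
    tbl <- tools::file_path_sans_ext(basename(f))
    dat <- read.csv(f, stringsAsFactors = FALSE)
    if (dbExistsTable(conn, tbl)) {
      dbWriteTable(conn, tbl, dat, append = TRUE)  # table exists: append rows
    } else {
      dbWriteTable(conn, tbl, dat)                 # table missing: create it
    }
  }
  invisible(conn)
}
```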
> Why did you remove all of the piped operations from `etl_update`?
I'm not convinced we need `etl_update()` either. For example, I'm envisioning a workflow like this for starting/updating a database:
```r
# first, initialize and load
library(dplyr)
db <- src_sqlite(tempfile(fileext = ".sqlite3"), create = TRUE)
etl("gameday") %>%
  etl_extract(start = "2008-01-01", end = "2009-01-01") %>%
  etl_transform() %>%
  etl_load(db)

# "update" it with next year's data
etl("gameday") %>%
  etl_extract(start = "2009-01-01", end = "2010-01-01") %>%
  etl_transform() %>%
  etl_load(db)
```
> Again, for reasonably large data sets, having a local cache that isn't a temp directory is important.
Agreed. I'm using a temporary database in the examples just so users don't have to set up/delete a database to run the examples. You don't have to use a temporary database, but you can if you want!
> So why can't we keep that location in the object?
Why do we need the location? DBI methods just require the connection.
> I think I will not accept this pull request
I hope you’ll consider merging this at some point instead of copying over code. If you think anything needs to be changed or can be improved, just let me know!
OK, I think you have convinced me to move all references to the `db` object to the `load` function. I can see how your workflow makes sense.

And it seems like you are baking `init` into `load`. This makes sense, but we'll have to be very careful about this. I still think it might be nice in certain cases to provide the ability to control this explicitly. I'll take a closer look at your changes.
My idea for `cleanup` is two-fold: first, you might want to delete some or all of the files that you created during `extract` and `transform`. Second, you might want to perform some DB operations like `VACUUM ANALYZE` in PostgreSQL or something similar. So part of this is cleaning up after db operations, and part of it is not. So I'm not sure where that fits.
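Roughly, I'm picturing something along these lines (a sketch only; the function name, arguments, and choice of maintenance statement are illustrative and backend-specific):

```r
# Sketch of a possible cleanup step (illustrative; not the package's actual
# etl_cleanup() method): delete intermediate files, then run DB maintenance.
library(DBI)

cleanup <- function(conn, raw_dir, load_dir,
                    delete_raw = FALSE, delete_load = TRUE) {
  if (delete_raw)  unlink(list.files(raw_dir, full.names = TRUE))
  if (delete_load) unlink(list.files(load_dir, full.names = TRUE))
  # Database-side housekeeping; the right statement depends on the backend
  # (e.g. "VACUUM ANALYZE" in PostgreSQL, plain "VACUUM" in SQLite)
  dbExecute(conn, "VACUUM")
  invisible(conn)
}
```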
So maybe the workflow is:
```r
library(dplyr)
db <- src_sqlite(tempfile(fileext = ".sqlite3"), create = TRUE)
etl("gameday") %>%
  etl_extract(start = "2008-01-01", end = "2009-01-01") %>%
  etl_transform() %>%
  etl_load(db, init = TRUE)
```

And within `etl_load` you have the equivalent of `etl_init`. But why not provide:

```r
etl("gameday") %>%
  etl_update(db, start = "2008-01-01", end = "2009-01-01")
```
for convenience?
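It could stay a thin wrapper around the three steps, something like this sketch (argument handling here is purely illustrative):

```r
# Possible convenience wrapper (a sketch; assumes %>% is available via dplyr
# or magrittr, and that ... carries arguments such as start/end to the
# extract step)
etl_update <- function(obj, db, ...) {
  obj %>%
    etl_extract(...) %>%
    etl_transform() %>%
    etl_load(db)
}
```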
The primary key thing is another tricky one. I haven't fully digested how the `overwrite` and `append` arguments to `dbWriteTable` map to `INSERT IGNORE` and `INSERT REPLACE`. Again, this might be an issue for `DBI`.
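For what it's worth, one way to approximate `INSERT IGNORE` through plain DBI calls is to stage the new rows and let a primary key reject duplicates (a sketch in SQLite syntax; the `games` table and its columns are made up for illustration):

```r
# Sketch: stage incoming rows, then insert them while a primary key silently
# drops duplicates (SQLite's INSERT OR IGNORE; MySQL would use INSERT IGNORE).
library(DBI)
library(RSQLite)

con <- dbConnect(SQLite(), ":memory:")
dbExecute(con, "CREATE TABLE games (gameday_link TEXT PRIMARY KEY, home TEXT, away TEXT)")

new_rows <- data.frame(
  gameday_link = c("gid_1", "gid_2"),
  home = c("BOS", "NYY"),
  away = c("TOR", "TB"),
  stringsAsFactors = FALSE
)

dbWriteTable(con, "staging", new_rows, overwrite = TRUE)  # temporary staging table
dbExecute(con, "INSERT OR IGNORE INTO games SELECT * FROM staging")
dbExecute(con, "DROP TABLE staging")
dbDisconnect(con)
```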
I haven't forgotten about this. I've just been tied up with the start of the semester. I hope to come back to this soon.
@cpsievert can you fetch upstream and push this again?
Also, I am having a really hard time understanding all of the changes, and that's part of why it's taken me so long to respond to this. I think most of your ideas are good, but I need to understand it so that it doesn't screw up what I'm trying to do downstream.
This might be really annoying or not even possible, but do you think you could separate this into two or three smaller pull requests, each centered around a single idea? If not, maybe we could talk through this sometime?
In a few weeks I might have time to make this more modular (I have a thesis proposal coming up).
@beanumber @cpsievert Was this merge supposed to succeed? Ben, I talked to you a bit on Twitter yesterday about the package. I'm interested in building some parallel and multiprocessor stuff off of this framework.
What I have in mind right now would probably just be "vignette" type stuff. Should I work off Carson's fork?
@keberwein No, I think this PR is very old now. The current master is where things are. I'd be very excited to have some help strengthening this. Feel free to fork and send a PR. Maybe we should talk about your plans?
My basic plan is to build a proof-of-concept ETL for high-performance computing, something scalable enough to run in an enterprise production environment. My thought is to use the `parallel` and `multicore` packages to build something on top of your framework. My test box has 32 cores, so it would probably be a good candidate.
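Just to sketch the flavor of it (illustrative only; chunking the extract by month with `start`/`end` arguments is my assumption about how the work could be split up):

```r
# Rough sketch of a parallel extract (illustrative). mclapply() forks, so this
# is Unix-only; parLapply() with a cluster would be the Windows equivalent.
library(parallel)
library(dplyr)
library(etl)

months <- seq(as.Date("2008-01-01"), by = "month", length.out = 12)

results <- mclapply(months, function(start) {
  end <- seq(start, by = "month", length.out = 2)[2]  # first day of next month
  etl("gameday") %>%
    etl_extract(start = start, end = end)
}, mc.cores = detectCores())
```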
I might even try to use MLB data from Carson's package; there's definitely plenty of data there, and it would be reproducible.
Like I said, I think this would be more of a vignette. I'm a big fan of keeping packages as simple as possible (less stuff to break). However, I'm sure I'll have a few ideas for `etl` along the way.
This package is a wonderful idea! Having a "standard" API for ETL would certainly be nice, especially if the community adopts it. However, if the goal is for people to use this, it has to be simple, and we should minimize its assumptions/dependencies. This pull request simplifies the design with that goal in mind.
If you think this is a good idea, I’ll start working on a vignette to show how to make some non-trivial extensions.