beanumber / etl

R package to facilitate ETL operations

API redesign #3

Closed · cpsievert closed this 6 years ago

cpsievert commented 9 years ago

This package is a wonderful idea! Having a ‘standard’ API for ETL would certainly be nice, especially if the community adopts it. However, if the goal is for people to use this, it has to be simple, and we should minimize its assumptions/dependencies. For these reasons, this pull request does the following:

  1. Changes the license from GPL to MIT. If we force people into using GPL, they’ll be less likely to use it.
  2. Encourages the pure, predictable, and pipeable paradigm, but doesn’t make assumptions about the central (etl) object, other than it has a ‘data’ property.
  3. Uses the familiar extract, transform, load verbiage ;)

If you think this is a good idea, I’ll start working on a vignette to show how to make some non-trivial extensions.

beanumber commented 9 years ago

Wow, thanks for the input! I figured I was all alone here... I would love to have you as a contributor on this!

I like some of what you've done, but I need more convincing on other parts. My thoughts:

Please let me know your thoughts! I can tell from your code that you know more about R than I do! :)

I think I will not accept this pull request, but I will incorporate some of your code and add you as a contributor.

cpsievert commented 9 years ago

Cool, thanks for the quick feedback. I intentionally oversimplified and removed stuff I wasn't sure about so we could talk through it. I hope we can keep this going because I think we can learn lots from each other ;)

An important concept in this design is that the central “etl object” doesn’t require a database connection to get started. This is important for a couple reasons:

  1. Some might want to just extract/transform, and not work with databases at all. In my experience, a lot of pitchRx users just want to acquire the data then write to csv or something.
  2. It’s better practice to initialize the database connection after extract/transform since the connection may be dropped if the extract/transform steps take a long time.
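To make the "no connection required" idea concrete, the constructor itself can be tiny. Here is a simplified sketch (not the literal code in the PR) of the etl object as a classed list with a data slot, plus the three S3 generics:

# Simplified sketch: the etl object is just a classed list with a `data`
# slot; no database connection is needed to create it.
etl <- function(source, ...) {
  obj <- list(data = NULL, ...)
  class(obj) <- c(source, "etl")
  obj
}

# The three verbs are plain S3 generics that dispatch on the data source.
etl_extract <- function(obj, ...) UseMethod("etl_extract")
etl_transform <- function(obj, ...) UseMethod("etl_transform")
etl_load <- function(obj, ...) UseMethod("etl_load")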

> maybe extract is just scrape, transform is just process and load is push.

Yep! I just changed the wording so it would match ETL ;)

> I think you still want init and cleanup.

I agree, cleanup is certainly useful, but I don’t know a ton about how this works in different database implementations. I did export an etl_cleanup() generic, but didn’t create a default method (I’m hoping you’ll take the reins on that ;)

I’m not convinced we need an explicit init function. Have a look at the new implementation of etl_load.default(). Here are the important bits:
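In rough terms (this is just the gist, not the literal diff; it assumes obj$data is a named list of data frames and that db is a dplyr src wrapping a DBI connection in db$con):

# Gist of the approach: create each table on the first load and append on
# later loads, so no separate init step is needed.
etl_load.default <- function(obj, db, ...) {
  con <- db$con
  for (nm in names(obj$data)) {
    if (DBI::dbExistsTable(con, nm)) {
      DBI::dbWriteTable(con, nm, obj$data[[nm]], append = TRUE)
    } else {
      DBI::dbWriteTable(con, nm, obj$data[[nm]])
    }
  }
  invisible(obj)
}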

I suppose a weakness with this approach is that it’d be easy to duplicate records. I have some half-baked ideas on how we might help prevent that, but I’m not sure if it will work generically. And, if worst comes to worst, we could provide a (probably inefficient) function to remove duplicates.

> Why did you remove all of the piped operations from etl_update?

I’m not convinced we need etl_update() either. For example, I’m envisioning a workflow like this for starting/updating a database:

# first, initialize and load 
library(dplyr)
db <- src_sqlite(tempfile(fileext = ".sqlite3"), create = TRUE)
etl("gameday") %>%
  etl_extract(start = "2008-01-01", end = "2009-01-01") %>%
  etl_transform() %>%
  etl_load(db)
# "update" it with next year's data
etl("gameday") %>%
  etl_extract(start = "2009-01-01", end = "2010-01-01") %>%
  etl_transform() %>%
  etl_load(db)

> Again, for reasonably large data sets, having a local cache that isn't a temp directory is important.

Agreed. I’m using a temporary database in the examples just so users don’t have to set up/delete a database to run them. You don't have to use a temporary database, but you can if you want!

> So why can't we keep that location in the object?

Why do we need the location? DBI methods just require the connection.

> I think I will not accept this pull request

I hope you’ll consider merging this at some point instead of copying over code. If you think anything needs to be changed or can be improved, just let me know!

beanumber commented 9 years ago

OK, I think you have convinced me to move all references to the db object to the load function. I can see how your workflow makes sense.

And it seems like you are baking init into load. This makes sense, but we'll have to be very careful about this. I still think it might be nice in certain cases to provide the ability to control this explicitly. I'll take a closer look at your changes.

My idea for cleanup is two-fold: first, you might want to delete some or all of the files that you created during extract and transform. Second, you might want to perform some DB operations like VACUUM ANALYZE in PostgreSQL or something similar. So part of this is cleaning up after db operations, and part of it is not. So I'm not sure where that fits.
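Something like this, maybe (just a sketch; the raw_dir slot and the argument names here are made up):

# Hypothetical sketch of etl_cleanup(): optionally delete files cached by
# extract/transform, and optionally tidy up the database afterwards.
etl_cleanup.default <- function(obj, db, delete_raw = FALSE, vacuum = FALSE, ...) {
  if (delete_raw) {
    unlink(obj$raw_dir, recursive = TRUE)  # files written during extract/transform
  }
  if (vacuum) {
    DBI::dbExecute(db$con, "VACUUM")       # or VACUUM ANALYZE on PostgreSQL
  }
  invisible(obj)
}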

So maybe the workflow is:

library(dplyr)
db <- src_sqlite(tempfile(fileext = ".sqlite3"), create = TRUE)
etl("gameday") %>%
  etl_extract(start = "2008-01-01", end = "2009-01-01") %>%
  etl_transform() %>%
  etl_load(db, init = TRUE)

And within etl_load you have the equivalent of etl_init. But why not provide:

etl("gameday") %>%
  etl_update(db, start = "2008-01-01", end = "2009-01-01")

for convenience?
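Internally, etl_update() could just be a thin wrapper around the three verbs, something like this sketch (it assumes dplyr/magrittr is attached for the pipe):

# Sketch: etl_update() as a convenience wrapper over extract/transform/load.
etl_update <- function(obj, db, ...) {
  obj %>%
    etl_extract(...) %>%
    etl_transform() %>%
    etl_load(db)
}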

The primary key thing is another tricky one. I haven't fully digested how the overwrite and append arguments to dbWriteTable map to INSERT IGNORE and INSERT REPLACE. Again, this might be an issue for DBI.
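As far as I can tell, neither option gives key-aware behavior (sketch of my current understanding; con and df are placeholder names):

# overwrite = TRUE drops and recreates the table; append = TRUE blindly adds
# rows, so neither checks keys the way INSERT IGNORE / REPLACE would.
DBI::dbWriteTable(con, "atbat", df, overwrite = TRUE)
DBI::dbWriteTable(con, "atbat", df, append = TRUE)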

beanumber commented 9 years ago

I haven't forgotten about this. I've just been tied up with the start of the semester. I hope to come back to this soon.

cpsievert commented 9 years ago

Yea, I've been busy too, and will be for at least the next month, so no rush.

BTW, have you seen/heard of dat? There is already an R package, and it might be worth looking into for adding support for versioning (among other things).

beanumber commented 8 years ago

@cpsievert can you fetch upstream and push this again?

Also, I am having a really hard time understanding all of the changes, and that's part of why it's taken me so long to respond to this. I think most of your ideas are good, but I need to understand them so that they don't screw up what I'm trying to do downstream.

This might be really annoying or not even possible, but do you think you could separate this into two or three smaller pull requests, each centered around a single idea? If not, maybe we could talk through this sometime?

cpsievert commented 8 years ago

In a few weeks I might have time to make this more modular (I have a thesis proposal coming up).

keberwein commented 7 years ago

@beanumber @cpsievert Was this merge supposed to succeed? Ben, I talked to you a bit on Twitter yesterday about the package. I'm interested in building some parallel and multiprocessor stuff off of this framework.

What I have in mind right now would probably just be "vignette" type stuff. Should I work off Carson's fork?

beanumber commented 7 years ago

@keberwein No, I think this PR is very old now. The current master is where things are. I'd be very excited to have some help strengthening this. Feel free to fork and send a PR. Maybe we should talk about your plans?

keberwein commented 7 years ago

My basic plan is to build a proof-of-concept ETL for high-performance computing, something scalable enough to run in an enterprise production environment. My thought is to use the parallel and multicore packages to build something on top of your framework. My test box has 32 cores, so it would probably be a good candidate.

I might even try to use MLB data from Carson's package, there's definitely plenty of data there, and it would be reproducible.

Like I said, I think this would be more of a vignette. I'm a big fan of keeping packages as simple as possible (less stuff to break). However, I'm sure I'll have a few ideas for etl along the way.
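Very roughly, the kind of thing I have in mind looks like the sketch below; it borrows the start/end arguments from the examples earlier in this thread and chunks the work by year, both of which are assumptions on my part:

# Rough sketch: extract/transform year-sized chunks in parallel with
# parallel::mclapply, then load them sequentially into one database.
# (mc.cores > 1 assumes a Unix-alike box.)
library(parallel)
library(dplyr)

db <- src_sqlite("gameday.sqlite3", create = TRUE)

chunks <- mclapply(2008:2015, function(yr) {
  etl("gameday") %>%
    etl_extract(start = paste0(yr, "-01-01"), end = paste0(yr, "-12-31")) %>%
    etl_transform()
}, mc.cores = 8)

for (chunk in chunks) {
  etl_load(chunk, db)
}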