Wow, thanks for the input! I figured I was all alone here... I would love to have you as a contributor on this!
I like some of what you've done, but I need more convincing on other parts. My thoughts:
- I like the idea of keeping the `etl` object simple and getting rid of that aspect, but I'm not sure how best to get around it. More on that later...
- On dropping the database connection from the `etl` object: I was trying to avoid having to reference the `DBIConnection` object separately repeatedly, but maybe your way is better.
- Maybe `extract` is just `scrape`, `transform` is just `process`, and `load` is `push`. But I think you still want `init` and `cleanup`.
- Why did you remove all of the piped operations from `etl_update`?
- The `db*` functions don't belong here, but I'm just stashing them here for now. They should be pull requests to `DBI`. I need `dbRunScript` because I want the ability to execute hard-coded SQL table creation scripts (see the sketch after this list).
- For `airlines`, partitioning makes sense, but there is (presently) no functionality to implement partitions in `DBI`.
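To give a rough idea of what I mean by `dbRunScript`, something along these lines might work (just an illustrative sketch, not the actual implementation; the `run_sql_script()` name and the naive splitting on semicolons are assumptions):

```r
# Illustrative sketch only (not the real dbRunScript()): execute a hard-coded
# SQL script by splitting it on semicolons and running each statement via DBI.
# Assumes the statements themselves contain no embedded semicolons.
library(DBI)

run_sql_script <- function(conn, path) {
  sql <- paste(readLines(path), collapse = "\n")
  statements <- trimws(strsplit(sql, ";", fixed = TRUE)[[1]])
  statements <- statements[nchar(statements) > 0]
  for (stmt in statements) {
    dbExecute(conn, stmt)
  }
  invisible(conn)
}
```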
. Please let me know your thoughts! I can tell from your code that you know more about R than I do! :)
I think I will not accept this pull request, but I will incorporate some of your code and add you as a contributor.
Cool, thanks for the quick feedback. I intentionally oversimplified and removed stuff I wasn't sure about so we could talk through them. I hope we can keep this going because I think we can learn lots from each other ;)
An important concept in this design is that the central `etl` object doesn't require a database connection to get started, which is important for a couple of reasons. Responding to your points:
> maybe `extract` is just `scrape`, `transform` is just `process` and `load` is `push`.
Yep! I just changed the wording so it would match ETL ;)
> I think you still want `init` and `cleanup`.
I agree, `cleanup` is certainly useful, but I don't know a ton about how this works in different database implementations. I did export an `etl_cleanup()` generic, but didn't create a default method (I'm hoping you'll take the reins on that ;)

I'm not convinced we need an explicit `init` function. Have a look at the new implementation of `etl_load.default()`. Here are the important bits:
- We let `dbWriteTable` determine that for you. If the relevant tables already exist, then we append to them; otherwise we create new table(s) (unless `dbWriteTable` isn't smart enough to handle this on its own).
- Package authors don't necessarily need to write an `etl_load()` method at all (just `etl_extract()` and `etl_transform()`).

I suppose a weakness with this approach is that it'd be easy to duplicate records. I have some half-baked ideas on how we might help prevent that, but I'm not sure if it will work generically. And, if worst comes to worst, we could provide a (probably inefficient) function to remove duplicates.
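To make the create-or-append idea concrete, here's a rough sketch (illustrative only; the `load_tables()` helper and the one-CSV-per-table layout are assumptions, not the actual `etl_load.default()` code):

```r
# Sketch of the create-or-append idea (illustrative; not the actual
# implementation). Assumes each transformed table lives in its own CSV file
# named after the table.
library(DBI)

load_tables <- function(conn, dir = ".") {
  files <- list.files(dir, pattern = "\\.csv$", full.names = TRUE)
  for (f in files) {
    tbl <- tools::file_path_sans_ext(basename(f))
    dat <- read.csv(f, stringsAsFactors = FALSE)
    if (dbExistsTable(conn, tbl)) {
      dbWriteTable(conn, tbl, dat, append = TRUE)  # table exists: append rows
    } else {
      dbWriteTable(conn, tbl, dat)                 # table missing: create it
    }
  }
  invisible(conn)
}
```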
> Why did you remove all of the piped operations from `etl_update`?
I'm not convinced we need `etl_update()` either. For example, I'm envisioning a workflow like this for starting/updating a database:
```r
# first, initialize and load
library(dplyr)
db <- src_sqlite(tempfile(fileext = ".sqlite3"), create = TRUE)
etl("gameday") %>%
  etl_extract(start = "2008-01-01", end = "2009-01-01") %>%
  etl_transform() %>%
  etl_load(db)

# "update" it with next year's data
etl("gameday") %>%
  etl_extract(start = "2009-01-01", end = "2010-01-01") %>%
  etl_transform() %>%
  etl_load(db)
```
> Again, for reasonably large data sets, having a local cache that isn't a temp directory is important.
Agreed. I'm using a temporary database in the examples just so users don't have to set up/delete a database to run the examples. You don't have to use a temporary database, but you can if you want!
> So why can't we keep that location in the object?
Why do we need the location? DBI methods just require the connection.
> I think I will not accept this pull request
I hope you’ll consider merging this at some point instead of copying over code. If you think anything needs to be changed or can be improved, just let me know!
OK, I think you have convinced me to move all references to the `db` object to the `load` function. I can see how your workflow makes sense.

And it seems like you are baking `init` into `load`. This makes sense, but we'll have to be very careful about this. I still think it might be nice in certain cases to provide the ability to control this explicitly. I'll take a closer look at your changes.
My idea for `cleanup` is two-fold: first, you might want to delete some or all of the files that you created during `extract` and `transform`. Second, you might want to perform some DB operations like `VACUUM ANALYZE` in PostgreSQL or something similar. So part of this is cleaning up after db operations, and part of it is not. So I'm not sure where that fits.
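Roughly, I'm picturing something along these lines (a sketch only; the function name, arguments, and choice of maintenance statement are illustrative and backend-specific):

```r
# Sketch of a possible cleanup step (illustrative; not the package's actual
# etl_cleanup() method): delete intermediate files, then run DB maintenance.
library(DBI)

cleanup <- function(conn, raw_dir, load_dir,
                    delete_raw = FALSE, delete_load = TRUE) {
  if (delete_raw)  unlink(list.files(raw_dir, full.names = TRUE))
  if (delete_load) unlink(list.files(load_dir, full.names = TRUE))
  # Database-side housekeeping; the right statement depends on the backend
  # (e.g. "VACUUM ANALYZE" in PostgreSQL, plain "VACUUM" in SQLite)
  dbExecute(conn, "VACUUM")
  invisible(conn)
}
```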
So maybe the workflow is:
```r
library(dplyr)
db <- src_sqlite(tempfile(fileext = ".sqlite3"), create = TRUE)
etl("gameday") %>%
  etl_extract(start = "2008-01-01", end = "2009-01-01") %>%
  etl_transform() %>%
  etl_load(db, init = TRUE)
```

And within `etl_load` you have the equivalent of `etl_init`. But why not provide:

```r
etl("gameday") %>%
  etl_update(db, start = "2008-01-01", end = "2009-01-01")
```
for convenience?
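It could stay a thin wrapper around the three steps, something like this sketch (argument handling here is purely illustrative):

```r
# Possible convenience wrapper (a sketch; assumes %>% is available via dplyr
# or magrittr, and that ... carries arguments such as start/end to the
# extract step)
etl_update <- function(obj, db, ...) {
  obj %>%
    etl_extract(...) %>%
    etl_transform() %>%
    etl_load(db)
}
```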
The primary key thing is another tricky one. I haven't fully digested how the `overwrite` and `append` arguments to `dbWriteTable` map to `INSERT IGNORE` and `INSERT REPLACE`. Again, this might be an issue for `DBI`.
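For what it's worth, one way to approximate `INSERT IGNORE` through plain DBI calls is to stage the new rows and let a primary key reject duplicates (a sketch in SQLite syntax; the `games` table and its columns are made up for illustration):

```r
# Sketch: stage incoming rows, then insert them while a primary key silently
# drops duplicates (SQLite's INSERT OR IGNORE; MySQL would use INSERT IGNORE).
library(DBI)
library(RSQLite)

con <- dbConnect(SQLite(), ":memory:")
dbExecute(con, "CREATE TABLE games (gameday_link TEXT PRIMARY KEY, home TEXT, away TEXT)")

new_rows <- data.frame(
  gameday_link = c("gid_1", "gid_2"),
  home = c("BOS", "NYY"),
  away = c("TOR", "TB"),
  stringsAsFactors = FALSE
)

dbWriteTable(con, "staging", new_rows, overwrite = TRUE)  # temporary staging table
dbExecute(con, "INSERT OR IGNORE INTO games SELECT * FROM staging")
dbExecute(con, "DROP TABLE staging")
dbDisconnect(con)
```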
I haven't forgotten about this. I've just been tied up with the start of the semester. I hope to come back to this soon.
@cpsievert can you fetch upstream and push this again?
Also, I am having a really hard time understanding all of the changes, and that's part of why it's taken me so long to respond to this. I think most of your ideas are good, but I need to understand it so that it doesn't screw up what I'm trying to do downstream.
This might be really annoying or not even possible, but do you think you could separate this into two or three smaller pull requests, each centered around a single idea? If not, maybe we could talk through this sometime?
In a few weeks I might have time to make this more modular (I have a thesis proposal coming up).
@beanumber @cpsievert Was this merge supposed to succeed? Ben, I talked to you a bit on Twitter yesterday about the package. I'm interested in building some parallel and multiprocessor stuff off of this framework.
What I have in mind right now would probably just be "vignette" type stuff. Should I work off Carson's fork?
@keberwein No, I think this PR is very old now. The current master is where things are. I'd be very excited to have some help strengthening this. Feel free to fork and send a PR. Maybe we should talk about your plans?
My basic plan is to build a proof-of-concept ETL for high-performance computing, something scalable enough to run in an enterprise production environment. My thought is to use the `parallel` and `multicore` packages to build something on top of your framework. My test box has 32 cores, so it would probably be a good candidate.
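Just to sketch the flavor of it (illustrative only; chunking the extract by month with `start`/`end` arguments is my assumption about how the work could be split up):

```r
# Rough sketch of a parallel extract (illustrative). mclapply() forks, so this
# is Unix-only; parLapply() with a cluster would be the Windows equivalent.
library(parallel)
library(dplyr)
library(etl)

months <- seq(as.Date("2008-01-01"), by = "month", length.out = 12)

results <- mclapply(months, function(start) {
  end <- seq(start, by = "month", length.out = 2)[2]  # first day of next month
  etl("gameday") %>%
    etl_extract(start = start, end = end)
}, mc.cores = detectCores())
```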
I might even try to use MLB data from Carson's package; there's definitely plenty of data there, and it would be reproducible.
Like I said, I think this would be more of a vignette. I'm a big fan of keeping packages as simple as possible (less stuff to break). However, I'm sure I'll have a few ideas for `etl` along the way.
This package is a wonderful idea! Having a "standard" API for ETL would certainly be nice, especially if the community adopts it. However, if the goal is for people to use this, it has to be simple, and we should minimize its assumptions/dependencies. This pull request simplifies the design with that goal in mind.
If you think this is a good idea, I’ll start working on a vignette to show how to make some non-trivial extensions.