Closed karldw closed 6 years ago
It seems like postgres-copy ought to be something that's eventually made available via sqlalchemy -- we're certainly not the only people trying to pull in big tables quickly, and I think many databases have this kind of low-level, no-validation text-slurping functionality. It might be worth poking around to see if it's been added. I remember looking at odo when we were trying to speed this process up, and don't remember why I ended up not using it. It should be pretty easy to test to_sql & postgres copy side by side -- if you run the extract & transform steps (once @cmgosnell checks in her re-org, if it's not all in already) you can take the prepared dataframe and try to load it using the CSV dump loader, and to_sql on one of the big EIA 923 tables.
My guess is the dbfread package will never be part of a standard python distribution, and we need it to jailbreak FERC Form 1, so I think there'll always need to be a bit of additional package installation -- have you done any python packaging work? Eventually it'd be nice if we could codify these dependencies and give folks a python setup.py install option that just grabs all these requirements and puts them where they need to be. Is there a standard way to pull dependencies using git? Or do they need to be available via e.g. pip?
Seems easy enough to test!
I've looked at python packaging, but never done it in a real project. I'm guessing the easiest way is to have a good setup.py, and eventually publish PUDL on PyPI, with dbfread listed as a dependency. I don't know about pulling dependencies with git, but that sounds harder.
FYI: pandas 0.24.0 will have more direct means to COPY data: https://github.com/pandas-dev/pandas/pull/21401
Mostly because I'm curious, but also in case pandas'
to_sql
dramatically improves in the future. (For new users, it would be nice if all the packages were in Anaconda or at least conda-forge. sqlalchemy-postgres-copy is not.)Here are other approaches that could be thrown into the comparison: