catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License
456 stars 106 forks source link

Maybe document the postgres_copy speed difference? #149

Closed karldw closed 6 years ago

karldw commented 6 years ago

Mostly because I'm curious, but also in case pandas' to_sql dramatically improves in the future. (For new users, it would be nice if all the packages were in Anaconda or at least conda-forge. sqlalchemy-postgres-copy is not.)

Here are other approaches that could be thrown into the comparison:

zaneselvans commented 6 years ago

It seems like postgres-copy ought to be something that's eventually made available via sqlalchemy -- we're certainly not the only people trying to pull in big tables quickly, and I think many databases have this kind of low-level, no-validation text-slurping functionality. It might be worth poking around to see if it's been added. I remember looking at odo when we were trying to speed this process up, and don't remember why I ended up not using it. It should be pretty easy to test to_sql & postgres copy side by side -- if you run the extract & transform steps (once @cmgosnell checks in her re-org, if it's not all in already) you can take the prepared dataframe and try to load it using the CSV dump loader, and to_sql on one of the big EIA 923 tables.

My guess is the dbfread package will never be part of a standard python distribution, and we need it to jailbreak FERC Form 1, so I think there'll always need to be a bit of additional package installation -- have you done any python packaging work? Eventually it'd be nice if we could codify these dependencies and give folks a python setup.py install option that just grabs all these requirements and puts them where they need to be. Is there a standard way to pull dependencies using git? Or do they need to be available via e.g. pip?

karldw commented 6 years ago

Seems easy enough to test!

I've looked at python packaging, but never done it in a real project. I'm guessing the easiest way is to have a good setup.py, and eventually publish PUDL on PyPI, with dbfread listed as a dependency. I don't know about pulling dependencies with git, but that sounds harder.

karldw commented 6 years ago

FYI: pandas 0.24.0 will have more direct means to COPY data: https://github.com/pandas-dev/pandas/pull/21401