UDST / orca

Python library for task orchestration
https://udst.github.io/orca/
BSD 3-Clause "New" or "Revised" License
53 stars 21 forks source link

Allow the "run" function to store local columns only #23

Closed hanase closed 7 years ago

hanase commented 7 years ago

This addresses the issue discussed here (by @janowicz on Feb 21st). Currently the run function writes out all variables regardless if they are used or not which can result in huge files. This happens for both, the base year data as well as every orca iteration, because the write_tables() function simply calls to_frame().

This PR adds two boolean arguments to the run function (out_base_local and out_run_local). If True, only local columns are stored for out_base_tables and out_run_tables, respectively. The write_tables function gets a boolean argument called "local". (Note that after PR #22 of @bridwell is accepted, the write_tables function can pass an expression to to_frame() for obtaining local columns only.)

In our case this change reduces the output file size more than 4 times.

Please feel free to rename the new arguments if needed.

coveralls commented 7 years ago

Coverage Status

Coverage increased (+0.005%) to 96.536% when pulling c80130b42baf619c4834567ae7bf5e696378fc66 on hanase:write_local_columns_only into 8e7b87b0c57cd1c29d768906776faa3cfafb67e7 on UDST:master.

hanase commented 7 years ago

Note that as suggested by Eddie, the default behavior now is to save local columns only in the base year, and all columns in run years. This changes the original behavior of the run function which saved all columns in the base year.

janowicz commented 7 years ago

This is great, thanks @hanase! Appreciate the adding of docstrings for certain of the run parameters that weren't documented before too.

Any thoughts from people about the appropriate defaults for out_base_local or out_run_local? It may make sense to keep the defaults for both of these as False for the moment so as to not break compatibility for anyone who is using computed base-year variables that come from using data_out. For me, I'll usually set them both to True when using :).

bridwell commented 7 years ago

I kind of like idea of setting them both to True by default, but am OK with whatever.

hanase commented 7 years ago

I agree with @bridwell that it would make sense to set both arguments to True by default. Having either of them False might be useful for debugging and exploration, which should not be the default. But I'm also OK with whatever you guys decide.

janowicz commented 7 years ago

Agreed that the most sensible default is setting both arguments to True. Lets go with that. @hanase could you please set the default for out_run_local to True in this PR?

coveralls commented 7 years ago

Coverage Status

Coverage increased (+0.005%) to 96.536% when pulling 580a6827990671f6ae247ad856257aa840674835 on hanase:write_local_columns_only into 8e7b87b0c57cd1c29d768906776faa3cfafb67e7 on UDST:master.

hanase commented 7 years ago

Done. Sorry - my commit message is wrong - the argument is set to True now.