NREL / gdx-pandas

Python interface to read and write GAMS GDX files using pandas.DataFrames as the intermediate data format.
BSD 3-Clause "New" or "Revised" License
43 stars 16 forks

Speedups #85

Closed jebob closed 3 years ago

jebob commented 3 years ago

General speedups:

Tests:

Test design

Extract gdx from [input_gdx.zip](https://github.com/NREL/gdx-pandas/files/5593039/input_gdx.zip)

```python
import gdxpds
import profilehooks


@profilehooks.profile
def read_big():
    # 1 symbol, 3000x3000 = 9 million elements
    x = gdxpds.to_dataframes("big.gdx")


@profilehooks.profile
def roundtrip_many():
    # 1024 symbols, 10x10 = 100 elements each
    x = gdxpds.to_dataframes("many.gdx")
    gdxpds.to_gdx(x, "many_out.gdx")


read_big()
roundtrip_many()
```
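For readers without `profilehooks` installed, a report in the same layout ("Ordered by: cumulative time, internal time, call count", limited to 40 rows) can be produced with the standard library alone. The sketch below is an illustration, not part of the test script above: `profile_call` and `busy_work` are hypothetical names, and `busy_work` is a stand-in workload so the example runs without GAMS or the GDX files.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, limit=40, **kwargs):
    """Run func under cProfile and return (result, report text)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # Sort by cumulative time, then internal time, then call count,
    # and keep only the first `limit` rows of the report.
    stats.sort_stats("cumulative", "time", "calls").print_stats(limit)
    return result, buffer.getvalue()


def busy_work(n):
    # Hypothetical stand-in workload; swap in a gdxpds call when GAMS is available.
    return sum(i * i for i in range(n))


if __name__ == "__main__":
    total, report = profile_call(busy_work, 100000)
    print(report)
```

Returning the report as a string instead of printing it directly makes it easy to save the "before" and "after" runs to files and diff them.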
Test time before 98.4 seconds

```
*** PROFILER RESULTS ***
roundtrip_many (E:/Projects/gdx-pandas playground/speed.py:10)
function called 1 times

         41469697 function calls (40927215 primitive calls) in 54.715 seconds

   Ordered by: cumulative time, internal time, call count
   List reduced from 924 to 40 due to restriction <40>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000   54.715   54.715 speed.py:10(roundtrip_many)
        1    0.000    0.000   27.737   27.737 write_gdx.py:143(to_gdx)
        1    0.000    0.000   27.737   27.737 write_gdx.py:94(save_gdx)
        1    0.000    0.000   26.978   26.978 read_gdx.py:105(to_dataframes)
        1    0.000    0.000   26.781   26.781 read_gdx.py:49(__init__)
        1    0.019    0.019   26.693   26.693 gdx.py:223(read)
     2168    0.020    0.000   25.468    0.012 _collections_abc.py:966(append)
     2168    0.262    0.000   25.444    0.012 gdx.py:327(insert)
     2168    0.540    0.000   25.174    0.012 gdx.py:330()
  1180480    0.673    0.000   24.592    0.000 _collections_abc.py:879(__iter__)
  1181564    0.530    0.000   23.922    0.000 gdx.py:303(__getitem__)
  1181564   23.144    0.000   23.392    0.000 gdx.py:348(_name_key)
        2    0.003    0.001   20.209   10.104 write_gdx.py:86(gdx)
     1084    0.013    0.000   20.106    0.019 write_gdx.py:99(__add_symbol_to_gdx)
    23851    0.129    0.000   17.461    0.001 frame.py:334(__init__)
     6507    0.032    0.000   16.762    0.003 gdx.py:636(dims)
     6507    0.123    0.000   16.604    0.003 frame.py:426(_init_dict)
     5423    0.048    0.000   16.296    0.003 gdx.py:732(_init_dataframe)
     2171    0.030    0.000    9.823    0.005 gdx.py:449(__init__)
     4336    0.095    0.000    9.812    0.002 gdx.py:666(dataframe)
        1    0.004    0.004    7.528    7.528 gdx.py:266(write)
     1085    0.451    0.000    7.370    0.007 gdx.py:827(write)
     1084    0.231    0.000    7.146    0.007 gdx.py:794(load)
   133399    0.115    0.000    5.105    0.000 base.py:4914(_ensure_index)
     7591    0.024    0.000    4.767    0.001 frame.py:7349(_arrays_to_mgr)
     5423    0.022    0.000    4.546    0.001 indexing.py:182(__setitem__)
     1084    0.028    0.000    4.372    0.004 special.py:114(convert_np_to_gdx_svs)
40137/34713    0.268    0.000    4.123    0.000 series.py:166(__init__)
    16270    0.327    0.000    3.850    0.000 {pandas._libs.lib.clean_index_list}
     5423    0.018    0.000    3.395    0.001 indexing.py:152(_get_setitem_indexer)
     5423    0.047    0.000    3.358    0.001 indexing.py:1225(_convert_to_indexer)
65081/16271    0.350    0.000    3.145    0.000 :966(_find_and_load)
     7603    0.016    0.000    2.927    0.000 base.py:3071(get_loc)
     7603    0.038    0.000    2.911    0.000 {method 'get_loc' of 'pandas._libs.index.IndexEngine' objects}
     5423    0.009    0.000    2.874    0.001 base.py:52(__str__)
     5423    0.058    0.000    2.865    0.001 series.py:1221(__unicode__)
65081/16271    0.132    0.000    2.738    0.000 :936(_find_and_load_unlocked)
48811/16271    0.020    0.000    2.702    0.000 :211(_call_with_frames_removed)
48810/16270    0.047    0.000    2.695    0.000 {built-in method builtins.__import__}
7115595/7115117    1.404    0.000    2.465    0.000 {built-in method builtins.isinstance}


*** PROFILER RESULTS ***
read_big (E:/Projects/gdx-pandas playground/speed.py:6)
function called 1 times

         99027929 function calls (99026152 primitive calls) in 43.780 seconds

   Ordered by: cumulative time, internal time, call count
   List reduced from 786 to 40 due to restriction <40>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000   43.780   43.780 speed.py:6(read_big)
        1    0.000    0.000   43.780   43.780 read_gdx.py:105(to_dataframes)
        1    0.000    0.000   43.548   43.548 read_gdx.py:49(__init__)
        1    0.365    0.365   43.457   43.457 gdx.py:223(read)
        1   20.183   20.183   43.080   43.080 gdx.py:794(load)
  9000009    4.903    0.000    8.712    0.000 gdx.py:580(value_cols)
  9000000    1.993    0.000    6.096    0.000 gdxcc.py:426(gdxDataReadStr)
  9000000    4.102    0.000    4.102    0.000 {built-in method _gdxcc.gdxDataReadStr}
  9000011    2.050    0.000    2.975    0.000 enum.py:579(__hash__)
        2    0.171    0.086    1.800    0.900 gdx.py:666(dataframe)
 18000034    1.675    0.000    1.675    0.000 gdx.py:516(data_type)
  9000000    1.470    0.000    1.470    0.000 gdx.py:813()
       13    0.000    0.000    1.240    0.095 frame.py:334(__init__)
        1    0.338    0.338    1.206    1.206 special.py:52(convert_gdx_to_np_svs)
  9000011    0.997    0.000    0.997    0.000 gdx.py:194(H)
       25    0.981    0.039    0.981    0.039 {method 'copy' of 'numpy.ndarray' objects}
        1    0.000    0.000    0.927    0.927 frame.py:7453(_to_arrays)
        1    0.000    0.000    0.927    0.927 frame.py:7547(_list_to_arrays)
  9000030    0.924    0.000    0.924    0.000 {built-in method builtins.hash}
  9000008    0.913    0.000    0.913    0.000 gdx.py:612(file)
  9000122    0.863    0.000    0.863    0.000 {method 'append' of 'list' objects}
        9    0.000    0.000    0.747    0.083 internals.py:774(copy)
       12    0.000    0.000    0.712    0.059 internals.py:3500(apply)
        5    0.000    0.000    0.711    0.142 internals.py:3895(copy)
        3    0.000    0.000    0.711    0.237 generic.py:5009(copy)
        1    0.000    0.000    0.474    0.474 frame.py:7604(_convert_object_array)
        1    0.000    0.000    0.474    0.474 frame.py:7621()
        3    0.000    0.000    0.474    0.158 frame.py:7615(convert)
        8    0.474    0.059    0.474    0.059 {pandas._libs.lib.maybe_convert_objects}
        1    0.453    0.453    0.453    0.453 {pandas._libs.lib.to_object_array}
        2    0.000    0.000    0.308    0.154 indexing.py:1463(__getitem__)
        2    0.000    0.000    0.308    0.154 indexing.py:2011(_getitem_tuple)
        2    0.000    0.000    0.308    0.154 indexing.py:2075(_getitem_axis)
        2    0.000    0.000    0.308    0.154 indexing.py:2040(_get_slice_axis)
        2    0.000    0.000    0.308    0.154 indexing.py:147(_slice)
        2    0.000    0.000    0.308    0.154 generic.py:2583(_slice)
        2    0.000    0.000    0.308    0.154 internals.py:3869(get_slice)
        2    0.000    0.000    0.307    0.154 internals.py:4431(_slice_take_blocks_ax0)
        2    0.000    0.000    0.307    0.154 internals.py:1237(take_nd)
        2    0.000    0.000    0.307    0.154 algorithms.py:1548(take_nd)


Process finished with exit code 0
```
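In the "before" profile for `roundtrip_many`, 2168 calls to `append`/`insert` lead to about 1.18 million calls to `_name_key`, which together account for roughly 23 of the 54.7 seconds. That ratio is what a per-insert scan over all existing entries would produce (about 1084 names, scanned pairwise, for two files). The sketch below illustrates that general pattern and one common remedy; it is not the gdx-pandas code or the change made in this PR, and `ScanningCollection`, `IndexedCollection`, and `count_comparisons` are hypothetical names introduced only for illustration.

```python
class ScanningCollection:
    """Hypothetical: checks for a duplicate name by scanning every stored item."""

    def __init__(self):
        self._items = []
        self.comparisons = 0

    def append(self, name):
        # One comparison per existing item, so n appends cost O(n^2) overall.
        for existing in self._items:
            self.comparisons += 1
            if existing == name:
                raise ValueError("duplicate name: " + name)
        self._items.append(name)


class IndexedCollection:
    """Hypothetical: keeps a set of names so each check is one hash lookup."""

    def __init__(self):
        self._items = []
        self._names = set()
        self.comparisons = 0

    def append(self, name):
        # A single set-membership test replaces the scan over all items.
        self.comparisons += 1
        if name in self._names:
            raise ValueError("duplicate name: " + name)
        self._items.append(name)
        self._names.add(name)


def count_comparisons(collection_cls, n):
    """Append n distinct names and report how many name checks were made."""
    collection = collection_cls()
    for i in range(n):
        collection.append("symbol_%d" % i)
    return collection.comparisons


if __name__ == "__main__":
    print(count_comparisons(ScanningCollection, 1084))  # 586986
    print(count_comparisons(IndexedCollection, 1084))   # 1084
```

With 1084 names the scanning version performs 1084 × 1083 / 2 = 586,986 comparisons per file, in line with the order of magnitude of the `_name_key` call count in the profile.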
Test time after 48.4 seconds

```
C:\Python36\python.exe "E:/Projects/gdx-pandas playground/speed.py"

*** PROFILER RESULTS ***
roundtrip_many (E:/Projects/gdx-pandas playground/speed.py:10)
function called 1 times

         33560833 function calls (33016183 primitive calls) in 28.190 seconds

   Ordered by: cumulative time, internal time, call count
   List reduced from 923 to 40 due to restriction <40>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000   28.205   28.205 speed.py:10(roundtrip_many)
    23851    0.123    0.000   17.015    0.001 frame.py:334(__init__)
     6507    0.031    0.000   16.293    0.003 gdx.py:641(dims)
     6507    0.118    0.000   16.150    0.002 frame.py:426(_init_dict)
     5423    0.046    0.000   15.854    0.003 gdx.py:737(_init_dataframe)
        1    0.000    0.000   14.471   14.471 write_gdx.py:143(to_gdx)
        1    0.000    0.000   14.470   14.470 write_gdx.py:94(save_gdx)
        1    0.000    0.000   13.734   13.734 read_gdx.py:105(to_dataframes)
        1    0.000    0.000   13.522   13.522 read_gdx.py:49(__init__)
        1    0.016    0.016   13.423   13.423 gdx.py:223(read)
     4336    0.090    0.000    9.570    0.002 gdx.py:671(dataframe)
     2171    0.025    0.000    9.484    0.004 gdx.py:454(__init__)
        1    0.005    0.005    7.273    7.273 gdx.py:266(write)
        2    0.002    0.001    7.197    3.598 write_gdx.py:86(gdx)
     1085    0.449    0.000    7.114    0.007 gdx.py:833(write)
     1084    0.012    0.000    7.106    0.007 write_gdx.py:99(__add_symbol_to_gdx)
     1084    0.021    0.000    6.913    0.006 gdx.py:799(load)
   133399    0.115    0.000    4.850    0.000 base.py:4914(_ensure_index)
     7591    0.023    0.000    4.680    0.001 frame.py:7349(_arrays_to_mgr)
     5423    0.022    0.000    4.427    0.001 indexing.py:182(__setitem__)
     1084    0.025    0.000    4.187    0.004 special.py:114(convert_np_to_gdx_svs)
40137/34713    0.264    0.000    3.934    0.000 series.py:166(__init__)
    16270    0.298    0.000    3.628    0.000 {pandas._libs.lib.clean_index_list}
     5423    0.017    0.000    3.302    0.001 indexing.py:152(_get_setitem_indexer)
     5423    0.045    0.000    3.265    0.001 indexing.py:1225(_convert_to_indexer)
65081/16271    0.350    0.000    2.972    0.000 :966(_find_and_load)
     7603    0.013    0.000    2.845    0.000 base.py:3071(get_loc)
     7603    0.036    0.000    2.832    0.000 {method 'get_loc' of 'pandas._libs.index.IndexEngine' objects}
     5423    0.008    0.000    2.796    0.001 base.py:52(__str__)
     5423    0.054    0.000    2.788    0.001 series.py:1221(__unicode__)
65081/16271    0.121    0.000    2.584    0.000 :936(_find_and_load_unlocked)
48811/16271    0.019    0.000    2.554    0.000 :211(_call_with_frames_removed)
48810/16270    0.045    0.000    2.548    0.000 {built-in method builtins.__import__}
     5423    0.029    0.000    2.398    0.000 series.py:1240(to_string)
     1084    0.005    0.000    2.288    0.002 frame.py:6016(applymap)
     1084    0.005    0.000    2.282    0.002 frame.py:5837(apply)
5939455/5938977    1.226    0.000    2.266    0.000 {built-in method builtins.isinstance}
     1084    0.002    0.000    2.265    0.002 apply.py:311(get_result)
     1084    0.007    0.000    2.262    0.002 apply.py:105(get_result)
     1084    0.025    0.000    2.248    0.002 apply.py:219(apply_standard)


*** PROFILER RESULTS ***
read_big (E:/Projects/gdx-pandas playground/speed.py:6)
function called 1 times

         27027934 function calls (27026156 primitive calls) in 20.224 seconds

   Ordered by: cumulative time, internal time, call count
   List reduced from 785 to 40 due to restriction <40>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000   21.506   21.506 speed.py:6(read_big)
        1    0.000    0.000   21.506   21.506 read_gdx.py:105(to_dataframes)
        1    0.000    0.000   21.260   21.260 read_gdx.py:49(__init__)
        1    0.439    0.439   21.166   21.166 gdx.py:223(read)
        1    0.167    0.167   20.717   20.717 gdx.py:799(load)
        1    8.028    8.028   17.441   17.441 gdx.py:821()
  9000001    2.512    0.000    8.132    0.000 gdx.py:815(reader)
  9000000    1.507    0.000    5.619    0.000 gdxcc.py:426(gdxDataReadStr)
  9000000    4.112    0.000    4.112    0.000 {built-in method _gdxcc.gdxDataReadStr}
        2    0.166    0.083    1.777    0.889 gdx.py:671(dataframe)
        1    0.410    0.410    1.331    1.331 special.py:52(convert_gdx_to_np_svs)
       13    0.000    0.000    1.230    0.095 frame.py:334(__init__)
       25    1.055    0.042    1.055    0.042 {method 'copy' of 'numpy.ndarray' objects}
        1    0.000    0.000    0.929    0.929 frame.py:7453(_to_arrays)
        1    0.000    0.000    0.929    0.929 frame.py:7547(_list_to_arrays)
        9    0.000    0.000    0.743    0.083 internals.py:774(copy)
       12    0.000    0.000    0.715    0.060 internals.py:3500(apply)
        5    0.000    0.000    0.714    0.143 internals.py:3895(copy)
        3    0.000    0.000    0.713    0.238 generic.py:5009(copy)
        1    0.468    0.468    0.468    0.468 {pandas._libs.lib.to_object_array}
        1    0.000    0.000    0.461    0.461 frame.py:7604(_convert_object_array)
        1    0.000    0.000    0.461    0.461 frame.py:7621()
        3    0.000    0.000    0.461    0.154 frame.py:7615(convert)
        8    0.461    0.058    0.461    0.058 {pandas._libs.lib.maybe_convert_objects}
        1    0.000    0.000    0.312    0.312 frame.py:6379(merge)
        1    0.000    0.000    0.312    0.312 merge.py:51(merge)
        1    0.000    0.000    0.312    0.312 merge.py:563(get_result)
        1    0.000    0.000    0.312    0.312 internals.py:5388(concatenate_block_managers)
        2    0.000    0.000    0.302    0.151 indexing.py:1463(__getitem__)
        2    0.000    0.000    0.302    0.151 indexing.py:2011(_getitem_tuple)
        2    0.000    0.000    0.302    0.151 indexing.py:2075(_getitem_axis)
        2    0.000    0.000    0.302    0.151 indexing.py:2040(_get_slice_axis)
        2    0.000    0.000    0.302    0.151 indexing.py:147(_slice)
        2    0.000    0.000    0.302    0.151 generic.py:2583(_slice)
        2    0.000    0.000    0.302    0.151 internals.py:3869(get_slice)
        2    0.000    0.000    0.301    0.151 internals.py:4431(_slice_take_blocks_ax0)
        2    0.000    0.000    0.301    0.151 internals.py:1237(take_nd)
        2    0.000    0.000    0.301    0.151 algorithms.py:1548(take_nd)
        6    0.000    0.000    0.287    0.048 frame.py:7349(_arrays_to_mgr)
        6    0.000    0.000    0.284    0.047 internals.py:4869(create_block_manager_from_arrays)


Process finished with exit code 0
```
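As a quick check on the headline numbers, the per-test speedups can be computed from the totals in the profiler headers ("in 54.715 seconds", "in 28.190 seconds", and so on). This is a small illustrative calculation; the dictionary and function names are made up for the example, and the inputs are the header totals rather than the slightly different `cumtime` values in the first table rows.

```python
# Header totals (seconds) copied from the two profiler reports.
before = {"roundtrip_many": 54.715, "read_big": 43.780}
after = {"roundtrip_many": 28.190, "read_big": 20.224}


def speedup(before_s, after_s):
    """Return how many times faster the 'after' run is."""
    return before_s / after_s


# Per-test ratios plus the ratio of the summed run times.
ratios = {name: speedup(before[name], after[name]) for name in before}
overall = speedup(sum(before.values()), sum(after.values()))

if __name__ == "__main__":
    for name, ratio in ratios.items():
        print("%s: %.2fx" % (name, ratio))
    print("overall: %.2fx" % overall)
```

Both tests come out at roughly a 2x improvement: about 1.94x for `roundtrip_many`, 2.16x for `read_big`, and 2.03x overall.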
elainethale commented 3 years ago

Thanks, @jebob. I probably can't find time to test and release today, but will try to in the next couple of weeks.