NREL / gdx-pandas

Python interface to read and write GAMS GDX files using pandas.DataFrames as the intermediate data format.
BSD 3-Clause "New" or "Revised" License

Move specials to new module, speedup #64

Closed jebob closed 4 years ago

jebob commented 5 years ago

Move code handling special values into a separate file out of the GDX class. We now precalculate SPECIAL_VALUES and NUMPY_SPECIAL_VALUES (lists of special values), GDX_TO_NP_SVS and NP_TO_GDX_SVS (dictionaries which do conversions between the GDX world and the Python world and back).
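For readers unfamiliar with the idea, here is a minimal sketch of precomputed lookup tables of this kind. It is not the code from this PR: the table names are reused from the description above, but their contents, and the helper functions `gdx_to_np` and `np_to_gdx`, are hypothetical. The sentinel values 1e300 through 5e300 are assumed from the GAMS convention for undefined, NA, +inf, -inf and eps, and plain Python floats stand in for the NumPy values.

```python
import math
import sys

# Assumption: GAMS sentinel doubles for undefined, NA, +inf, -inf and eps.
SPECIAL_VALUES = [1e300, 2e300, 3e300, 4e300, 5e300]

# Assumption: Python-side counterparts, in the same order.
NUMPY_SPECIAL_VALUES = [None, math.nan, math.inf, -math.inf, sys.float_info.epsilon]

# Built once at import time rather than on every conversion call.
GDX_TO_NP_SVS = dict(zip(SPECIAL_VALUES, NUMPY_SPECIAL_VALUES))

# NaN never compares equal to itself and None is not a float, so the reverse
# table holds only the remaining values; those two are handled explicitly.
NP_TO_GDX_SVS = {
    np_value: gdx_value
    for gdx_value, np_value in GDX_TO_NP_SVS.items()
    if np_value is not None and not math.isnan(np_value)
}


def gdx_to_np(value):
    """Map a GDX sentinel to its Python-side value; pass other values through."""
    return GDX_TO_NP_SVS.get(value, value)


def np_to_gdx(value):
    """Map a Python-side special value to its GDX sentinel; pass others through."""
    if value is None:
        return SPECIAL_VALUES[0]
    if math.isnan(value):
        return SPECIAL_VALUES[1]
    return NP_TO_GDX_SVS.get(value, value)
```

With the tables built up front, each conversion is a single dictionary lookup instead of a chain of comparisons.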

I also tried to speed up the convert_np_to_gdx_svs/convert_gdx_to_np_svs functions; I will verify the effect shortly.

jebob commented 5 years ago

@elainethale I think we can remove is_np_sv, as it is no longer used anywhere. What do you think?

jebob commented 5 years ago
Test code:

```python
import gdxpds
import pandas as pd
import profilehooks

n = 100000


@profilehooks.profile
def write():
    df = pd.DataFrame({"A": list(range(n)), "value": list(range(n))})
    gdxpds.to_gdx({"df": df}, "test.gdx")


@profilehooks.profile
def read():
    df = gdxpds.to_dataframe("test.gdx", "df")


for i in range(10):
    write()
    read()
```
Original version, total round trip time 22.986 seconds:

```
*** PROFILER RESULTS ***
read
function called 10 times

         13234090 function calls (13229800 primitive calls) in 5.911 seconds

   Ordered by: cumulative time, internal time, call count
   List reduced from 749 to 40 due to restriction <40>

         ncalls  tottime  percall  cumtime  percall filename:lineno(function)
             10    0.000    0.000    5.911    0.591 speed.py:13(read)
             10    0.000    0.000    5.911    0.591 read_gdx.py:130(to_dataframe)
             10    0.029    0.003    5.624    0.562 read_gdx.py:97(dataframe)
             10    2.216    0.222    5.580    0.558 gdx.py:961(load)
        1000090    0.559    0.000    0.976    0.000 gdx.py:751(value_cols)
             10    0.014    0.001    0.935    0.094 gdx.py:86(convert_gdx_to_np_svs)
             10    0.001    0.000    0.861    0.086 frame.py:6016(applymap)
             10    0.000    0.000    0.860    0.086 frame.py:5837(apply)
             10    0.000    0.000    0.860    0.086 apply.py:311(get_result)
             10    0.000    0.000    0.860    0.086 apply.py:105(get_result)
             10    0.000    0.000    0.860    0.086 apply.py:219(apply_standard)
             20    0.020    0.001    0.836    0.042 frame.py:6067(infer)
             80    0.470    0.006    0.759    0.009 {pandas._libs.lib.map_infer}
        1000000    0.225    0.000    0.713    0.000 gdxcc.py:481(gdxDataReadStr)
        1000000    0.488    0.000    0.488    0.000 {built-in method _gdxcc.gdxDataReadStr}
             10    0.002    0.000    0.423    0.042 {pandas._libs.reduction.reduce}
             10    0.000    0.000    0.416    0.042 apply.py:253(apply_series_generator)
        1000090    0.224    0.000    0.317    0.000 enum.py:579(__hash__)
        2000000    0.289    0.000    0.289    0.000 gdx.py:108(to_np_svs)
            130    0.001    0.000    0.283    0.002 frame.py:334(__init__)
             10    0.000    0.000    0.240    0.024 read_gdx.py:49(__init__)
        2000340    0.196    0.000    0.196    0.000 gdx.py:687(data_type)
             10    0.000    0.000    0.192    0.019 gdx.py:390(read)
             20    0.011    0.001    0.191    0.010 gdx.py:835(dataframe)
             50    0.000    0.000    0.184    0.004 gdx.py:806(dims)
             50    0.007    0.000    0.184    0.004 gdx.py:902(_init_dataframe)
        1000000    0.172    0.000    0.172    0.000 gdx.py:980()
             60    0.001    0.000    0.161    0.003 frame.py:426(_init_dict)
             30    0.000    0.000    0.148    0.005 gdx.py:620(__init__)
        1000110    0.110    0.000    0.110    0.000 gdx.py:361(H)
            150    0.002    0.000    0.107    0.001 internals.py:3500(apply)
        1000090    0.102    0.000    0.102    0.000 gdx.py:782(file)
             10    0.000    0.000    0.100    0.010 frame.py:7453(_to_arrays)
             10    0.000    0.000    0.100    0.010 frame.py:7547(_list_to_arrays)
        1000330    0.093    0.000    0.093    0.000 {built-in method builtins.hash}
             10    0.000    0.000    0.086    0.009 gdxcc.py:589(gdxOpenRead)
             10    0.086    0.009    0.086    0.009 {built-in method _gdxcc.gdxOpenRead}
        1001030    0.070    0.000    0.070    0.000 {method 'append' of 'list' objects}
             50    0.000    0.000    0.062    0.001 _decorators.py:136(wrapper)
             50    0.000    0.000    0.062    0.001 generic.py:4890(astype)

*** PROFILER RESULTS ***
write
function called 10 times

         26192195 function calls (26186079 primitive calls) in 17.075 seconds

   Ordered by: cumulative time, internal time, call count
   List reduced from 843 to 40 due to restriction <40>

         ncalls  tottime  percall  cumtime  percall filename:lineno(function)
             10    0.043    0.004   17.075    1.708 speed.py:7(write)
             10    0.000    0.000   16.581    1.658 write_gdx.py:143(to_gdx)
             10    0.000    0.000   16.581    1.658 write_gdx.py:94(save_gdx)
             10    0.001    0.000   16.415    1.642 gdx.py:433(write)
             20    3.552    0.178   16.308    0.815 gdx.py:990(write)
             10    0.004    0.000    6.215    0.621 gdx.py:153(convert_np_to_gdx_svs)
             10    0.001    0.000    6.175    0.618 frame.py:6016(applymap)
             10    0.000    0.000    6.175    0.617 frame.py:5837(apply)
             10    0.000    0.000    6.174    0.617 apply.py:311(get_result)
             10    0.000    0.000    6.174    0.617 apply.py:105(get_result)
             10    0.000    0.000    6.174    0.617 apply.py:219(apply_standard)
             20    0.015    0.001    6.149    0.307 frame.py:6067(infer)
             60    0.923    0.015    6.082    0.101 {pandas._libs.lib.map_infer}
        2000000    1.913    0.000    5.159    0.000 gdx.py:175(to_gdx_svs)
        2000000    3.246    0.000    3.246    0.000 gdx.py:125(is_np_eps)
             10    0.000    0.000    3.084    0.308 apply.py:253(apply_series_generator)
             10    0.002    0.000    3.070    0.307 {pandas._libs.reduction.reduce}
1037147/1036114    0.446    0.000    1.253    0.000 {built-in method builtins.isinstance}
        2000230    0.842    0.000    1.208    0.000 gdx.py:827(num_dims)
        1000000    0.250    0.000    1.183    0.000 gdxcc.py:513(gdxDataWriteStr)
        1000090    0.657    0.000    1.163    0.000 gdx.py:751(value_cols)
        2000000    0.489    0.000    0.984    0.000 gdxcc.py:152(__setitem__)
        1000000    0.933    0.000    0.933    0.000 {built-in method _gdxcc.gdxDataWriteStr}
        1000720    0.468    0.000    0.802    0.000 abc.py:178(__instancecheck__)
            130    0.001    0.000    0.561    0.004 frame.py:334(__init__)
             50    0.003    0.000    0.559    0.011 frame.py:426(_init_dict)
        2000000    0.495    0.000    0.495    0.000 {built-in method _gdxcc.doubleArray___setitem__}
             50    0.000    0.000    0.474    0.009 frame.py:7349(_arrays_to_mgr)
            230    0.001    0.000    0.445    0.002 series.py:4019(_sanitize_array)
             50    0.000    0.000    0.440    0.009 frame.py:7644(_homogenize)
             20    0.027    0.001    0.434    0.022 cast.py:44(maybe_convert_platform)
        1000102    0.269    0.000    0.394    0.000 enum.py:579(__hash__)
             50    0.384    0.008    0.384    0.008 {pandas._libs.lib.maybe_convert_objects}
        1000000    0.373    0.000    0.373    0.000 gdx.py:1031()
        1001832    0.329    0.000    0.329    0.000 _weakrefset.py:70(__contains__)
        2000300    0.206    0.000    0.206    0.000 gdx.py:802(dims)
             20    0.000    0.000    0.166    0.008 write_gdx.py:86(gdx)
2011684/2009524    0.164    0.000    0.166    0.000 {built-in method builtins.len}
        1000110    0.130    0.000    0.130    0.000 gdx.py:361(H)
        1000362    0.124    0.000    0.124    0.000 {built-in method builtins.hash}
```
New version, total round trip time 20.696 seconds:

```
*** PROFILER RESULTS ***
read
function called 10 times

         11223430 function calls (11219280 primitive calls) in 5.140 seconds

   Ordered by: cumulative time, internal time, call count
   List reduced from 708 to 40 due to restriction <40>

         ncalls  tottime  percall  cumtime  percall filename:lineno(function)
             10    0.000    0.000    5.140    0.514 speed.py:13(read)
             10    0.000    0.000    5.140    0.514 read_gdx.py:130(to_dataframe)
             10    0.032    0.003    4.853    0.485 read_gdx.py:97(dataframe)
             10    2.232    0.223    4.804    0.480 gdx.py:780(load)
        1000090    0.589    0.000    1.027    0.000 gdx.py:570(value_cols)
        1000000    0.229    0.000    0.710    0.000 gdxcc.py:481(gdxDataReadStr)
        1000000    0.481    0.000    0.481    0.000 {built-in method _gdxcc.gdxDataReadStr}
        1000090    0.237    0.000    0.336    0.000 enum.py:579(__hash__)
            130    0.001    0.000    0.279    0.002 frame.py:334(__init__)
             10    0.000    0.000    0.245    0.024 read_gdx.py:49(__init__)
        2000340    0.200    0.000    0.200    0.000 gdx.py:506(data_type)
             20    0.010    0.001    0.194    0.010 gdx.py:654(dataframe)
             10    0.001    0.000    0.192    0.019 gdx.py:214(read)
             50    0.000    0.000    0.191    0.004 gdx.py:625(dims)
             50    0.007    0.000    0.190    0.004 gdx.py:721(_init_dataframe)
        1000000    0.161    0.000    0.161    0.000 gdx.py:799()
             50    0.001    0.000    0.158    0.003 frame.py:426(_init_dict)
             30    0.000    0.000    0.153    0.005 gdx.py:439(__init__)
        1000100    0.111    0.000    0.111    0.000 gdx.py:185(H)
        1000080    0.104    0.000    0.104    0.000 gdx.py:601(file)
        1000280    0.099    0.000    0.099    0.000 {built-in method builtins.hash}
             10    0.000    0.000    0.098    0.010 frame.py:7453(_to_arrays)
             10    0.000    0.000    0.098    0.010 frame.py:7547(_list_to_arrays)
             10    0.016    0.002    0.093    0.009 special.py:52(convert_gdx_to_np_svs)
             10    0.000    0.000    0.087    0.009 gdxcc.py:589(gdxOpenRead)
             10    0.087    0.009    0.087    0.009 {built-in method _gdxcc.gdxOpenRead}
        1001000    0.074    0.000    0.074    0.000 {method 'append' of 'list' objects}
            250    0.057    0.000    0.057    0.000 {method 'copy' of 'numpy.ndarray' objects}
            120    0.001    0.000    0.056    0.000 internals.py:3500(apply)
             60    0.000    0.000    0.056    0.001 frame.py:7349(_arrays_to_mgr)
             10    0.000    0.000    0.055    0.006 frame.py:7604(_convert_object_array)
             70    0.055    0.001    0.055    0.001 {pandas._libs.lib.maybe_convert_objects}
             10    0.000    0.000    0.055    0.005 frame.py:7621()
             20    0.000    0.000    0.055    0.003 frame.py:7615(convert)
             10    0.000    0.000    0.053    0.005 gdx.py:133(__init__)
             50    0.000    0.000    0.049    0.001 internals.py:3895(copy)
            950    0.001    0.000    0.048    0.000 base.py:4914(_ensure_index)
             30    0.000    0.000    0.047    0.002 generic.py:5009(copy)
             90    0.000    0.000    0.046    0.001 internals.py:774(copy)
             50    0.000    0.000    0.044    0.001 indexing.py:182(__setitem__)

*** PROFILER RESULTS ***
write
function called 10 times

         26196695 function calls (26190509 primitive calls) in 15.556 seconds

   Ordered by: cumulative time, internal time, call count
   List reduced from 847 to 40 due to restriction <40>

         ncalls  tottime  percall  cumtime  percall filename:lineno(function)
             10    0.049    0.005   15.556    1.556 speed.py:7(write)
             10    0.000    0.000   15.061    1.506 write_gdx.py:143(to_gdx)
             10    0.000    0.000   15.060    1.506 write_gdx.py:94(save_gdx)
             10    0.000    0.000   14.891    1.489 gdx.py:257(write)
             20    3.548    0.177   14.778    0.739 gdx.py:809(write)
             10    0.003    0.000    4.675    0.467 special.py:114(convert_np_to_gdx_svs)
             10    0.001    0.000    4.624    0.462 frame.py:6016(applymap)
             10    0.000    0.000    4.624    0.462 frame.py:5837(apply)
             10    0.000    0.000    4.623    0.462 apply.py:311(get_result)
             10    0.000    0.000    4.623    0.462 apply.py:105(get_result)
             10    0.000    0.000    4.622    0.462 apply.py:219(apply_standard)
             20    0.015    0.001    4.597    0.230 frame.py:6067(infer)
             60    0.863    0.014    4.528    0.075 {pandas._libs.lib.map_infer}
        2000000    0.388    0.000    3.665    0.000 special.py:134(convert_approx_eps)
        2000000    3.278    0.000    3.278    0.000 special.py:84(is_np_eps)
             10    0.004    0.000    2.326    0.233 {pandas._libs.reduction.reduce}
             10    0.000    0.000    2.277    0.228 apply.py:253(apply_series_generator)
        2000230    0.870    0.000    1.254    0.000 gdx.py:646(num_dims)
1038727/1037694    0.442    0.000    1.209    0.000 {built-in method builtins.isinstance}
        1000090    0.662    0.000    1.178    0.000 gdx.py:570(value_cols)
        1000000    0.269    0.000    1.157    0.000 gdxcc.py:513(gdxDataWriteStr)
        2000000    0.510    0.000    1.007    0.000 gdxcc.py:152(__setitem__)
        1000000    0.888    0.000    0.888    0.000 {built-in method _gdxcc.gdxDataWriteStr}
        1000730    0.439    0.000    0.763    0.000 abc.py:178(__instancecheck__)
            130    0.001    0.000    0.560    0.004 frame.py:334(__init__)
             50    0.003    0.000    0.558    0.011 frame.py:426(_init_dict)
        2000000    0.497    0.000    0.497    0.000 {built-in method _gdxcc.doubleArray___setitem__}
             50    0.000    0.000    0.472    0.009 frame.py:7349(_arrays_to_mgr)
            230    0.001    0.000    0.439    0.002 series.py:4019(_sanitize_array)
             50    0.000    0.000    0.435    0.009 frame.py:7644(_homogenize)
             20    0.027    0.001    0.428    0.021 cast.py:44(maybe_convert_platform)
        1000102    0.264    0.000    0.389    0.000 enum.py:579(__hash__)
             50    0.376    0.008    0.376    0.008 {pandas._libs.lib.maybe_convert_objects}
        1000000    0.367    0.000    0.367    0.000 gdx.py:850()
        1001832    0.320    0.000    0.320    0.000 _weakrefset.py:70(__contains__)
        2000300    0.225    0.000    0.225    0.000 gdx.py:621(dims)
             20    0.000    0.000    0.170    0.008 write_gdx.py:86(gdx)
2011794/2009584    0.163    0.000    0.165    0.000 {built-in method builtins.len}
        1000090    0.129    0.000    0.129    0.000 gdx.py:185(H)
        1000260    0.127    0.000    0.127    0.000 gdx.py:506(data_type)
```

convert_gdx_to_np_svs is about 10x faster (0.935 s down to 0.093 s cumulative over 10 calls); convert_np_to_gdx_svs is about 25% faster (6.215 s down to 4.675 s).
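As a quick check of these figures, the ratios can be recomputed from the cumulative times in the profiler output above. The snippet is only arithmetic on those numbers; the `summarize` helper is a hypothetical name introduced here.

```python
# Cumulative times in seconds over 10 calls, copied from the profiler output:
# (original version, new version).
timings = {
    "convert_gdx_to_np_svs": (0.935, 0.093),
    "convert_np_to_gdx_svs": (6.215, 4.675),
    "read": (5.911, 5.140),
    "write": (17.075, 15.556),
}


def summarize(old, new):
    """Return (speedup factor, percent of time saved) for one pair of timings."""
    return old / new, 100.0 * (old - new) / old


for name, (old, new) in timings.items():
    speedup, saved = summarize(old, new)
    print(f"{name}: {speedup:.2f}x faster, {saved:.1f}% less time")
```

The read and write rows show that most of the round-trip time is still spent outside the special-value conversion.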