During my recent studies I found ASSUME to be very slow when simulating a whole year.
One way to improve this is by switching towards daily market clearing instead of hourly, but it still takes a while.
When looking into which code takes the most time, I found pandas to often be the cause:
Act 1: Profiling Benchmarks
For various reasons, cProfile does not give good timings when running async code.
More accurate timings can be obtained with yappi (pip install yappi) - https://github.com/sumerc/yappi
So one can run yappi -o "out.profile" cli.py and then use tuna (pip install tuna) to visualize the profiling result:
tuna out.profile.
This gives visual charts like the ones shown below.
The results are therefore equivalent to running assume -s example01a -c base. This run takes 88s on my laptop.
Probably ~20s are spent organizing asyncio-stuff
~60s is spent in pandas
~3s on imports
the rest on other stuff
calculate_bids boils down to time spent in pandas
handling market_feedback spends a lot of time in pandas too - nearly all the time in site-packages is spent in pandas
writing outputs spends most of its time in pandas too
One cannot see that much in the charts because of the long lines, though - I could not find a way to remove the absolute paths from the pictures.
Act 2: Alternatives
So I thought about how one could replace pandas.
Our requirements include slicing, indexing by datetime and having multiple series.
I experimented with modin and dask.
I could not use modin as a drop-in replacement, and dask did not seem like a good solution either, as we spend a lot of time in the initialization of dataframes and not in the heavy lifting.
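To illustrate what "time in the initialization" means here, the following is a minimal sketch, not code taken from ASSUME. The 24-value series and the function names with_pandas and with_numpy are assumptions made up for this example. It builds a small container many times and does a trivial operation on it, so construction cost dominates. The absolute numbers will differ per machine and pandas version.

```python
import timeit
from datetime import datetime, timedelta

import numpy as np
import pandas as pd

# Hypothetical data: one day of hourly values, as a stand-in for a short forecast.
values = list(range(24))
index = pd.date_range(datetime(2019, 1, 1), periods=24, freq=timedelta(hours=1))


def with_pandas():
    # Constructing the Series (index validation, dtype inference) is the costly part.
    return pd.Series(values, index=index).sum()


def with_numpy():
    # Same values and same reduction, but without any index handling.
    return np.array(values).sum()


n = 1000
print("pandas:", timeit.timeit(with_pandas, number=n))
print("numpy: ", timeit.timeit(with_numpy, number=n))
```

Both functions return the same sum, so the difference in runtime comes from constructing the container and dispatching the call, not from the arithmetic itself.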
I came up with good old numpy, which supports slicing but can only hold a single dtype per array.
So a datetime index next to the values is not possible out of the box.
I thought about having a convenience wrapper - something like this:
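As a rough idea of what such a wrapper could look like, here is a minimal sketch under some assumptions: the index is a fixed-frequency datetime grid, the class name FastSeries and its methods are hypothetical, and slices include the end timestamp the way pandas label slicing does. It is not the implementation used in ASSUME.

```python
from datetime import datetime, timedelta

import numpy as np


class FastSeries:
    """Hypothetical numpy-backed series with a fixed-frequency datetime index."""

    def __init__(self, start: datetime, freq: timedelta, values):
        self.start = start
        self.freq = freq
        self.data = np.asarray(values, dtype=float)

    def _pos(self, when: datetime) -> int:
        # Translate a timestamp into an integer position on the grid.
        pos, remainder = divmod(when - self.start, self.freq)
        if remainder or not 0 <= pos < len(self.data):
            raise KeyError(when)
        return pos

    def _locate(self, key):
        # A datetime maps to one position, a slice of datetimes to a numpy slice.
        if isinstance(key, slice):
            lo = 0 if key.start is None else self._pos(key.start)
            hi = len(self.data) if key.stop is None else self._pos(key.stop) + 1
            return slice(lo, hi)
        return self._pos(key)

    def __getitem__(self, key):
        return self.data[self._locate(key)]

    def __setitem__(self, key, value):
        self.data[self._locate(key)] = value


start = datetime(2019, 1, 1)
prices = FastSeries(start, timedelta(hours=1), np.zeros(24))
prices[start + timedelta(hours=2)] = 42.0
prices[datetime(2019, 1, 1, 4):datetime(2019, 1, 1, 6)] = 10.0
window = prices[datetime(2019, 1, 1, 2):datetime(2019, 1, 1, 4)]
print(window)
```

Because the index is only a start time and a step, a lookup is plain integer arithmetic and no index object has to be built or aligned. Multiple series could share the same start and freq, for example as a dict of such objects, which is again only an assumption about how the requirement might be met.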
In the end, it turns out that switching to numpy is at least 40x faster than pandas.
I really hope that this also holds when switching the main parts of the simulation to it.
Act 3
Implementation TBD