assume-framework / assume

ASSUME - Agent-based Simulation for Studying and Understanding Market Evolution

Improve performance by switching from pandas to numpy #321

Open maurerle opened 4 months ago

During my recent studies I found ASSUME to be very slow when simulating a whole year. One way to improve this is to switch to daily instead of hourly market clearing, but it still takes a while. When looking into which code takes the most time, I found that pandas is often the cause:

Act 1: Profiling Benchmarks

For various reasons, cProfile does not give good timings when running async code. More accurate timings can be obtained with yappi (pip install yappi): run yappi -o "out.profile" and then visualize the profiling result with tuna (pip install tuna) via tuna out.profile. This produces visual charts like the ones shown below.

The results correspond to running assume -s example01a -c base. This run takes 88s on my laptop. Of that, probably ~20s are spent organizing asyncio internals, ~60s are spent in pandas, ~3s are spent on imports, and the rest goes to other things.
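yappi can also be driven from Python instead of the command line, which makes it easy to profile a single coroutine with wall-clock timings. The following is only a minimal sketch: profile_async and demo_workload are hypothetical names, and demo_workload stands in for the real simulation entry point, which is not shown in this issue.

```python
import asyncio


def profile_async(coro_factory, out_path="out.profile"):
    """Run one coroutine under yappi and save pstat-format stats for tuna."""
    import yappi  # third-party dependency: pip install yappi

    yappi.clear_stats()
    yappi.set_clock_type("wall")  # wall time also counts time spent awaiting
    yappi.start()
    try:
        result = asyncio.run(coro_factory())
    finally:
        yappi.stop()
    # The pstat format can be opened with: tuna out.profile
    yappi.get_func_stats().save(out_path, type="pstat")
    return result


async def demo_workload():
    # Placeholder for the simulation (assumption, not ASSUME code).
    await asyncio.sleep(0.01)
    return sum(range(1000))
```

Calling profile_async(demo_workload) writes out.profile to the current directory, which can then be opened with tuna as described above.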

calculate_bids boils down to time spent in pandas: [image]

Handling market_feedback spends a lot of time in pandas too; nearly all of the time in site-packages is spent in pandas: [image]

Writing outputs spends most of its time in pandas too: [image]

One cannot see that much in the pictures because of the long lines; I could not find a way to remove the absolute paths from them.

Act 2: Alternatives

So I thought about how to replace pandas. Our requirements include slicing, indexing by datetime, and holding multiple series. I experimented with modin and dask: modin could not be used as a drop-in replacement, and dask did not seem like a good solution either, since we spend a lot of time initializing dataframes rather than in the heavy lifting.

I ended up with good old numpy, which supports slicing. However, a numpy array can only hold values of a single type and is indexed by integer position, so a datetime index is not possible.

I thought about having a convenience wrapper - something like this:

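from datetime import datetime, timedelta

# Assumed example values; the real start and resolution would come from the simulation.
start = datetime(2019, 1, 1)
freq = timedelta(hours=1)

def idx_from_date(date):
    return (date - start) // freq

def numpy_dt_indexer(data, fr, to):
    # `to` is exclusive here, unlike pandas label slicing, which includes the end
    from_idx = idx_from_date(fr)
    to_idx = idx_from_date(to)
    return data[from_idx:to_idx]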
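The two helper functions above could be bundled into a small class so that the calling code keeps a pandas-like feel. This is only a sketch under assumptions: DatetimeArray is a hypothetical name, the data is assumed to be evenly spaced with a fixed freq, and nothing here is the actual ASSUME implementation.

```python
from datetime import datetime, timedelta

import numpy as np


class DatetimeArray:
    """Hypothetical wrapper: a numpy array addressed by datetime instead of position."""

    def __init__(self, start, freq, data):
        self.start = start  # datetime of the first value
        self.freq = freq    # fixed timedelta between consecutive values
        self.data = np.asarray(data, dtype=float)

    def _idx(self, date):
        # Same arithmetic as idx_from_date, plus a bounds check so that dates
        # before `start` do not silently wrap around as negative indices.
        idx = (date - self.start) // self.freq
        if not 0 <= idx <= len(self.data):
            raise IndexError(f"{date} is outside the covered time range")
        return idx

    def _resolve(self, key):
        # Translate a datetime or a slice of datetimes into positions.
        if isinstance(key, slice):
            lo = 0 if key.start is None else self._idx(key.start)
            hi = len(self.data) if key.stop is None else self._idx(key.stop)
            return slice(lo, hi)
        return self._idx(key)

    def __getitem__(self, key):
        return self.data[self._resolve(key)]

    def __setitem__(self, key, value):
        self.data[self._resolve(key)] = value


prices = DatetimeArray(datetime(2019, 1, 1), timedelta(hours=1), np.arange(48))
first_morning = prices[datetime(2019, 1, 1, 6):datetime(2019, 1, 1, 9)]  # hours 6, 7, 8
prices[datetime(2019, 1, 2):] = 0.0  # overwrite the whole second day
```

As with plain numpy slicing, the end of a slice is exclusive and the returned array is a view, so no data is copied on read.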

In the end, it turns out that switching to numpy is at least 40x faster than pandas. I really hope that this is also the case when switching the main parts of the simulation to it.
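A comparison of this kind can be reproduced with timeit. The sketch below is not the benchmark behind the number above; it only times one datetime-based slice on one year of hourly random data (all values and names are assumptions), and the measured ratio will depend on the machine and the pandas version.

```python
import timeit
from datetime import datetime, timedelta

import numpy as np
import pandas as pd

start = datetime(2019, 1, 1)
freq = timedelta(hours=1)

# One year of hourly values (random placeholder data).
values = np.random.default_rng(0).random(8760)
series = pd.Series(
    values, index=pd.date_range(start, periods=len(values), freq=freq)
)

fr = datetime(2019, 6, 1)
to = datetime(2019, 6, 2)


def pandas_slice():
    # Label slicing includes the end label, so stop one step earlier
    # to select the same 24 values as the numpy version.
    return series.loc[fr : to - freq].to_numpy()


def numpy_slice():
    # Integer positions computed from the datetimes; end is exclusive.
    return values[(fr - start) // freq : (to - start) // freq]


def compare(repeat=1000):
    """Return total seconds for `repeat` calls of each variant."""
    t_pandas = timeit.timeit(pandas_slice, number=repeat)
    t_numpy = timeit.timeit(numpy_slice, number=repeat)
    return t_pandas, t_numpy


if __name__ == "__main__":
    t_pandas, t_numpy = compare()
    print(f"pandas: {t_pandas:.4f}s  numpy: {t_numpy:.4f}s")
```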

Act 3: Implementation

TBD