Klimatbyran / klimatkollen

https://www.klimatkollen.se
MIT License

Speed-up data processing #682

Open joakimbits opened 1 month ago

joakimbits commented 1 month ago

Support pytest-profiling of data tests to find and fix performance bottlenecks

The data may need some flattening to numeric dtypes before processing.

Definition of Done: Can run pytest and plot a heat map of time spent in functions.

Not urgent, but a good introduction for Joppe on the data processing pipelines.

Contact Joppe for questions/discussions/suggestions

In addition to the Definition of Done, the following always apply:

joakimbits commented 1 month ago

Tested on macOS in a local branch where pytest-profiling is added to requirements:

brew install graphviz
py.test tests --profile-svg && open prof/combined.svg

[Image: combined profile graph, prof/combined.svg]
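
For anyone who wants the same kind of numbers without Graphviz, here is a minimal sketch using the standard-library cProfile and pstats modules, which as far as I know are what pytest-profiling builds on. The functions slow_load, process and pipeline are hypothetical stand-ins, not code from this repository.

```python
import cProfile
import io
import pstats
import time


def slow_load():
    """Hypothetical stand-in for an expensive loader such as an Excel read."""
    time.sleep(0.05)
    return list(range(1000))


def process(rows):
    """Hypothetical stand-in for the cheap processing step."""
    return sum(rows)


def pipeline():
    """Load the data, then process it."""
    return process(slow_load())


def profile_call(func, *args):
    """Run func under cProfile and return (result, text report)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    # Sort by cumulative time so the dominating call chain comes first.
    stats.sort_stats("cumulative").print_stats(5)
    return result, stream.getvalue()


result, report = profile_call(pipeline)
print(report)
```

The report lists the calls with the highest cumulative time first, so a loader that dominates the run shows up near the top.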

joakimbits commented 1 month ago

The read_excel call in get_smhi_data dominates everything else in the profile.

joakimbits commented 1 month ago

According to https://hakibenita.com/fast-excel-python, Excel reads are about 10x faster with python-calamine.

joakimbits commented 1 month ago

After adding the python-calamine package to requirements, we can use it from pandas: pd.read_excel("path_to_file.xlsb", engine="calamine")

joakimbits commented 1 month ago

Changing the read_excel engine to 'calamine' in get_smhi_data cut the test suite run time by 6 seconds, from 15 to 9 seconds. But the Excel read still dominates the test time.
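
A rough sketch of how the engine switch could be made optional, so the code still works where python-calamine is not installed. The helper names pick_excel_engine and read_smhi_excel are hypothetical and not taken from the repository. It assumes a pandas version that accepts engine="calamine" (2.2 or later).

```python
import importlib.util


def pick_excel_engine():
    """Return "calamine" when python-calamine is installed, else None.

    Passing engine=None lets pandas choose its default reader.
    """
    if importlib.util.find_spec("python_calamine") is not None:
        return "calamine"
    return None


def read_smhi_excel(path):
    """Hypothetical helper: read an Excel file with the fastest available engine."""
    import pandas as pd  # imported lazily so the sketch loads without pandas

    return pd.read_excel(path, engine=pick_excel_engine())
```

Falling back to the default engine keeps the results the same and only costs the speed-up.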

joakimbits commented 1 month ago

According to https://towardsdatascience.com/the-best-format-to-save-pandas-data-414dca023e0d we will get an order of magnitude faster loading time if we also cache it. https://miro.medium.com/v2/resize:fit:720/format:webp/1*-QoJbusw3MUYdms0lbmd4Q.png

joakimbits commented 1 month ago

Decorated get_smhi_data with a file cache that reads from cache if it is from the current year. Now the whole suite completes in less than a second!
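
A minimal sketch of what such a decorator could look like, not the code from the branch. It assumes "from the current year" means the cache file was last modified in the current calendar year, and it uses pickle as the storage format. The names yearly_file_cache and load_data are hypothetical.

```python
import datetime
import functools
import pickle
import tempfile
from pathlib import Path


def yearly_file_cache(cache_path):
    """Cache a function's result on disk and reuse it within the same calendar year."""
    cache_path = Path(cache_path)

    def decorator(func):
        @functools.wraps(func)
        def wrapper():
            if cache_path.exists():
                # Assumption: a cache file modified this year counts as fresh.
                modified = datetime.date.fromtimestamp(cache_path.stat().st_mtime)
                if modified.year == datetime.date.today().year:
                    with cache_path.open("rb") as handle:
                        return pickle.load(handle)
            # Cache missing or stale: recompute and store the result.
            result = func()
            with cache_path.open("wb") as handle:
                pickle.dump(result, handle)
            return result

        return wrapper

    return decorator


calls = []
cache_file = Path(tempfile.mkdtemp()) / "data_cache.pkl"


@yearly_file_cache(cache_file)
def load_data():
    """Hypothetical expensive loader; records each real invocation."""
    calls.append(1)
    return {"Ale": 1.23}


first = load_data()   # computes the result and writes the cache file
second = load_data()  # served from the cache file
```

Only the first call runs the loader. The second call reads the pickled result from disk.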

joakimbits commented 1 month ago

Actually just 0.2 seconds - so fast now that the profiler has nothing to show - just an empty .csv file.

joakimbits commented 1 month ago

https://github.com/Klimatbyran/klimatkollen/pull/683