Speed-up data processing

Klimatbyran / klimatkollen

https://www.klimatkollen.se

MIT License


Open joakimbits opened 2 months ago

joakimbits commented 2 months ago

Support pytest-profiling of data tests to find and fix performance bottlenecks

The data may need to be flattened to numeric dtypes before processing.

Definition of Done: Can run pytest and plot a heat map of time spent in functions.

Not urgent, but a good introduction for Joppe to the data processing pipelines.

Contact Joppe for questions/discussions/suggestions

In addition to the Definition of Done, the following always apply:

joakimbits commented 2 months ago

Tested on macOS in a local branch where pytest-profiling is added to the requirements:

brew install graphviz
py.test tests --profile-svg && open prof/combined.svg

[Image: prof/combined.svg, the combined profile graph]

joakimbits commented 2 months ago

The read_excel call in get_smhi_data dominates everything else in the profile.
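
A finding like this can also be cross-checked without the SVG, using the standard library profiler. The following is only a minimal sketch: slow_read, fast_step and pipeline are hypothetical stand-ins, not functions from this repository, and the sleep merely imitates an expensive file read.

```python
import cProfile
import io
import pstats
import time


def slow_read():
    """Hypothetical stand-in for an expensive call such as read_excel."""
    time.sleep(0.05)


def fast_step():
    """Hypothetical stand-in for a cheap processing step."""
    return sum(range(1000))


def pipeline():
    """Hypothetical pipeline: one slow read followed by one cheap step."""
    slow_read()
    fast_step()


def top_functions(func, limit=5):
    """Profile func and return a text report sorted by cumulative time."""
    profiler = cProfile.Profile()
    profiler.enable()
    func()
    profiler.disable()
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(limit)
    return buffer.getvalue()


report = top_functions(pipeline)
print(report)
```

Sorting by cumulative time puts the function that accounts for most of the run near the top, which is the same information the SVG graph shows visually.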

joakimbits commented 2 months ago

According to https://hakibenita.com/fast-excel-python, Excel reads can be about 10x faster with python-calamine.

joakimbits commented 2 months ago

After adding the python-calamine package to the requirements, pandas can use it as the read engine (pandas 2.2 or later): pd.read_excel("path_to_file.xlsb", engine="calamine")
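
If the package might be missing in some environments, the engine choice can be made conditional. This is a hedged sketch, not the repository's code: pick_excel_engine and read_excel_fast are hypothetical helper names, and it assumes a pandas version that accepts engine="calamine" whenever python-calamine is installed.

```python
import importlib.util


def pick_excel_engine():
    """Return "calamine" if python-calamine is importable, else None.

    None lets pandas pick its default engine for the file type.
    """
    if importlib.util.find_spec("python_calamine") is not None:
        return "calamine"
    return None


def read_excel_fast(path, **kwargs):
    """Read an Excel file with pandas, preferring the calamine engine."""
    # Imported lazily so pick_excel_engine works even without pandas installed.
    import pandas as pd

    return pd.read_excel(path, engine=pick_excel_engine(), **kwargs)
```

A call such as read_excel_fast("path_to_file.xlsb") then behaves like pd.read_excel, only with the faster engine when it is available.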

joakimbits commented 2 months ago

Switching read_excel to engine='calamine' in get_smhi_data cut the test suite run time by 6 seconds, from 15 to 9 seconds. But the read still dominates the test time.

joakimbits commented 2 months ago

Decorated get_smhi_data with a file cache that returns the cached result if the cache file is from the current year. Now the whole suite completes in less than a second!
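
The idea of such a cache can be sketched as follows. This is an illustration under assumptions, not the decorator from the branch: yearly_file_cache and get_data are hypothetical names, the cached function is assumed to take no arguments, the result is assumed to be picklable, and "from the current year" is interpreted as the cache file's modification year.

```python
import functools
import pickle
import tempfile
from datetime import datetime
from pathlib import Path


def yearly_file_cache(cache_dir):
    """Cache a no-argument function's result on disk for the current year."""
    cache_dir = Path(cache_dir)

    def decorator(func):
        cache_file = cache_dir / f"{func.__name__}.pkl"

        @functools.wraps(func)
        def wrapper():
            # Reuse the cache only if the file was written this calendar year.
            if cache_file.exists():
                modified = datetime.fromtimestamp(cache_file.stat().st_mtime)
                if modified.year == datetime.now().year:
                    with cache_file.open("rb") as fh:
                        return pickle.load(fh)
            # Cache miss or stale cache: compute and store the result.
            result = func()
            cache_dir.mkdir(parents=True, exist_ok=True)
            with cache_file.open("wb") as fh:
                pickle.dump(result, fh)
            return result

        return wrapper

    return decorator


calls = []  # records how often the expensive function really runs


@yearly_file_cache(tempfile.mkdtemp())
def get_data():
    """Hypothetical stand-in for an expensive loader such as get_smhi_data."""
    calls.append(1)
    return {"rows": [1, 2, 3]}


first = get_data()   # computes and writes the cache file
second = get_data()  # served from the cache file
```

Only the first call does the expensive work, which is why a test suite dominated by one slow read can drop to a fraction of a second. The trade-off is that the cached data goes stale until the year changes or the cache file is deleted.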

joakimbits commented 2 months ago

Actually just 0.2 seconds. The suite is now so fast that the profiler has nothing to show, just an empty .csv file.

joakimbits commented 1 month ago