Open joakimbits opened 1 month ago
Tested on mac in local branch where pytest-profiling is added to requirements:
brew install graphviz
py.test tests --profile-svg && open prof/combined.svg
read_excel in get_smhi_data dominates the total run time.
According to https://hakibenita.com/fast-excel-python, Excel reads are about 10x faster with python-calamine.
After adding the python-calamine package to requirements, pandas can use it via pd.read_excel("path_to_file.xlsb", engine="calamine")
Switching read_excel to engine='calamine' in get_smhi_data cut the test suite run time by 6 seconds, from 15 to 9 seconds. It still dominates the test time, though.
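A minimal sketch of how the engine switch could be made optional, so the code still runs where python-calamine is not installed. The helper names pick_excel_engine and read_excel_fast are hypothetical and not from this repo, and as far as I know engine="calamine" needs pandas 2.2 or later:

```python
import importlib.util


def pick_excel_engine():
    """Return "calamine" when python-calamine is importable, else None.

    Passing engine=None lets pandas choose its default engine.
    """
    if importlib.util.find_spec("python_calamine") is not None:
        return "calamine"
    return None


def read_excel_fast(path, **kwargs):
    """Hypothetical wrapper: read an Excel file with the fastest available engine."""
    # pandas is imported lazily so pick_excel_engine() works without it.
    import pandas as pd

    return pd.read_excel(path, engine=pick_excel_engine(), **kwargs)
```

get_smhi_data could then call read_excel_fast(path) instead of pd.read_excel(path) directly.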
According to https://towardsdatascience.com/the-best-format-to-save-pandas-data-414dca023e0d we will get an order of magnitude faster loading time if we also cache the data. https://miro.medium.com/v2/resize:fit:720/format:webp/1*-QoJbusw3MUYdms0lbmd4Q.png
Decorated get_smhi_data with a file cache that reads from cache if it is from the current year. Now the whole suite completes in less than a second!
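The decorator itself is not shown in this thread, so here is only a rough sketch of what such a file cache might look like. The name yearly_file_cache, the pickle format, and the use of the file's modification time to decide "from the current year" are all assumptions:

```python
import datetime
import functools
import pickle
from pathlib import Path


def yearly_file_cache(cache_path):
    """Cache the result of a zero-argument function in a pickle file.

    The cached file is reused only if it was last modified during the
    current calendar year; otherwise the function runs again.
    """
    cache_path = Path(cache_path)

    def decorator(func):
        @functools.wraps(func)
        def wrapper():
            this_year = datetime.date.today().year
            if cache_path.exists():
                modified = datetime.date.fromtimestamp(cache_path.stat().st_mtime)
                if modified.year == this_year:
                    # Cache hit: skip the slow read entirely.
                    with cache_path.open("rb") as f:
                        return pickle.load(f)
            # Cache miss or stale cache: recompute and store the result.
            result = func()
            cache_path.parent.mkdir(parents=True, exist_ok=True)
            with cache_path.open("wb") as f:
                pickle.dump(result, f)
            return result

        return wrapper

    return decorator
```

Applied as @yearly_file_cache("some/cache/file.pkl") above a loader like get_smhi_data, the slow Excel read would then happen only on the first call each year. A loader that takes arguments would need the arguments folded into the cache file name.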
Actually just 0.2 seconds - so fast now that the profiler has nothing to show - just an empty .csv file.
Support pytest-profiling of data tests to find and fix performance bottlenecks
The data may need some flattening to numeric dtypes before processing.
Definition of Done: Can run pytest and plot a heat map of time spent in functions.
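For a quick check of a single function without the pytest plugin, the standard library profiler gives a text report of where the time goes. This is only a sketch, and the helper name profile_call is made up for illustration:

```python
import cProfile
import io
import pstats


def profile_call(func, *args, top=10, **kwargs):
    """Run func under cProfile.

    Returns (result, report), where report is a text table of the `top`
    entries sorted by cumulative time.
    """
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    stats.sort_stats("cumulative").print_stats(top)
    return result, stream.getvalue()
```

Something like data, report = profile_call(get_smhi_data) followed by print(report) should show whether read_excel is still at the top of the list.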
Not urgent, but a good introduction for Joppe on the data processing pipelines.
Contact Joppe for questions/discussions/suggestions
In addition to the Definition of Done, the following always apply: