VorTECHsa / python-sdk

Vortexa DataScience SDK
https://vortechsa.github.io/python-sdk/
Apache License 2.0

Improve performance for TimeSeriesResult.to_df #497

Closed joloppo closed 5 months ago

joloppo commented 5 months ago

This still does not solve the thread pool being slow for low-dimensionality data, i.e. only 2 columns and fewer than 100 or so records.

However, it speeds up one part of the code that was very slow, and disables the thread pool if only 1 core is available.
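As a rough illustration of the single-core fallback described above, here is a minimal sketch. It is not the SDK's actual code: `map_maybe_parallel` and its `cores` parameter are hypothetical names, and it assumes the work can be expressed as a function mapped over a list of items.

```python
import os
from multiprocessing.pool import ThreadPool


def map_maybe_parallel(func, items, cores=None):
    """Apply func to every item, using a thread pool only when it can help.

    Hypothetical helper for illustration; not part of the SDK.
    """
    items = list(items)
    # os.cpu_count() may return None, so fall back to 1 in that case.
    available = cores if cores is not None else (os.cpu_count() or 1)

    # With a single core (or at most one item) the pool only adds overhead,
    # so run sequentially instead.
    if available <= 1 or len(items) <= 1:
        return [func(item) for item in items]

    # Otherwise spread the work across a pool, capped at one worker per item.
    with ThreadPool(processes=min(available, len(items))) as pool:
        return pool.map(func, items)
```

Both paths return results in input order, so callers should not need to care which one ran.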

RELATED TICKETS

https://vortexa.atlassian.net/browse/RND-7233

CHANGELOG

TESTS

Perf

Perf testing results for thousands of rows & columns:

1 year's worth of data for this query:
from datetime import datetime

from vortexasdk import CargoTimeSeries

# One year of daily data, split by origin terminal
CargoTimeSeries().search(
    timeseries_frequency='day',
    timeseries_property='origin_terminal',
    timeseries_activity='loading_end',
    filter_activity='oil_on_water_state',
    filter_time_min=datetime(2021, 1, 1),
    filter_time_max=datetime(2021, 12, 31),
)

Before the fix:
127.51475930213928 seconds -> mp pool
175 seconds -> no mp pool

pool with fix 
(44543, 2571)
Time taken to convert to DataFrame: 38.43424606323242 seconds

nopool with fix
(44543, 2571)
Time taken to convert to DataFrame: 86.82875204086304 seconds
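
For anyone wanting to reproduce timings like these, a minimal sketch of a wall-clock measurement follows. `time_call` is a hypothetical helper, not the script used for the numbers above, and the commented usage assumes `result` is the object returned by the search.

```python
import time


def time_call(func, *args, **kwargs):
    """Run func once and return (result, elapsed_seconds).

    Hypothetical helper for illustration only.
    """
    start = time.perf_counter()  # monotonic clock suited to measuring intervals
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


# Assumed usage, with `result` being the search result:
# df, seconds = time_call(result.to_df)
# print(df.shape)
# print(f"Time taken to convert to DataFrame: {seconds} seconds")
```

A single run like this is noisy, so repeating the measurement a few times would give a more reliable comparison.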