audeering / audb

Manage audio and video databases
https://audeering.github.io/audb/
Other
23 stars 1 forks source link

Polars benchmarks methods #424

Open ChristianGeng opened 1 month ago

ChristianGeng commented 1 month ago

closes #385

Benchmarking Polars (methods)

The current merge requests monkeypatches the dependencies module and replaces it with a polars version. At the current stage I am unsure whether it makes sense to include it into the code base, or wether a ghist with the files doing such an evaluation would be the better change. This is due to the currently rather mixed results in the benchmarking table, and due to the fact that possibly we do not want to maintain a polars dependenies module given that there are future changes to the module.

Polars uses arrow memory automatically. Furthermore, a cast to type "Object" is impossible. Therefore the comparison is limited to a comparison between "Pandas/string" and Polars. Possibly one could compare against pyarrow if this is more meaningful, the polars results would stay identical.

It probably makes sense to alter the scope of the issue: while it is useful to do the method benchmarking using polars, for benchmarking file i/o it probably would make more sense to test whether lance is faster than other formats. Somehow this is a different question though

method pandas polars comment winner
Dependencies.call() 0.000 0.000 polars
Dependencies.contains(10000 files) 0.007 0.674 both unvectorized pandas
Dependencies.__get_item__(10000 files) 0.242 nan approx 14.14 pandas
Dependencies.len() 0.000 0.000 pandas
Dependencies.str() 0.004 0.003 polars
Dependencies._add_attachment() 0.060 0.946 pandas casting pandas
Dependencies._add_media(10000 files) 0.066 0.040 polars
Dependencies._add_meta() 0.183 1.245 pandas casting pandas
Dependencies._drop() 0.076 0.063 polars
Dependencies._remove() 0.067 0.058 polars
Dependencies._update_media() 0.082 0.055 polars
Dependencies._update_media_version(10000 files) 0.010 0.100 concat(faster than in place) pandas
Dependencies.archive(10000 files) 0.029 nan pandas
Dependencies.archive(10000 files) / vectorized 0.009 nan pandas
Dependencies.archives 0.143 0.213 pandas
Dependencies.attachment_ids 0.031 0.009 polars
Dependencies.attachments 0.027 0.011 polars
Dependencies.bit_depth(10000 files) 2.152 nan pandas
Dependencies.bit_depth(10000 files) / vectorized 0.003 0.010 pandas
Dependencies.channels(10000 files) 2.128 nan pandas
Dependencies.channels(10000 files) / vectorized 0.003 0.007 pandas
Dependencies.checksum(10000 files) 1.933 nan pandas
Dependencies.checksum(10000 files) / vectorized 0.003 0.007 pandas
Dependencies.duration(10000 files) 2.153 nan pandas
Dependencies.duration(10000 files) / vectorized 0.003 0.007 pandas
Dependencies.files 0.013 0.038 pandas
Dependencies.format(10000 files) 2.030 nan pandas
Dependencies.format(10000 files) / vectorized 0.003 0.012 pandas
Dependencies.media 0.117 0.042 polars
Dependencies.removed(10000 files) 2.200 nan pandas
Dependencies.removed(10000 files) / vectorized 0.003 0.010 pandas
Dependencies.removed_media 0.107 0.066 polars
Dependencies.sampling_rate(10000 files) 2.233 nan pandas
Dependencies.sampling_rate(10000 files) / vectorized 0.003 0.008 pandas
Dependencies.table_ids 0.035 0.013 polars
Dependencies.tables 0.024 0.008 polars
Dependencies.type(10000 files) 2.252 nan pandas
Dependencies.type(10000 files) / vectorized 0.004 0.008 pandas
Dependencies.version(10000 files) 2.024 nan pandas
Dependencies.version(10000 files) / vectorized 0.003 0.008 pandas

Comments:

Results and Interpretation:

in the current scenario I can see no benefit of replacing pandas with the polars dataframe engine. What would make sense however is to streamlne the code to include #407 and streamline the code such that all methods using Dependencies._column_loc would use vectorized code. I found a few instances where the current implementation iterates over files and calls Dependencies._column_loc on these single files.

method pandas
Dependencies.archive(10000 files) 0.029
Dependencies.bit_depth(10000 files) 2.152
Dependencies.channels(10000 files) 2.128
Dependencies.checksum(10000 files) 1.933
Dependencies.duration(10000 files) 2.153
Dependencies.format(10000 files) 2.030
Dependencies.removed(10000 files) 2.200
Dependencies.sampling_rate(10000 files) 2.233
Dependencies.type(10000 files) 2.252
Dependencies.version(10000 files) 2.024
Dependencies.archive(10000 files) / vectorized 0.009
Dependencies.bit_depth(10000 files) / vectorized 0.003
Dependencies.channels(10000 files) / vectorized 0.003
Dependencies.checksum(10000 files) / vectorized 0.003
Dependencies.duration(10000 files) / vectorized 0.003
Dependencies.format(10000 files) / vectorized 0.003
Dependencies.removed(10000 files) / vectorized 0.003
Dependencies.sampling_rate(10000 files) / vectorized 0.003
Dependencies.type(10000 files) / vectorized 0.004
Dependencies.version(10000 files) / vectorized 0.003

Further direction: in order to be on the safe side, one should probably extend the polars method benchmarking to include different data sizes: it might be that using polars shines more on larger datasets (given that it is advertized using better threading). So the questions would be: How would a fortunate setting of num_rows and n_files look like, witouth making it unrealistically big

I am tentatively requesting a review, despite the fact that I know that this might need to be changed and/or extended.

hagenw commented 1 month ago

The implementation with polars seem to face the same problem as I encountered when trying to use pyarrow.Table instead of pandas.DataFrame, see https://github.com/audeering/audb/pull/356. In general, performance is as good or better than with pandas.DataFrame, but not when we need to address single rows.

hagenw commented 1 month ago

Regarding lance, I created https://github.com/audeering/audb/pull/425 as a first try on benchmarks. But to me it looks like its not worth to continue into this direction for now. Reading from a LANCE file is faster only when we stay with lance.LanceDataset object. But when trying to work with it, I'm sure we will face similar problems regarding addressing single rows, we have seen in https://github.com/audeering/audb/pull/356 and we see here.

ChristianGeng commented 1 week ago

These two of your comments belong together. I will comment on all the changes in a separate thread summarizing all changes that I have made yesterday. The ghist of it is that the lack if speed for single elements has to do with the fact that only pandas has indices.

The implementation with polars seem to face the same problem as I encountered when trying to use pyarrow.Table instead of pandas.DataFrame, see #356. In general, performance is as good or better than with pandas.DataFrame, but not when we need to address single rows.

This implementation is very slow at the moment when requesting a single file. Is there maybe something similar to df.at with polars to speed this up?

ChristianGeng commented 1 week ago

Treatment Index variable

the previous previous version of this MR assumed that Dependencies._column_loc would operate in a vectorized fashion. However, instead of implmenenting, we decided to also rollback the type hints, meaning that essentially we will work with single element access.

The migration guide, or in this blogpost for a more detailed fashion discuss about polars not implementing indices. This in essence means that when doing random access of a single element, access cannot be fast per se as the whole data has to be searched: From my basic understanding I think that taking the value of a given index would be O(1), but finding the index of a given value would then be O(N). This SOV post recommended to maintain a dict. I do not know much about how dicts are implemented, I would have thought that they use red-black tree or b-trees, but on the internet they say that they are hash tables. So I am unsure whether this is the best implementation but I have used a normal python dict for now. Sorry for being lengthy, but this is also the reason why pyarrow random access fails.

I am currently maintaining the index as a variable Dependencies._idx and a private method Dependencis_update_idx. It actually contains sth. like {'file0.wav': 0, "file1.wav": 1}, then one can use and then use sth. like df.row(self._idx[file]) to locate elements. As many of the benchmark methods operate through Dependencies._column_loc, all of these are affected.

use pl.update often with "outer join"

When actually adding or changing data, I have refactored the slow methods using pl.update. This api is unstable though, and one would expect it to break at some time later.

For the methods where polars was fast in the first place I have not done so. So this is a little inconsistent.

__str__

__str__ had been slow with polars default settings. I have tweaked these to become slow, but tried to stay with the 15 lines of output that pandas uses.

Further comments

method pandas polars winner factor
Dependencies.call() 0.000 0.000 polars 2.667
Dependencies.contains(10000 files) 0.003 0.002 polars 2.005
Dependencies.__get_item__(10000 files) 0.648 0.013 polars 50.382
Dependencies.len() 0.000 0.000 pandas 1.300
Dependencies.str() 0.004 0.000 polars 24.677
Dependencies._add_attachment() 0.171 0.104 polars 1.645
Dependencies._add_media(10000 files) 0.073 0.008 polars 9.589
Dependencies._add_meta() 0.127 0.100 polars 1.260
Dependencies._drop() 0.118 0.021 polars 5.628
Dependencies._remove() 0.067 0.002 polars 39.324
Dependencies._update_media() 0.142 0.066 polars 2.148
Dependencies._update_media_version(10000 files) 0.021 0.016 polars 1.341
Dependencies.archive(10000 files) 0.045 0.014 polars 3.250
Dependencies.archives 0.145 0.151 pandas 1.045
Dependencies.attachment_ids 0.018 0.008 polars 2.375
Dependencies.attachments 0.017 0.008 polars 2.194
Dependencies.bit_depth(10000 files) 0.029 0.014 polars 2.031
Dependencies.channels(10000 files) 0.030 0.013 polars 2.224
Dependencies.checksum(10000 files) 0.030 0.014 polars 2.201
Dependencies.duration(10000 files) 0.028 0.014 polars 2.066
Dependencies.files 0.012 0.011 polars 1.040
Dependencies.format(10000 files) 0.033 0.014 polars 2.345
Dependencies.media 0.068 0.040 polars 1.702
Dependencies.removed(10000 files) 0.029 0.014 polars 2.118
Dependencies.removed_media 0.068 0.038 polars 1.809
Dependencies.sampling_rate(10000 files) 0.029 0.014 polars 2.102
Dependencies.table_ids 0.025 0.013 polars 1.927
Dependencies.tables 0.017 0.008 polars 2.166
Dependencies.type(10000 files) 0.028 0.014 polars 2.063
Dependencies.version(10000 files) 0.032 0.013 polars 2.372
hagenw commented 1 week ago

Great, thanks for your effort, now we can directly compare polars to our current solution. And it turns out that polars is indeed slightly faster (or much faster for Dependencies.__get_item__()).

There are a few points that need to be considered when switching to polars for handling dependencies:

I would propose, to not consider switching to polars for now, and first focus on a few other features. But it might indeed be a nice option to tackle at some point.

I think it would make sense to merge this into the main branch for documentation purposes. Before doing so, could you also please update the requirements.txt file in the benchmarks/ folder, adding everything we need to run your scripts, and add the results to benchmarks/README.md.

ChristianGeng commented 1 week ago

Great, thanks for your effort, now we can directly compare polars to our current solution. And it turns out that polars is indeed slightly faster (or much faster for Dependencies.__get_item__()).

There are a few points that need to be considered when switching to polars for handling dependencies:

  • it would add another dependency
  • we also need to see how performance of loading and saving parquet files is

I would propose, to not consider switching to polars for now, and first focus on a few other features. But it might indeed be a nice option to tackle at some point.

I also think that this is quite ambitious for now: it would necessitate to refactor all tests, so this is a larger decision. I have not tackled the loading and saving here: My understanding was that pyarrow will be used under the hood anyway, so I perceived the more interesting comparisons in this module. Should a follow up issue be created to achive this?

I think it would make sense to merge this into the main branch for documentation purposes. Before doing so, could you also please update the requirements.txt file in the benchmarks/ folder, adding everything we need to run your scripts, and add the results to benchmarks/README.md.

I have updated the requirements and the README. I also committed the script that I used to compare the analysis. In turn I have made the local utils.py obsolete.