ChristianGeng commented 1 month ago

closes #385

Benchmarking Polars (methods)

The current merge requests monkeypatches the dependencies module and replaces it with a polars version. At the current stage I am unsure whether it makes sense to include it into the code base, or wether a ghist with the files doing such an evaluation would be the better change. This is due to the currently rather mixed results in the benchmarking table, and due to the fact that possibly we do not want to maintain a polars dependenies module given that there are future changes to the module.

Polars uses arrow memory automatically. Furthermore, a cast to type "Object" is impossible. Therefore the comparison is limited to a comparison between "Pandas/string" and Polars. Possibly one could compare against pyarrow if this is more meaningful, the polars results would stay identical.

It probably makes sense to alter the scope of the issue: while it is useful to do the method benchmarking using polars, for benchmarking file i/o it probably would make more sense to test whether lance is faster than other formats. Somehow this is a different question though

method	pandas	polars	comment	winner
Dependencies.call()	0.000	0.000		polars
Dependencies.contains(10000 files)	0.007	0.674	both unvectorized	pandas
Dependencies.__get_item__(10000 files)	0.242	nan	approx 14.14	pandas
Dependencies.len()	0.000	0.000		pandas
Dependencies.str()	0.004	0.003		polars
Dependencies._add_attachment()	0.060	0.946	pandas casting	pandas
Dependencies._add_media(10000 files)	0.066	0.040		polars
Dependencies._add_meta()	0.183	1.245	pandas casting	pandas
Dependencies._drop()	0.076	0.063		polars
Dependencies._remove()	0.067	0.058		polars
Dependencies._update_media()	0.082	0.055		polars
Dependencies._update_media_version(10000 files)	0.010	0.100	concat(faster than in place)	pandas
Dependencies.archive(10000 files)	0.029	nan		pandas
Dependencies.archive(10000 files) / vectorized	0.009	nan		pandas
Dependencies.archives	0.143	0.213		pandas
Dependencies.attachment_ids	0.031	0.009		polars
Dependencies.attachments	0.027	0.011		polars
Dependencies.bit_depth(10000 files)	2.152	nan		pandas
Dependencies.bit_depth(10000 files) / vectorized	0.003	0.010		pandas
Dependencies.channels(10000 files)	2.128	nan		pandas
Dependencies.channels(10000 files) / vectorized	0.003	0.007		pandas
Dependencies.checksum(10000 files)	1.933	nan		pandas
Dependencies.checksum(10000 files) / vectorized	0.003	0.007		pandas
Dependencies.duration(10000 files)	2.153	nan		pandas
Dependencies.duration(10000 files) / vectorized	0.003	0.007		pandas
Dependencies.files	0.013	0.038		pandas
Dependencies.format(10000 files)	2.030	nan		pandas
Dependencies.format(10000 files) / vectorized	0.003	0.012		pandas
Dependencies.media	0.117	0.042		polars
Dependencies.removed(10000 files)	2.200	nan		pandas
Dependencies.removed(10000 files) / vectorized	0.003	0.010		pandas
Dependencies.removed_media	0.107	0.066		polars
Dependencies.sampling_rate(10000 files)	2.233	nan		pandas
Dependencies.sampling_rate(10000 files) / vectorized	0.003	0.008		pandas
Dependencies.table_ids	0.035	0.013		polars
Dependencies.tables	0.024	0.008		polars
Dependencies.type(10000 files)	2.252	nan		pandas
Dependencies.type(10000 files) / vectorized	0.004	0.008		pandas
Dependencies.version(10000 files)	2.024	nan		pandas
Dependencies.version(10000 files) / vectorized	0.003	0.008		pandas

Comments:

"pandas casting" means that I have not been bothered with improving the implementation a lot. So polars convers df from pandas and back on return
concat(faster than in place):

An inplace version of this particular function would look like that
```
self._df.with_columns(
      pl.when(pl.col(self.index_col).is_in(files)).then(
          pl.col(field).str.replace(".*",
                                    version)).otherwise(pl.col(field)))
```
Interestingly it was slower than the concat version that assigns new memory for the returned table.
there are a few methods that use concat. Note that this for polars alters the sort order of the dataframe. As access is always per (pandas) index this should not matter, does it?
approx 14.14: is not run as the polars version is extremely slow for now.
both unvectorized: slow for polars, but an improvement is probably possible.

Results and Interpretation:

in the current scenario I can see no benefit of replacing pandas with the polars dataframe engine. What would make sense however is to streamlne the code to include #407 and streamline the code such that all methods using Dependencies._column_loc would use vectorized code. I found a few instances where the current implementation iterates over files and calls Dependencies._column_loc on these single files.

method	pandas
Dependencies.archive(10000 files)	0.029
Dependencies.bit_depth(10000 files)	2.152
Dependencies.channels(10000 files)	2.128
Dependencies.checksum(10000 files)	1.933
Dependencies.duration(10000 files)	2.153
Dependencies.format(10000 files)	2.030
Dependencies.removed(10000 files)	2.200
Dependencies.sampling_rate(10000 files)	2.233
Dependencies.type(10000 files)	2.252
Dependencies.version(10000 files)	2.024
Dependencies.archive(10000 files) / vectorized	0.009
Dependencies.bit_depth(10000 files) / vectorized	0.003
Dependencies.channels(10000 files) / vectorized	0.003
Dependencies.checksum(10000 files) / vectorized	0.003
Dependencies.duration(10000 files) / vectorized	0.003
Dependencies.format(10000 files) / vectorized	0.003
Dependencies.removed(10000 files) / vectorized	0.003
Dependencies.sampling_rate(10000 files) / vectorized	0.003
Dependencies.type(10000 files) / vectorized	0.004
Dependencies.version(10000 files) / vectorized	0.003

Further direction: in order to be on the safe side, one should probably extend the polars method benchmarking to include different data sizes: it might be that using polars shines more on larger datasets (given that it is advertized using better threading). So the questions would be: How would a fortunate setting of num_rows and n_files look like, witouth making it unrealistically big

I am tentatively requesting a review, despite the fact that I know that this might need to be changed and/or extended.

hagenw commented 1 month ago

The implementation with polars seem to face the same problem as I encountered when trying to use pyarrow.Table instead of pandas.DataFrame, see https://github.com/audeering/audb/pull/356. In general, performance is as good or better than with pandas.DataFrame, but not when we need to address single rows.

hagenw commented 1 month ago

Regarding lance, I created https://github.com/audeering/audb/pull/425 as a first try on benchmarks. But to me it looks like its not worth to continue into this direction for now. Reading from a LANCE file is faster only when we stay with lance.LanceDataset object. But when trying to work with it, I'm sure we will face similar problems regarding addressing single rows, we have seen in https://github.com/audeering/audb/pull/356 and we see here.

ChristianGeng commented 1 week ago

These two of your comments belong together. I will comment on all the changes in a separate thread summarizing all changes that I have made yesterday. The ghist of it is that the lack if speed for single elements has to do with the fact that only pandas has indices.

The implementation with polars seem to face the same problem as I encountered when trying to use pyarrow.Table instead of pandas.DataFrame, see #356. In general, performance is as good or better than with pandas.DataFrame, but not when we need to address single rows.

This implementation is very slow at the moment when requesting a single file. Is there maybe something similar to df.at with polars to speed this up?

ChristianGeng commented 1 week ago

Treatment Index variable

the previous previous version of this MR assumed that Dependencies._column_loc would operate in a vectorized fashion. However, instead of implmenenting, we decided to also rollback the type hints, meaning that essentially we will work with single element access.

The migration guide, or in this blogpost for a more detailed fashion discuss about polars not implementing indices. This in essence means that when doing random access of a single element, access cannot be fast per se as the whole data has to be searched: From my basic understanding I think that taking the value of a given index would be O(1), but finding the index of a given value would then be O(N). This SOV post recommended to maintain a dict. I do not know much about how dicts are implemented, I would have thought that they use red-black tree or b-trees, but on the internet they say that they are hash tables. So I am unsure whether this is the best implementation but I have used a normal python dict for now. Sorry for being lengthy, but this is also the reason why pyarrow random access fails.

I am currently maintaining the index as a variable Dependencies._idx and a private method Dependencis_update_idx. It actually contains sth. like {'file0.wav': 0, "file1.wav": 1}, then one can use and then use sth. like df.row(self._idx[file]) to locate elements. As many of the benchmark methods operate through Dependencies._column_loc, all of these are affected.

use `pl.update` often with "outer join"

When actually adding or changing data, I have refactored the slow methods using pl.update. This api is unstable though, and one would expect it to break at some time later.

For the methods where polars was fast in the first place I have not done so. So this is a little inconsistent.

`str`

__str__ had been slow with polars default settings. I have tweaked these to become slow, but tried to stay with the 15 lines of output that pandas uses.

Further comments

Dependencies.load currently only implemented for parquet file
test data are not created

method	pandas	polars	winner	factor
Dependencies.call()	0.000	0.000	polars	2.667
Dependencies.contains(10000 files)	0.003	0.002	polars	2.005
Dependencies.__get_item__(10000 files)	0.648	0.013	polars	50.382
Dependencies.len()	0.000	0.000	pandas	1.300
Dependencies.str()	0.004	0.000	polars	24.677
Dependencies._add_attachment()	0.171	0.104	polars	1.645
Dependencies._add_media(10000 files)	0.073	0.008	polars	9.589
Dependencies._add_meta()	0.127	0.100	polars	1.260
Dependencies._drop()	0.118	0.021	polars	5.628
Dependencies._remove()	0.067	0.002	polars	39.324
Dependencies._update_media()	0.142	0.066	polars	2.148
Dependencies._update_media_version(10000 files)	0.021	0.016	polars	1.341
Dependencies.archive(10000 files)	0.045	0.014	polars	3.250
Dependencies.archives	0.145	0.151	pandas	1.045
Dependencies.attachment_ids	0.018	0.008	polars	2.375
Dependencies.attachments	0.017	0.008	polars	2.194
Dependencies.bit_depth(10000 files)	0.029	0.014	polars	2.031
Dependencies.channels(10000 files)	0.030	0.013	polars	2.224
Dependencies.checksum(10000 files)	0.030	0.014	polars	2.201
Dependencies.duration(10000 files)	0.028	0.014	polars	2.066
Dependencies.files	0.012	0.011	polars	1.040
Dependencies.format(10000 files)	0.033	0.014	polars	2.345
Dependencies.media	0.068	0.040	polars	1.702
Dependencies.removed(10000 files)	0.029	0.014	polars	2.118
Dependencies.removed_media	0.068	0.038	polars	1.809
Dependencies.sampling_rate(10000 files)	0.029	0.014	polars	2.102
Dependencies.table_ids	0.025	0.013	polars	1.927
Dependencies.tables	0.017	0.008	polars	2.166
Dependencies.type(10000 files)	0.028	0.014	polars	2.063
Dependencies.version(10000 files)	0.032	0.013	polars	2.372

hagenw commented 1 week ago

Great, thanks for your effort, now we can directly compare polars to our current solution. And it turns out that polars is indeed slightly faster (or much faster for Dependencies.__get_item__()).

There are a few points that need to be considered when switching to polars for handling dependencies:

it would add another dependency
we also need to see how performance of loading and saving parquet files is

I would propose, to not consider switching to polars for now, and first focus on a few other features. But it might indeed be a nice option to tackle at some point.

I think it would make sense to merge this into the main branch for documentation purposes. Before doing so, could you also please update the requirements.txt file in the benchmarks/ folder, adding everything we need to run your scripts, and add the results to benchmarks/README.md.

ChristianGeng commented 1 week ago

Great, thanks for your effort, now we can directly compare polars to our current solution. And it turns out that polars is indeed slightly faster (or much faster for Dependencies.__get_item__()).

There are a few points that need to be considered when switching to polars for handling dependencies:

it would add another dependency

we also need to see how performance of loading and saving parquet files is

I would propose, to not consider switching to polars for now, and first focus on a few other features. But it might indeed be a nice option to tackle at some point.

I also think that this is quite ambitious for now: it would necessitate to refactor all tests, so this is a larger decision. I have not tackled the loading and saving here: My understanding was that pyarrow will be used under the hood anyway, so I perceived the more interesting comparisons in this module. Should a follow up issue be created to achive this?

I think it would make sense to merge this into the main branch for documentation purposes. Before doing so, could you also please update the requirements.txt file in the benchmarks/ folder, adding everything we need to run your scripts, and add the results to benchmarks/README.md.

I have updated the requirements and the README. I also committed the script that I used to compare the analysis. In turn I have made the local utils.py obsolete.

audeering / audb

Polars benchmarks methods #424

Benchmarking Polars (methods)

Treatment Index variable

use `pl.update` often with "outer join"

`str`

Further comments

audeering / audb

Polars benchmarks methods #424

Benchmarking Polars (methods)

Treatment Index variable

use pl.update often with "outer join"

__str__

Further comments

use `pl.update` often with "outer join"

`str`