ChristianGeng opened 1 month ago
The implementation with `polars` seems to face the same problem I encountered when trying to use `pyarrow.Table` instead of `pandas.DataFrame`, see https://github.com/audeering/audb/pull/356. In general, performance is as good as or better than with `pandas.DataFrame`, but not when we need to address single rows.

Regarding `lance`, I created https://github.com/audeering/audb/pull/425 as a first try at benchmarks. But to me it looks like it is not worth continuing in this direction for now. Reading from a LANCE file is faster only when we stay with the `lance.LanceDataset` object. When trying to actually work with it, I am sure we will face similar problems with addressing single rows, as we have seen in https://github.com/audeering/audb/pull/356 and as we see here.
These two of your comments belong together. I will comment on all the changes in a separate thread summarizing everything I changed yesterday. The gist of it is that the lack of speed for single-element access comes from the fact that only pandas has indices.
> The implementation with `polars` seems to face the same problem I encountered when trying to use `pyarrow.Table` instead of `pandas.DataFrame`, see #356. In general, performance is as good as or better than with `pandas.DataFrame`, but not when we need to address single rows. This implementation is very slow at the moment when requesting a single file. Is there maybe something similar to `df.at` with polars to speed this up?
The previous version of this MR assumed that `Dependencies._column_loc` would operate in a vectorized fashion. Instead of implementing that, however, we decided to also roll back the type hints, meaning that we essentially work with single-element access.
The migration guide, and a blog post in more detail, discuss that polars does not implement indices. This means that random access to a single element cannot be fast per se, as the whole data has to be searched: taking the value at a given index is O(1), but finding the index of a given value is O(N). A Stack Overflow post recommends maintaining a dict. Python dicts are implemented as hash tables, so lookups are O(1) on average; I am unsure whether this is the best implementation, but I have used a plain Python dict for now. Sorry for being lengthy, but this is also the reason why pyarrow random access fails.
I am currently maintaining the index as a variable `Dependencies._idx` and a private method `Dependencies._update_idx`. The index contains something like `{'file0.wav': 0, 'file1.wav': 1}`; one can then use something like `df.row(self._idx[file])` to locate elements. As many of the benchmarked methods operate through `Dependencies._column_loc`, all of these are affected.
**`pl.update`, often with an "outer join"**

When actually adding or changing data, I have refactored the slow methods using `pl.update`. This API is marked unstable though, and one would expect it to break at some later time. For the methods where polars was fast in the first place I have not done so, so this is a little inconsistent.
**`__str__`**

`__str__` had been slow with polars' default settings. I have tweaked these to become fast, but tried to stay with the 15 lines of output that pandas uses.
method | pandas | polars | winner | factor |
---|---|---|---|---|
Dependencies.call() | 0.000 | 0.000 | polars | 2.667 |
Dependencies.contains(10000 files) | 0.003 | 0.002 | polars | 2.005 |
Dependencies.__get_item__(10000 files) | 0.648 | 0.013 | polars | 50.382 |
Dependencies.len() | 0.000 | 0.000 | pandas | 1.300 |
Dependencies.str() | 0.004 | 0.000 | polars | 24.677 |
Dependencies._add_attachment() | 0.171 | 0.104 | polars | 1.645 |
Dependencies._add_media(10000 files) | 0.073 | 0.008 | polars | 9.589 |
Dependencies._add_meta() | 0.127 | 0.100 | polars | 1.260 |
Dependencies._drop() | 0.118 | 0.021 | polars | 5.628 |
Dependencies._remove() | 0.067 | 0.002 | polars | 39.324 |
Dependencies._update_media() | 0.142 | 0.066 | polars | 2.148 |
Dependencies._update_media_version(10000 files) | 0.021 | 0.016 | polars | 1.341 |
Dependencies.archive(10000 files) | 0.045 | 0.014 | polars | 3.250 |
Dependencies.archives | 0.145 | 0.151 | pandas | 1.045 |
Dependencies.attachment_ids | 0.018 | 0.008 | polars | 2.375 |
Dependencies.attachments | 0.017 | 0.008 | polars | 2.194 |
Dependencies.bit_depth(10000 files) | 0.029 | 0.014 | polars | 2.031 |
Dependencies.channels(10000 files) | 0.030 | 0.013 | polars | 2.224 |
Dependencies.checksum(10000 files) | 0.030 | 0.014 | polars | 2.201 |
Dependencies.duration(10000 files) | 0.028 | 0.014 | polars | 2.066 |
Dependencies.files | 0.012 | 0.011 | polars | 1.040 |
Dependencies.format(10000 files) | 0.033 | 0.014 | polars | 2.345 |
Dependencies.media | 0.068 | 0.040 | polars | 1.702 |
Dependencies.removed(10000 files) | 0.029 | 0.014 | polars | 2.118 |
Dependencies.removed_media | 0.068 | 0.038 | polars | 1.809 |
Dependencies.sampling_rate(10000 files) | 0.029 | 0.014 | polars | 2.102 |
Dependencies.table_ids | 0.025 | 0.013 | polars | 1.927 |
Dependencies.tables | 0.017 | 0.008 | polars | 2.166 |
Dependencies.type(10000 files) | 0.028 | 0.014 | polars | 2.063 |
Dependencies.version(10000 files) | 0.032 | 0.013 | polars | 2.372 |
Great, thanks for your effort, now we can directly compare `polars` to our current solution. And it turns out that `polars` is indeed slightly faster (or much faster for `Dependencies.__get_item__()`).

There are a few points that need to be considered when switching to `polars` for handling dependencies:

- it would add another dependency
- we also need to see how the performance of loading and saving parquet files is

I would propose not to consider switching to `polars` for now, and to first focus on a few other features. But it might indeed be a nice option to tackle at some point.

I think it would make sense to merge this into the `main` branch for documentation purposes. Before doing so, could you please also update the `requirements.txt` file in the `benchmarks/` folder, adding everything we need to run your scripts, and add the results to `benchmarks/README.md`.
> Great, thanks for your effort, now we can directly compare `polars` to our current solution. And it turns out that `polars` is indeed slightly faster (or much faster for `Dependencies.__get_item__()`).
>
> There are a few points that need to be considered when switching to `polars` for handling dependencies:
>
> - it would add another dependency
> - we also need to see how the performance of loading and saving parquet files is
>
> I would propose not to consider switching to `polars` for now, and to first focus on a few other features. But it might indeed be a nice option to tackle at some point.

I also think that this is quite ambitious for now: it would necessitate refactoring all tests, so this is a larger decision. I have not tackled loading and saving here: my understanding was that pyarrow will be used under the hood anyway, so I perceived the comparisons in this module as the more interesting ones. Should a follow-up issue be created to achieve this?
> I think it would make sense to merge this into the `main` branch for documentation purposes. Before doing so, could you please also update the `requirements.txt` file in the `benchmarks/` folder, adding everything we need to run your scripts, and add the results to `benchmarks/README.md`.
I have updated the requirements and the README. I also committed the script that I used to run the comparison. In turn, this makes the local `utils.py` obsolete.
closes #385
Benchmarking Polars (methods)
The current merge request monkeypatches the dependencies module and replaces it with a polars version. At the current stage I am unsure whether it makes sense to include it in the code base, or whether a gist with the files doing such an evaluation would be the better change. This is due to the currently rather mixed results in the benchmarking table, and due to the fact that we possibly do not want to maintain a polars dependencies module given that there are future changes to the module.
Polars uses Arrow memory automatically. Furthermore, a cast to dtype `object` is impossible. Therefore the comparison is limited to "pandas/string" versus polars. Possibly one could compare against pyarrow if that is more meaningful; the polars results would stay identical.

It probably makes sense to alter the scope of the issue: while it is useful to benchmark the methods using polars, for benchmarking file I/O it would probably make more sense to test whether lance is faster than other formats. Somehow this is a different question, though.
Comments:

- "pandas casting" means that I have not bothered to improve the implementation a lot, so polars converts the df from pandas and back on return.
- concat (faster than in place): an in-place version of this particular function was also tried; interestingly, it was slower than the concat version that allocates new memory for the returned table.
- There are a few methods that use concat. Note that for polars this alters the sort order of the dataframe. As access is always per (pandas) index, this should not matter, should it?
- approx 14.14: is not run, as the polars version is extremely slow for now.
- both unvectorized: slow for polars, but an improvement is probably possible.
Results and Interpretation:

In the current scenario I can see no benefit in replacing pandas with the polars dataframe engine. What would make sense, however, is to streamline the code to include #407 and to make all methods using `Dependencies._column_loc` use vectorized code. I found a few instances where the current implementation iterates over files and calls `Dependencies._column_loc` on each single file.

Further direction: to be on the safe side, one should probably extend the polars method benchmarking to include different data sizes: it might be that polars shines more on larger datasets (given that it is advertised as having better threading). So the question would be: how would a fortunate setting of `num_rows` and `n_files` look, without making it unrealistically big?

I am tentatively requesting a review, despite knowing that this might need to be changed and/or extended.