Closed by ChristianGeng 3 months ago
I had implemented this in #370, but was not able to find a real benefit and reverted the changes in https://github.com/audeering/audb/pull/393. A solution to this problem in line with #393 would be to change the typing to support only a scalar.
In the context of benchmarking polars methods I have found quite large differences, typically a speedup factor of 600 or so.
I fear that the #370 implementation is lost as the commits got squashed, right?
This is a second independent run; most times the unvectorized variant is in the 2 seconds ballpark again.
method | pyarrow | string | object |
---|---|---|---|
Dependencies.call() | 0.000 | 0.000 | 0.000 |
Dependencies.contains(10000 files) | 0.003 | 0.003 | 0.003 |
Dependencies.__getitem__(10000 files) | 0.652 | 0.252 | 0.165 |
Dependencies.len() | 0.000 | 0.000 | 0.000 |
Dependencies.str() | 0.009 | 0.006 | 0.003 |
Dependencies.archives | 0.146 | 0.150 | 0.122 |
Dependencies.attachments | 0.019 | 0.026 | 0.018 |
Dependencies.attachment_ids | 0.018 | 0.031 | 0.018 |
Dependencies.files | 0.012 | 0.012 | 0.012 |
Dependencies.media | 0.069 | 0.115 | 0.106 |
Dependencies.removed_media | 0.065 | 0.107 | 0.072 |
Dependencies.table_ids | 0.025 | 0.037 | 0.026 |
Dependencies.tables | 0.017 | 0.026 | 0.017 |
Dependencies.archive(10000 files) / vectorized | 0.011 | 0.010 | 0.009 |
Dependencies.archive(10000 files) | 0.045 | 0.030 | 0.026 |
Dependencies.bit_depth(10000 files) / vectorized | 0.002 | 0.003 | 0.002 |
Dependencies.bit_depth(10000 files) | 2.786 | 2.171 | 1.653 |
Dependencies.channels(10000 files) / vectorized | 0.003 | 0.003 | 0.003 |
Dependencies.channels(10000 files) | 2.832 | 2.156 | 1.696 |
Dependencies.checksum(10000 files) / vectorized | 0.003 | 0.003 | 0.003 |
Dependencies.checksum(10000 files) | 2.468 | 1.975 | 1.467 |
Dependencies.duration(10000 files) / vectorized | 0.003 | 0.003 | 0.003 |
Dependencies.duration(10000 files) | 2.772 | 2.106 | 1.629 |
Dependencies.format(10000 files) / vectorized | 0.003 | 0.003 | 0.003 |
Dependencies.format(10000 files) | 2.519 | 1.979 | 1.457 |
Dependencies.removed(10000 files) / vectorized | 0.003 | 0.003 | 0.003 |
Dependencies.removed(10000 files) | 2.788 | 2.145 | 1.652 |
Dependencies.sampling_rate(10000 files) / vectorized | 0.003 | 0.003 | 0.003 |
Dependencies.sampling_rate(10000 files) | 2.824 | 2.165 | 1.707 |
Dependencies.type(10000 files) / vectorized | 0.003 | 0.003 | 0.003 |
Dependencies.type(10000 files) | 2.873 | 2.066 | 1.660 |
Dependencies.version(10000 files) / vectorized | 0.003 | 0.003 | 0.003 |
Dependencies.version(10000 files) | 2.605 | 1.900 | 1.519 |
Dependencies._add_attachment() | 0.194 | 0.063 | 0.089 |
Dependencies._add_media(10000 files) | 0.076 | 0.061 | 0.063 |
Dependencies._add_meta() | 0.140 | 0.188 | 0.182 |
Dependencies._drop() | 0.125 | 0.079 | 0.081 |
Dependencies._remove() | 0.069 | 0.073 | 0.067 |
Dependencies._update_media() | 0.153 | 0.084 | 0.088 |
Dependencies._update_media_version(10000 files) | 0.029 | 0.010 | 0.010 |
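To illustrate what the vectorized and unvectorized rows above measure, here is a minimal sketch. The DataFrame layout and column name are assumptions for illustration, not the actual audb internals:

```python
import pandas as pd

# Toy stand-in for the dependency table: files as index, one metadata column.
df = pd.DataFrame(
    {"bit_depth": [16] * 10_000},
    index=[f"file-{i}.wav" for i in range(10_000)],
)
files = list(df.index)

# Unvectorized: one scalar .loc lookup per file
# (what a list comprehension over the API does).
unvectorized = [df.loc[f, "bit_depth"] for f in files]

# Vectorized: a single .loc call with the whole list of labels.
vectorized = df.loc[files, "bit_depth"].tolist()

assert unvectorized == vectorized
```

The per-call overhead of `.loc` dominates the unvectorized variant, which is why the gap grows with the number of files.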
> I fear that the https://github.com/audeering/audb/pull/370 implementation is lost as the commits got squashed, right?
I was able to check out the branch locally and pushed it again. So I think you should be able to fetch it now as well. Otherwise, the changes are visible at https://github.com/audeering/audb/pull/370/files.
> This is a second independent run
Are the unvectorized results for the implementation presented here?
For comparison, results from #370
results from you
results from current main
The important column is pyarrow. There, your vectorized results are the fastest.
My implementation from #370 also shows faster results for the vectorized solution. The main reason I decided against #370 was that it does not translate into any real-world advantage when loading a database, as briefly mentioned in https://github.com/audeering/audb/pull/375#issuecomment-2092815946, and that it complicates the API, as the return type then depends on the input value, which is not so nice.
I have now tried out your version that uses `.at` for single files. I can verify that `.at` is much faster than `.loc` for single files. The problem is then not the list comprehension but the `.loc` in the implementation.
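The `.at` vs `.loc` difference for single-label access can be reproduced with a quick timing sketch on toy data (not the actual audb dependency table):

```python
import timeit

import pandas as pd

df = pd.DataFrame(
    {"value": range(10_000)},
    index=[f"file-{i}.wav" for i in range(10_000)],
)

# .at is optimized for exactly one scalar lookup;
# .loc also supports lists and slices and carries more overhead per call.
t_at = timeit.timeit(lambda: df.at["file-5000.wav", "value"], number=10_000)
t_loc = timeit.timeit(lambda: df.loc["file-5000.wav", "value"], number=10_000)
print(f".at: {t_at:.3f}s  .loc: {t_loc:.3f}s")
```

Both return the same scalar; only the access path differs.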
In this case I would be in line with you and do away with the type hinting for sequences (and support only single-file input).
Problem

The typing of `Dependencies._column_loc` does not seem to match the implementation: `files` seems to be implemented for scalar input, whereas the type hint suggests that list input should be possible. For list input `df.at[files, column]` would crash, whereas `df.loc[files, column].values.tolist()` would return a list.

Possible solutions
Additional points
Example
Further Background
Benchmarking in `benchmark-dependencies-methods.py` uses list comprehensions of the form:
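The pattern is presumably of this shape (a self-contained sketch with a stub `Deps` class; the real `Dependencies` object lives in audb and its internals may differ):

```python
class Deps:
    """Stub standing in for audb's Dependencies object."""

    def __init__(self, table):
        self._table = table  # dict: file -> bit depth

    def bit_depth(self, file):
        # One scalar lookup per call, as in the benchmarked pattern.
        return self._table[file]


deps = Deps({"a.wav": 16, "b.wav": 24})
files = ["a.wav", "b.wav"]

# The list comprehension shape used in the benchmark script:
bit_depths = [deps.bit_depth(file) for file in files]
assert bit_depths == [16, 24]
```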
When evaluating polars this becomes tremendously slow (57 sec.). However, when I implement this based on a version of `Dependencies._column_loc` that can process iterable input and call it with `deps.bit_depth(_files)`, this is fast (0.009280681610107422 sec.), which is a factor of more than 6000 times faster. Should this issue maybe be resolved before #385 can be meaningfully added to the benchmarks?
That said, I would of course favor implementing the vectorized version of `_column_loc`, but I do not know how common the use of the list comprehension actually is. What would your take on this be, @hagenw?