audeering / audb

Manage audio and video databases
https://audeering.github.io/audb/

`Dependencies._column_loc`: files parameter has a mismatch between typing and implementation #407

Closed · ChristianGeng closed this 3 months ago

ChristianGeng commented 4 months ago

Problem

The typing of Dependencies._column_loc does not seem to match the implementation.

The typing allows sequence input, but for sequence input df.at[files, column] crashes, whereas df.loc[files, column].values.tolist() would return a list.
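For illustration, a minimal sketch of the mismatch, with a hypothetical signature inferred from this issue (not copied from the audb source):

```python
import typing

import pandas as pd


def _column_loc(
    df: pd.DataFrame,
    files: typing.Union[str, typing.Sequence[str]],  # typing allows a sequence...
    column: str,
):
    # ...but df.at only accepts a single row label,
    # so sequence input raises pandas.errors.InvalidIndexError
    return df.at[files, column]
```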

Possible solutions

- Restrict the typing of the files parameter to a single file, matching the df.at implementation.
- Extend the implementation to also handle sequence input, e.g. via df.loc (a vectorized version).


Example

>>> df.at['file-1.wav', column]
16
>>> df.at[['file-1.wav', 'file-0.wav'], column]
*** pandas.errors.InvalidIndexError: ['file-1.wav', 'file-0.wav']

>>> df.loc[['file-1.wav', 'file-0.wav'], column].values.tolist()
[16, 16]

>>> df.loc[['file-1.wav'], column].values.tolist()
[16]

Further Background

Benchmarking in benchmark-dependencies-methods.py uses list comprehensions of the form:

   [deps.bit_depth(file) for file in _files]

When evaluating polars, this becomes tremendously slow (57 s).

However, when I implement this based on a version of Dependencies._column_loc that can process iterable input, and call it as deps.bit_depth(_files), it is fast (0.0093 s), i.e. more than a factor of 6000 faster.
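One possible shape of such an iterable-aware _column_loc is sketched below (an assumption about the design, not the actual #370 code):

```python
import typing

import pandas as pd


def _column_loc(
    df: pd.DataFrame,
    files: typing.Union[str, typing.Sequence[str]],
    column: str,
):
    if isinstance(files, str):
        # fast scalar lookup for a single file
        return df.at[files, column]
    # vectorized lookup for an iterable of files
    return df.loc[list(files), column].values.tolist()
```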

Should this issue maybe be resolved before #385 can be meaningfully added to the benchmarks?

That said, I would of course favor implementing the vectorized version of _column_loc, but I do not know how common the use of such list comprehensions actually is.

What would your take on this be @hagenw ?

hagenw commented 4 months ago

I had implemented this in #370, but was not able to find a real benefit and reverted the changes in https://github.com/audeering/audb/pull/393. A solution to this problem in line with #393 would be to change the typing to support only a scalar.
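A minimal sketch of that scalar-only variant (hypothetical, reusing the naming from this issue):

```python
import pandas as pd


def _column_loc(df: pd.DataFrame, file: str, column: str):
    # the typing now promises a single file,
    # which matches the df.at implementation
    return df.at[file, column]
```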

ChristianGeng commented 4 months ago

> I had implemented this in #370, but was not able to find a real benefit and reverted the changes in #393. A solution to this problem in line with #393 would be to change the typing to support only a scalar.

In the context of benchmarking polars methods I have found quite large differences, typically a speedup factor of 600 or so.

I fear that the #370 implementation is lost as the commits got squashed, right?

This is a second independent run; most times, the unvectorized version is in the 2 seconds ballpark again.

All times are in seconds.

| method | pyarrow | string | object |
| --- | --- | --- | --- |
| `Dependencies.__call__()` | 0.000 | 0.000 | 0.000 |
| `Dependencies.__contains__(10000 files)` | 0.003 | 0.003 | 0.003 |
| `Dependencies.__get_item__(10000 files)` | 0.652 | 0.252 | 0.165 |
| `Dependencies.__len__()` | 0.000 | 0.000 | 0.000 |
| `Dependencies.__str__()` | 0.009 | 0.006 | 0.003 |
| `Dependencies.archives` | 0.146 | 0.150 | 0.122 |
| `Dependencies.attachments` | 0.019 | 0.026 | 0.018 |
| `Dependencies.attachment_ids` | 0.018 | 0.031 | 0.018 |
| `Dependencies.files` | 0.012 | 0.012 | 0.012 |
| `Dependencies.media` | 0.069 | 0.115 | 0.106 |
| `Dependencies.removed_media` | 0.065 | 0.107 | 0.072 |
| `Dependencies.table_ids` | 0.025 | 0.037 | 0.026 |
| `Dependencies.tables` | 0.017 | 0.026 | 0.017 |
| `Dependencies.archive(10000 files)` / vectorized | 0.011 | 0.010 | 0.009 |
| `Dependencies.archive(10000 files)` | 0.045 | 0.030 | 0.026 |
| `Dependencies.bit_depth(10000 files)` / vectorized | 0.002 | 0.003 | 0.002 |
| `Dependencies.bit_depth(10000 files)` | 2.786 | 2.171 | 1.653 |
| `Dependencies.channels(10000 files)` / vectorized | 0.003 | 0.003 | 0.003 |
| `Dependencies.channels(10000 files)` | 2.832 | 2.156 | 1.696 |
| `Dependencies.checksum(10000 files)` / vectorized | 0.003 | 0.003 | 0.003 |
| `Dependencies.checksum(10000 files)` | 2.468 | 1.975 | 1.467 |
| `Dependencies.duration(10000 files)` / vectorized | 0.003 | 0.003 | 0.003 |
| `Dependencies.duration(10000 files)` | 2.772 | 2.106 | 1.629 |
| `Dependencies.format(10000 files)` / vectorized | 0.003 | 0.003 | 0.003 |
| `Dependencies.format(10000 files)` | 2.519 | 1.979 | 1.457 |
| `Dependencies.removed(10000 files)` / vectorized | 0.003 | 0.003 | 0.003 |
| `Dependencies.removed(10000 files)` | 2.788 | 2.145 | 1.652 |
| `Dependencies.sampling_rate(10000 files)` / vectorized | 0.003 | 0.003 | 0.003 |
| `Dependencies.sampling_rate(10000 files)` | 2.824 | 2.165 | 1.707 |
| `Dependencies.type(10000 files)` / vectorized | 0.003 | 0.003 | 0.003 |
| `Dependencies.type(10000 files)` | 2.873 | 2.066 | 1.660 |
| `Dependencies.version(10000 files)` / vectorized | 0.003 | 0.003 | 0.003 |
| `Dependencies.version(10000 files)` | 2.605 | 1.900 | 1.519 |
| `Dependencies._add_attachment()` | 0.194 | 0.063 | 0.089 |
| `Dependencies._add_media(10000 files)` | 0.076 | 0.061 | 0.063 |
| `Dependencies._add_meta()` | 0.140 | 0.188 | 0.182 |
| `Dependencies._drop()` | 0.125 | 0.079 | 0.081 |
| `Dependencies._remove()` | 0.069 | 0.073 | 0.067 |
| `Dependencies._update_media()` | 0.153 | 0.084 | 0.088 |
| `Dependencies._update_media_version(10000 files)` | 0.029 | 0.010 | 0.010 |
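For reference, numbers like the vectorized/unvectorized pairs above could be collected with a simple timing loop (a hypothetical harness; the actual benchmark-dependencies-methods.py may differ):

```python
import time

# `deps` and `files` are assumed to exist, as in the benchmark script
t0 = time.time()
result = [deps.bit_depth(file) for file in files]  # unvectorized
print(f"unvectorized: {time.time() - t0:.3f} s")

t0 = time.time()
result = deps.bit_depth(files)  # vectorized, requires iterable-aware _column_loc
print(f"vectorized: {time.time() - t0:.3f} s")
```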
hagenw commented 4 months ago

> I fear that the https://github.com/audeering/audb/pull/370 implementation is lost as the commits got squashed, right?

I could check out the branch locally and have pushed it again. So I think you should be able to fetch it now as well. Otherwise, the changes are visible at https://github.com/audeering/audb/pull/370/files.

> This is a second independent run

The unvectorized results are for the implementation presented here?

For comparison, results from #370:

[benchmark results image]

[benchmark results image]

results from you:

[benchmark results image]

results from current main:

[benchmark results image]

The important column is pyarrow; there, your vectorized results are the fastest.

My implementation from #370 also shows faster results for the vectorized solution. The main reason I decided against #370 was that it does not translate into any real-world advantage when loading a database, as briefly mentioned in https://github.com/audeering/audb/pull/375#issuecomment-2092815946, and that it complicates the API, as the return value now depends on the input value, which is not so nice.
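To illustrate the API complication: with the vectorized variant, the return type follows the input type (hypothetical calls):

```python
deps.bit_depth("file-1.wav")  # -> 16, a scalar
deps.bit_depth(["file-1.wav", "file-0.wav"])  # -> [16, 16], a list
```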

ChristianGeng commented 4 months ago

I have now tried out your version that uses .at for single files. I can verify that .at is much faster than .loc for single files. The problem is then not the list comprehension but the .loc in the implementation.
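The gap can be reproduced with a small micro-benchmark like the following (hypothetical setup, not the audb code):

```python
import time

import pandas as pd

df = pd.DataFrame(
    {"bit_depth": [16] * 10_000},
    index=[f"file-{i}.wav" for i in range(10_000)],
)

t0 = time.time()
for file in df.index:
    df.at[file, "bit_depth"]  # optimized scalar accessor
print(f".at:  {time.time() - t0:.3f} s")

t0 = time.time()
for file in df.index:
    df.loc[file, "bit_depth"]  # general label-based accessor
print(f".loc: {time.time() - t0:.3f} s")
```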

In this case I would be in line with you and do away with the type hinting for sequences (allowing only single-file input).