audeering / audb

Manage audio and video databases
https://audeering.github.io/audb/

`Dependencies._column_loc`: files parameter has a mismatch between typing and implementation #407

Closed · ChristianGeng closed this 3 months ago

ChristianGeng commented 4 months ago

Problem

The typing of Dependencies._column_loc does not seem to match the implementation.

The typing allows sequence input, but for sequence input df.at[files, column] crashes, whereas df.loc[files, column].values.tolist() would return a list.
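For illustration, a minimal sketch of the mismatch, with a hypothetical signature inferred from this issue (not copied from the audb source):

```python
import typing

import pandas as pd


def _column_loc(
    df: pd.DataFrame,
    files: typing.Union[str, typing.Sequence[str]],  # typing allows a sequence...
    column: str,
):
    # ...but df.at only accepts a single row label,
    # so sequence input raises pandas.errors.InvalidIndexError
    return df.at[files, column]
```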

Possible solutions

- Restrict the typing of the files parameter to a single file, matching the df.at implementation.
- Extend the implementation to also handle sequence input, e.g. via df.loc (a vectorized version).


Example

>>> df.at['file-1.wav', column]
16
>>> df.at[['file-1.wav', 'file-0.wav'], column]
*** pandas.errors.InvalidIndexError: ['file-1.wav', 'file-0.wav']

>>> df.loc[['file-1.wav', 'file-0.wav'], column].values.tolist()
[16, 16]

>>> df.loc[['file-1.wav'], column].values.tolist()
[16]

Further Background

Benchmarking in benchmark-dependencies-methods.py uses list comprehensions of the form:

   [deps.bit_depth(file) for file in _files]

When evaluating polars, this becomes tremendously slow (57 s).

However, when I implement this based on a version of Dependencies._column_loc that can process iterable input, and call it as deps.bit_depth(_files), it is fast (0.0093 s), i.e. more than a factor of 6000 faster.
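One possible shape of such an iterable-aware _column_loc is sketched below (an assumption about the design, not the actual #370 code):

```python
import typing

import pandas as pd


def _column_loc(
    df: pd.DataFrame,
    files: typing.Union[str, typing.Sequence[str]],
    column: str,
):
    if isinstance(files, str):
        # fast scalar lookup for a single file
        return df.at[files, column]
    # vectorized lookup for an iterable of files
    return df.loc[list(files), column].values.tolist()
```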

Should this issue maybe be resolved before #385 can be meaningfully added to the benchmarks?

That said, I would of course favor implementing the vectorized version of _column_loc, but I do not know how common the use of such list comprehensions actually is.

What would your take on this be @hagenw ?

hagenw commented 4 months ago

I had implemented this in #370, but was not able to find a real benefit and reverted the changes in https://github.com/audeering/audb/pull/393. A solution to this problem in line with #393 would be to change the typing to support only a scalar.
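A minimal sketch of that scalar-only variant (hypothetical, reusing the naming from this issue):

```python
import pandas as pd


def _column_loc(df: pd.DataFrame, file: str, column: str):
    # the typing now promises a single file,
    # which matches the df.at implementation
    return df.at[file, column]
```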

ChristianGeng commented 4 months ago

> I had implemented this in #370, but was not able to find a real benefit and reverted the changes in #393. A solution to this problem in line with #393 would be to change the typing to support only a scalar.

In the context of benchmarking polars methods I have found quite large differences, typically a speedup factor of 600 or so.

I fear that the #370 implementation is lost as the commits got squashed, right?

This is a second independent run; most times, the unvectorized version is in the 2 seconds ballpark again.

All times are in seconds.

| method | pyarrow | string | object |
| --- | --- | --- | --- |
| `Dependencies.__call__()` | 0.000 | 0.000 | 0.000 |
| `Dependencies.__contains__(10000 files)` | 0.003 | 0.003 | 0.003 |
| `Dependencies.__get_item__(10000 files)` | 0.652 | 0.252 | 0.165 |
| `Dependencies.__len__()` | 0.000 | 0.000 | 0.000 |
| `Dependencies.__str__()` | 0.009 | 0.006 | 0.003 |
| `Dependencies.archives` | 0.146 | 0.150 | 0.122 |
| `Dependencies.attachments` | 0.019 | 0.026 | 0.018 |
| `Dependencies.attachment_ids` | 0.018 | 0.031 | 0.018 |
| `Dependencies.files` | 0.012 | 0.012 | 0.012 |
| `Dependencies.media` | 0.069 | 0.115 | 0.106 |
| `Dependencies.removed_media` | 0.065 | 0.107 | 0.072 |
| `Dependencies.table_ids` | 0.025 | 0.037 | 0.026 |
| `Dependencies.tables` | 0.017 | 0.026 | 0.017 |
| `Dependencies.archive(10000 files)` / vectorized | 0.011 | 0.010 | 0.009 |
| `Dependencies.archive(10000 files)` | 0.045 | 0.030 | 0.026 |
| `Dependencies.bit_depth(10000 files)` / vectorized | 0.002 | 0.003 | 0.002 |
| `Dependencies.bit_depth(10000 files)` | 2.786 | 2.171 | 1.653 |
| `Dependencies.channels(10000 files)` / vectorized | 0.003 | 0.003 | 0.003 |
| `Dependencies.channels(10000 files)` | 2.832 | 2.156 | 1.696 |
| `Dependencies.checksum(10000 files)` / vectorized | 0.003 | 0.003 | 0.003 |
| `Dependencies.checksum(10000 files)` | 2.468 | 1.975 | 1.467 |
| `Dependencies.duration(10000 files)` / vectorized | 0.003 | 0.003 | 0.003 |
| `Dependencies.duration(10000 files)` | 2.772 | 2.106 | 1.629 |
| `Dependencies.format(10000 files)` / vectorized | 0.003 | 0.003 | 0.003 |
| `Dependencies.format(10000 files)` | 2.519 | 1.979 | 1.457 |
| `Dependencies.removed(10000 files)` / vectorized | 0.003 | 0.003 | 0.003 |
| `Dependencies.removed(10000 files)` | 2.788 | 2.145 | 1.652 |
| `Dependencies.sampling_rate(10000 files)` / vectorized | 0.003 | 0.003 | 0.003 |
| `Dependencies.sampling_rate(10000 files)` | 2.824 | 2.165 | 1.707 |
| `Dependencies.type(10000 files)` / vectorized | 0.003 | 0.003 | 0.003 |
| `Dependencies.type(10000 files)` | 2.873 | 2.066 | 1.660 |
| `Dependencies.version(10000 files)` / vectorized | 0.003 | 0.003 | 0.003 |
| `Dependencies.version(10000 files)` | 2.605 | 1.900 | 1.519 |
| `Dependencies._add_attachment()` | 0.194 | 0.063 | 0.089 |
| `Dependencies._add_media(10000 files)` | 0.076 | 0.061 | 0.063 |
| `Dependencies._add_meta()` | 0.140 | 0.188 | 0.182 |
| `Dependencies._drop()` | 0.125 | 0.079 | 0.081 |
| `Dependencies._remove()` | 0.069 | 0.073 | 0.067 |
| `Dependencies._update_media()` | 0.153 | 0.084 | 0.088 |
| `Dependencies._update_media_version(10000 files)` | 0.029 | 0.010 | 0.010 |
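For reference, numbers like the vectorized/unvectorized pairs above could be collected with a simple timing loop (a hypothetical harness; the actual benchmark-dependencies-methods.py may differ):

```python
import time

# `deps` and `files` are assumed to exist, as in the benchmark script
t0 = time.time()
result = [deps.bit_depth(file) for file in files]  # unvectorized
print(f"unvectorized: {time.time() - t0:.3f} s")

t0 = time.time()
result = deps.bit_depth(files)  # vectorized, requires iterable-aware _column_loc
print(f"vectorized: {time.time() - t0:.3f} s")
```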
hagenw commented 4 months ago

> I fear that the https://github.com/audeering/audb/pull/370 implementation is lost as the commits got squashed, right?

I could check out the branch locally and have pushed it again. So I think you should be able to fetch it now as well. Otherwise, the changes are visible at https://github.com/audeering/audb/pull/370/files.

> This is a second independent run

The unvectorized results are for the implementation presented here?

For comparison, results from #370:

[benchmark results image]

[benchmark results image]

results from you:

[benchmark results image]

results from current main:

[benchmark results image]

The important column is pyarrow; there, your vectorized results are the fastest.

My implementation from #370 also shows faster results for the vectorized solution. The main reason I decided against #370 was that it does not translate into any real-world advantage when loading a database, as briefly mentioned in https://github.com/audeering/audb/pull/375#issuecomment-2092815946, and that it complicates the API, as the return value now depends on the input value, which is not so nice.
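To illustrate the API complication: with the vectorized variant, the return type follows the input type (hypothetical calls):

```python
deps.bit_depth("file-1.wav")  # -> 16, a scalar
deps.bit_depth(["file-1.wav", "file-0.wav"])  # -> [16, 16], a list
```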

ChristianGeng commented 4 months ago

I have now tried out your version that uses .at for single files. I can verify that .at is much faster than .loc for single files. The problem is then not the list comprehension but the .loc in the implementation.
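The gap can be reproduced with a small micro-benchmark like the following (hypothetical setup, not the audb code):

```python
import time

import pandas as pd

df = pd.DataFrame(
    {"bit_depth": [16] * 10_000},
    index=[f"file-{i}.wav" for i in range(10_000)],
)

t0 = time.time()
for file in df.index:
    df.at[file, "bit_depth"]  # optimized scalar accessor
print(f".at:  {time.time() - t0:.3f} s")

t0 = time.time()
for file in df.index:
    df.loc[file, "bit_depth"]  # general label-based accessor
print(f".loc: {time.time() - t0:.3f} s")
```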

In this case I would be in line with you and do away with the type hinting for sequences (allowing only single-file input).