audeering / audeer

Helpful Python functions
https://audeering.github.io/audeer/
Other
3 stars 0 forks source link

Add audeer.unique() #149

Closed hagenw closed 3 months ago

hagenw commented 3 months ago

It might be that a user wants to get unique values of a sequence, in the order they are appearing in that sequence.

image

codecov[bot] commented 3 months ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 100.0%. Comparing base (ec72190) to head (e108c98). Report is 2 commits behind head on main.

Additional details and impacted files | [Files](https://app.codecov.io/gh/audeering/audeer/pull/149?dropdown=coverage&src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=audeering) | Coverage Δ | | |---|---|---| | [audeer/\_\_init\_\_.py](https://app.codecov.io/gh/audeering/audeer/pull/149?src=pr&el=tree&filepath=audeer%2F__init__.py&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=audeering#diff-YXVkZWVyL19faW5pdF9fLnB5) | `100.0% <100.0%> (ø)` | | | [audeer/core/utils.py](https://app.codecov.io/gh/audeering/audeer/pull/149?src=pr&el=tree&filepath=audeer%2Fcore%2Futils.py&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=audeering#diff-YXVkZWVyL2NvcmUvdXRpbHMucHk=) | `100.0% <100.0%> (ø)` | | ... and [1 file with indirect coverage changes](https://app.codecov.io/gh/audeering/audeer/pull/149/indirect-changes?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=audeering)
hagenw commented 3 months ago

typing.Sequence should be fine as the code will work with every object that implements __iter__.

But to support that, it is usually easier to convert everything to a fixed output type, instead of returning the original type. I also prefer to have a fixed output type in basic functions instead of defining the output type by the input type.

ChristianGeng commented 3 months ago

typing.Sequence should be fine as the code will work with every object that implements __iter__.

But to support that, it is usually easier to convert everything to a fixed output type, instead of returning the original type. I also prefer to have a fixed output type in basic functions instead of defining the output type by the input type.

I see: a DataFrame is not a sequence:

import pandas as pd
import collections.abc
df = pd.DataFrame.from_dict({
    'v1': [1,2,1,2,4,1],
    'v2': [2,4,4,2,3,1],
    }
)
typecheck = isinstance(df, collections.abc.Sequence)
print(typecheck)
print(df.__iter__)

results in

False
<bound method NDFrame.__iter__ of    v1  v2
0   1   2
1   2   4
2   1   4
3   2   2
4   4   3
5   1   1>

In other words: pd.Dataframe implements __iter__ but the object is not a sequence. Interestingly, a pd.Series is not a sequence either. Still unique works on it.

ser = pd.Series([1,2,1,2,4,1])
isinstance(ser, collections.abc.Sequence)
>>> False

But I am happy with it and will approve it as this becomes to academic. I think one should add a string to the test case and then I will approve.

ChristianGeng commented 3 months ago

I accidentally pressed the "close" button next to the comment button. Reopening. Apologies.

hagenw commented 3 months ago

In the current Python documentation, they write:

image

and we have

>>> isinstance(pd.Series([1]), collections.abc.Iterable)
True

So I changed the type to typing.Iterable, which seems to be a better fit to me. Do you agree?

ChristianGeng commented 3 months ago

So I changed the type to typing.Iterable, which seems to be a better fit to me. Do you agree?

Yes. A DataFrame is also an iterable though. I have approved already - it is probably hard to impossible to completely get it right with the python typing system: when one goes through the tests one can see what is going on. Alright with me.