SciRuby / daru

Data Analysis in RUby
BSD 2-Clause "Simplified" License

Request for discussion: Index unicality #432

Closed zverok closed 6 years ago

zverok commented 6 years ago

cc @v0dro @lokeshh (and everyone who is interested in Daru's future is welcome! especially guys who use Daru in complicated production tasks, @info-rchitect? @genya0407?)

What we have now:

Now:

The only thing that comes to mind is supporting "dirty" data (e.g. I am loading data from some .csv, saying that the first column should be the index, and it turns out that it has duplicate values; instead of "Error: Can't load non-unique data" I obtain a valid DataFrame, which can be examined and cleaned up). Are there any other reasonable use cases?
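
A minimal plain-Ruby sketch of the "dirty data" scenario described above. It does not use Daru's API; the rows and variable names are made up for illustration, and the point is only to show how duplicate index labels could be found once the data is loaded rather than rejected:

```ruby
# Hypothetical rows as they might come out of a CSV parser; the first
# column is meant to become the index, but the label "b" appears twice.
rows = [
  ['a', 1],
  ['b', 2],
  ['b', 3],
  ['c', 4]
]

# Group row positions by the would-be index label.
positions_by_label = Hash.new { |hash, label| hash[label] = [] }
rows.each_with_index { |(label, _value), pos| positions_by_label[label] << pos }

# Labels that map to more than one position are what a strict index rejects.
duplicates = positions_by_label.select { |_label, positions| positions.size > 1 }

puts duplicates.inspect
```

With a tolerant loader, the offending rows can be inspected and cleaned up afterwards; with a strict one, the same check would have to run before the index is built.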

This way or that, I believe that all indexes should be consistent: either allow duplicate values, or forbid them.

I'd like to understand clearly what is the most reasonable solution.

info-rchitect commented 6 years ago

@zverok I agree with your statement on index uniqueness; intuitively, they should always be unique. It would be helpful to see the pandas spec tests for such a non-unique index. It would seem to be a drastic and huge leap to change to non-unique indices.

baarkerlounger commented 6 years ago

I also think indexes should be unique and it seems weird Pandas allows that. To me that's what categorical indexes are for. For the use case of importing dirty data from csv we now have the .uniq method on dataframe for cleaning duplicate values.

lokeshh commented 6 years ago

Use cases for categorical index:

  1. When you want constant time lookup with a column that's non unique

  2. Since we have daru-io integrating web databases with data frames, index uniqueness is not a restriction on databases. Imposing it on data frames would be a problem.

zverok commented 6 years ago

@lokeshh

  1. When you want constant time lookup with a column that's non unique

But what is the logical use case here? I don't believe "design driven by optimization" is really a good thing.

  2. Since we have daru-io integrating web databases with data frames, index uniqueness is not a restriction on databases. Imposing it on data frames would be a problem.

Our index is "axis labels"; it has nothing in common with a DB index (which is an access optimizer), and we don't import indexes from DBs.

lokeshh commented 6 years ago

How about a vector where the data value is a country name and the index is the continent these countries belong to? If we restrict the index to be unique, we won't be able to express this vector.
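
To make the continent/country example concrete, here is a small plain-Ruby sketch. It is not Daru code; the data and the `values_for` helper are hypothetical. It shows that once labels repeat, a lookup by label can only answer with a list of values:

```ruby
# Hypothetical data: continents as labels, countries as values.
labels = ['Europe', 'Europe', 'Asia', 'Africa']
values = ['France', 'Poland', 'Japan', 'Kenya']

# Collect every value whose label matches; with repeated labels
# the result may hold more than one element.
def values_for(labels, values, label)
  labels.each_index
        .select { |pos| labels[pos] == label }
        .map { |pos| values[pos] }
end

puts values_for(labels, values, 'Europe').inspect
puts values_for(labels, values, 'Asia').inspect
```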

zverok commented 6 years ago

@lokeshh but what would be the physical meaning of this vector? By definition, a vector is an ordered list of values, and the index is the set of axis labels for these values. Each label corresponds to exactly one value (or exactly one row in the dataframe case). MultiIndex adds label clustering, DateTimeIndex adds slicing by time periods, but both preserve the invariant of "each value has its own label".

What kind of axis has the same label at different positions? How can it be used in practice?
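
The invariant "each value has its own label" can be sketched in plain Ruby as follows. `StrictIndex` is a hypothetical class written for illustration, not Daru's actual `Index` implementation; it assumes Ruby 2.7+ for `Enumerable#tally`:

```ruby
# Hypothetical strict index: every label corresponds to exactly one position.
class StrictIndex
  def initialize(labels)
    # Count occurrences and refuse any label that appears more than once.
    duplicates = labels.tally.select { |_label, count| count > 1 }.keys
    raise ArgumentError, "duplicate labels: #{duplicates.inspect}" unless duplicates.empty?

    # Map each label to its single position on the axis.
    @positions = labels.each_with_index.to_h
  end

  # Returns the single position for a label, or raises KeyError if absent.
  def pos(label)
    @positions.fetch(label)
  end
end

index = StrictIndex.new(%w[a b c])
puts index.pos('b') # prints 1
```

Under this design, `StrictIndex.new(%w[a b b])` raises `ArgumentError` instead of silently producing an axis with two identical labels.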

gnilrets commented 6 years ago

Why wouldn't you just use a two-vector dataframe to represent the continent-country situation?

lokeshh commented 6 years ago

Do you agree that index is a way to label data? If so then this vector can be thought of as a set of countries labeled by continents. Now why should we restrict labeling to be unique?

zverok commented 6 years ago

Do you agree that index is a way to label data?

To provide axis labels, not to label any data with anything. That is, unless you can show some real use case (with code, preferably), not a theoretical one like "we can say that apples are labels for oranges; how does Vector handle this?"

baarkerlounger commented 6 years ago

Shouldn't the country/continent case be a categorical index? The continent is a category that can have multiple values...

gnilrets commented 6 years ago

I still don't understand why we need the index to represent anything. It really seems to me that an integer index would work and be far simpler to implement.

What can you accomplish with an index that represents meaningful data (like continent) that you can't accomplish with a 2-vector dataframe?

Certainly there will be some syntax differences: e.g., myvec['europe'] vs mydf.where(mydf['continent'].eq('europe')). But really, I don't get why indexes have to contain data.
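
A plain-Ruby sketch of the two-column alternative, assuming made-up data and not using Daru itself. The rows keep an ordinary positional (integer) index, and the continent is just another column that can be filtered or grouped:

```ruby
# Hypothetical two-column table: each row carries its own continent and country.
rows = [
  { continent: 'Europe', country: 'France' },
  { continent: 'Europe', country: 'Poland' },
  { continent: 'Asia',   country: 'Japan' }
]

# Filtering on the continent column replaces the lookup by a repeated label.
european = rows.select { |row| row[:continent] == 'Europe' }
               .map { |row| row[:country] }

# Grouping yields every continent with its countries in one pass.
by_continent = rows.group_by { |row| row[:continent] }
                   .transform_values { |group| group.map { |row| row[:country] } }

puts european.inspect
puts by_continent.inspect
```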

lokeshh commented 6 years ago

Hands down. Let's go for index uniqueness. It doesn't restrict any use case functionally; it's just a theoretical restriction for me, which isn't such a big deal, I guess.

zverok commented 6 years ago

Thanks everybody! Let it be so.