Open zverok opened 7 years ago
@zverok , can you please show some examples and expected output ?
@Shekharrajak, I believe that:
E.g.:
df = Daru::DataFrame.new({b: [11,12], a: [101,102], c: [11,22]},
order: [:a, :b, :c],
index: [[:k], [:l]])
# v1:
# ArgumentError: MultiIndex can't consist of single-element tuples!
# or v2:
df.index
# => #<Daru::Index(2): {k, l}> -- not MultiIndex!
And
df = Daru::DataFrame.new({b: [11,12,13,14,15], a: [101,102,103,104,105],
c: [11,22,33,44,55]},
order: [:a, :b, :c],
index: [[:k], [:k], [:k], [:l], [:l]])
# ArgumentError: repeating values in index!
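A minimal sketch of the two behaviours proposed above, written in plain Ruby so it runs without Daru. The helper name `normalize_index_labels` is hypothetical and is not part of Daru's API; it only illustrates "v2" (unwrapping single-element tuples into a plain index) and the duplicate check from the second example.

```ruby
# Hypothetical helper, NOT part of Daru: illustrates the proposed checks only.
def normalize_index_labels(labels)
  # "v2" behaviour: [[:k], [:l]] is treated as the plain labels [:k, :l].
  # Real tuples such as [:k, :m] are left untouched.
  if !labels.empty? && labels.all? { |l| l.is_a?(Array) && l.size == 1 }
    labels = labels.map(&:first)
  end

  # Second example: refuse labels that occur more than once.
  duplicates = labels.group_by(&:itself).select { |_, v| v.size > 1 }.keys
  unless duplicates.empty?
    raise ArgumentError, "repeating values in index: #{duplicates.inspect}"
  end

  labels
end

normalize_index_labels([[:k], [:l]]) # => [:k, :l]
```

Whether repeated labels should raise or be accepted is exactly the open question in this thread; the sketch picks the stricter option only for illustration.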
Thanks! I think v2 would be good for the 1st example.
For the 2nd example, I think it should allow repeating index values. That means that in the 2nd example df must be:
=> #<Daru::DataFrame(5x3)>
a b c
k 101 11 11
k 102 12 22
k 103 13 33
l 104 14 44
l 105 15 55
So when the user wants the values at index k:
df[:a][:k]
a
k 101
k 102
k 103
That means
irb(main):025:0> df = Daru::DataFrame.new({b: [11,12], a: [101,102], c: [11,22]},
irb(main):026:1* order: [:a, :b, :c],
irb(main):027:1* index: [[:k, :m], [:k, :m]])
=> #<Daru::DataFrame(2x3)>
a b c
k m 101 11 11
m 102 12 22
and not this:
=> #<Daru::DataFrame(2x3)>
a b c
k m 101 11 11
102 12 22
So that we can access the rows using df[:a][:k], which means:
a
m 101
m 102
Is it a good idea? @zverok
I think, it should allow repeating index values.
I believe an index should, by definition, be unique (it becomes complicated with "category indexes", and I do not feel I clearly understand the matter, but the generic rule is simple: "an index is unique names for rows"). But it is just my opinion.
@v0dro @lokeshh WDYT?
Pandas allows repeating values in an index. However, since we haven't come across a concrete use case where this functionality is useful, I think there is no need to spend effort on making it happen. We would most likely need to change the underlying data structure for storing the index (it's currently a Hash), and making it as fast as a Hash (in pure Ruby) will be a challenge.
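To illustrate the data-structure change mentioned above, here is a rough sketch of one possible layout: a Hash that maps each label to every position where it occurs. The class `PositionsIndex` is hypothetical and is not how Daru stores its index; it only shows that duplicates can be kept while lookups stay Hash-based, at the cost of returning an Array of positions for every label.

```ruby
# Hypothetical sketch, NOT Daru's implementation.
class PositionsIndex
  def initialize(labels)
    # label => [positions]; each label may occur any number of times.
    @positions = Hash.new { |hash, key| hash[key] = [] }
    labels.each_with_index { |label, i| @positions[label] << i }
  end

  # All row positions carrying the given label.
  def positions_of(label)
    @positions.fetch(label) { raise IndexError, "unknown label: #{label.inspect}" }
  end

  # True when no label is repeated.
  def unique?
    @positions.all? { |_label, positions| positions.size == 1 }
  end
end

index    = PositionsIndex.new([:k, :k, :k, :l, :l])
column_a = [101, 102, 103, 104, 105]
column_a.values_at(*index.positions_of(:k)) # => [101, 102, 103]
```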
@zverok CategoricalIndex is there to deal with duplicate indexes, so I think it's fine if we restrict Index and MultiIndex to be unique, but I can't agree with the definition that it should be unique, because the widely accepted view of an index is
"A database index is a data structure that improves the speed of data retrieval operations on a database table at the cost of additional writes and storage space to maintain the index data structure."(see https://en.wikipedia.org/wiki/Database_index)
which doesn't presume that an index should be unique.
Lokesh has a point. However, let's put off the uniqueness issue until someone comes up with a concrete use case.
"A database index is a data structure that improves the speed of data retrieval operations on a database table at the cost of additional writes and storage space to maintain the index data structure."
I don't believe "database index" is a good metaphor here: in that case it would be an auxiliary structure, added to the dataframe for easier access (and we could have 10 different indexes for different types of access).
A dataframe index is rather a set of unique names for the rows, as far as I can understand, and therefore https://en.wikipedia.org/wiki/Index_(publishing) is a better comparison.
I'm still struggling to understand why indexes more complex than sequential integers are really necessary for dataframes. Ideally, #where on a single vector should be as performant as any index lookup, especially since we're restricted to only one index per dataframe.
Ideally, #where on a single vector should be as performant as any index lookup, especially since we're restricted to only one index per dataframe.
@gnilrets We cannot make #where lookups as fast as index lookups without paying for additional updates and writes, which are expensive. This is the whole point of having an index: it gives us faster lookup, but at an additional cost.
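The trade-off can be sketched in plain Ruby, without Daru. The `rows` data and the `lookup` Hash below are made up for illustration: a scan needs no extra structure but walks the rows on every query, while the Hash answers in one step but has to be built once and kept in sync on every write.

```ruby
# Made-up data for illustration: [label, value] pairs.
rows = [[:k, 101], [:l, 102], [:m, 103]]

# Without an index: every query scans the rows until it finds a match.
scanned = rows.find { |label, _value| label == :l }

# With an index: build a label => position Hash once (one pass over the rows)...
lookup = rows.each_with_index.map { |(label, _value), i| [label, i] }.to_h

# ...then each query is a single Hash lookup.
indexed = rows[lookup.fetch(:l)]

# The additional cost: every write must also update the index.
rows << [:n, 104]
lookup[:n] = rows.size - 1
```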
@gnilrets Do you agree?
I'm still struggling with understanding why indexes more complex than sequential integers are really necessary for dataframes.
At the very least, because of "special" indexes (MultiIndex, which is easy to slice by part of a tuple, and DateTimeIndex, where you can query an entire year). I believe that the notion of Index, in the meaning we use it in Daru, comes from spreadsheets/accounting, and from typical tables that look like
| | Observation1 | Observation2 | Observation3 |
---|---|---|---|
Subject1 | value11 | value12 | value13 |
Subject2 | value21 | value22 | value23 |
Subject3 | value31 | value32 | value33 |
This is the typical way scientists think of data, I believe.
I think that if the index values are not unique, then Daru::Index must automatically switch to Daru::CategoricalIndex, just like Daru::Index returns Daru::MultiIndex when tuples are passed.
That is:
irb(main):012:0> Daru::Index.new([1,2,3])
=> #<Daru::Index(3): {1, 2, 3}>
irb(main):013:0> Daru::Index.new([[1,2,3], [2,3,4]])
=> #<Daru::MultiIndex(2x3)>
1 2 3
2 3 4
irb(main):014:0> Daru::Index.new([1,1,2,2,3,3])
=> #<Daru::Index(3): {1, 2, 3}> # this must be => #<Daru::CategoricalIndex(6): {1, 1, 2, 2, 3, 3}>
Wouldn't that be good?
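The dispatch proposed above could look roughly like the following sketch. It is plain Ruby with stand-in Structs, not Daru's classes: `UniqueIndex`, `TupleIndex`, `DuplicateIndex` and `build_index` are all hypothetical names used only to show the order of the checks (tuples first, then duplicates, then the plain case).

```ruby
# Hypothetical stand-ins for Daru::Index, Daru::MultiIndex and
# Daru::CategoricalIndex; none of this is Daru's actual code.
UniqueIndex    = Struct.new(:labels)
TupleIndex     = Struct.new(:labels)
DuplicateIndex = Struct.new(:labels)

def build_index(labels)
  if !labels.empty? && labels.all? { |label| label.is_a?(Array) }
    TupleIndex.new(labels)      # tuples were passed
  elsif labels.uniq.size < labels.size
    DuplicateIndex.new(labels)  # repeated labels
  else
    UniqueIndex.new(labels)     # plain, unique labels
  end
end

build_index([1, 2, 3]).class                # => UniqueIndex
build_index([[1, 2, 3], [2, 3, 4]]).class   # => TupleIndex
build_index([1, 1, 2, 2, 3, 3]).class       # => DuplicateIndex
```

This sketch leaves open what should happen when tuples themselves repeat; here they would still produce a `TupleIndex`.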
I am using CategoricalIndex when there is only one level and its labels left (and duplicate index values are present), see: https://github.com/SciRuby/daru/pull/340/files#diff-df0c816a5a6b82ab4d961bf9d1a0acbfR248
Shown at https://github.com/SciRuby/daru/pull/340
Problems: