SciRuby / daru

Data Analysis in RUby
BSD 2-Clause "Simplified" License

Unexpected behavior during vector assignment: df[0] = df[0] + 10 #496

Open gioele opened 5 years ago

gioele commented 5 years ago

DataFrame assignments behave in a surprising way when numerical indexes are used: instead of replacing the old vector, the assigned vector is added to the dataframe as a new column. This is in contrast with what happens when explicit names are used.

Take for example this dataframe:

df = Daru::DataFrame.new({ :a => [1,2,3,4], :b => [5,6,7,8] })
=> #<Daru::DataFrame(4x2)>
       a   b
   0   1   5
   1   2   6
   2   3   7
   3   4   8

Assigning df[0] will add a new vector to the dataframe instead of replacing the 0th column:

df[0] = df[0] + 10; df
=> #<Daru::DataFrame(4x3)>
       a   b   0
   0   1   5  11
   1   2   6  12
   2   3   7  13
   3   4   8  14

This is surprising, considering that assigning to df[:a] replaces the :a column as expected:

df[:a] = df[:a] + 10; df
=> #<Daru::DataFrame(4x2)>
       a   b
   0  11   5
   1  12   6
   2  13   7
   3  14   8

and that df[:a] and df[0] both return the same vector:

df[:a]
=> #<Daru::Vector(4)>
       a
   0  1
   1  2
   2  3
   3  4
df[0]
=> #<Daru::Vector(4)>
       a
   0  1
   1  2
   2  3
   3  4
kojix2 commented 5 years ago

Hello. I am a Daru beginner too, so I may be wrong, but I will try to answer.

I have been using Daru enthusiastically for the past two weeks, and I noticed that Daru differs from Pandas in two principles:

  1. Vectors (columns) should always take priority over rows.
  2. You should call the vector/row by name or index rather than number.

Daru is not a matrix calculation library. Dataframes focus on manipulating series by name. The importance of naming is part of Ruby's culture.

You should access vectors by name:

df["name_of_vector"]

or by column number:

df.at(1)

You can access rows by index name:

df.row["name_of_row"]

or by row number:

df.row_at(1)

Somehow df[column_number] and df.row[row_number] work too, but they are not the recommended way. The #[](*names) method is for names, not for column numbers.

This may be the reason for the strange behavior you described.
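
To make the distinction between names and positions concrete, here is a toy model in plain Ruby. It is only a sketch: ToyFrame and its methods are made up for illustration and are not Daru's implementation. It assumes that reading with [] falls back to a position when no column has that name, while writing with []= always treats the key as a name, which would be enough to reproduce the behavior reported in this issue.

```ruby
# Toy model only: ToyFrame is hypothetical and is NOT Daru's implementation.
class ToyFrame
  attr_reader :columns

  def initialize(columns)
    @columns = columns.dup # Hash of name => Array, kept in insertion order
  end

  # Read: try the key as a name first, then fall back to a position.
  def [](key)
    return @columns[key] if @columns.key?(key)
    return at(key) if key.is_a?(Integer)

    raise IndexError, "no column named #{key.inspect}"
  end

  # Write: the key is always treated as a name,
  # so an unknown key adds a new column.
  def []=(key, values)
    @columns[key] = values
  end

  # Positional read.
  def at(pos)
    @columns.values.fetch(pos)
  end

  # Positional write: find the name at that position, then replace it.
  def set_at(pos, values)
    @columns[@columns.keys.fetch(pos)] = values
  end
end

df = ToyFrame.new({ a: [1, 2, 3, 4], b: [5, 6, 7, 8] })
df[0] = df[0].map { |x| x + 10 } # reads :a by position, writes a column named 0
p df.columns.keys                # [:a, :b, 0]

df2 = ToyFrame.new({ a: [1, 2, 3, 4], b: [5, 6, 7, 8] })
df2.set_at(0, df2.at(0).map { |x| x + 10 }) # purely positional, replaces :a
p df2.columns.keys               # [:a, :b]
p df2.at(0)                      # [11, 12, 13, 14]
```

In this model the asymmetry comes entirely from the positional fallback existing on the read side only. Keeping positional access in separate methods (at and set_at in the toy) avoids the ambiguity.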

kojix2 commented 5 years ago
df = Daru::DataFrame.new({ :a => [1,2,3,4], :b => [5,6,7,8] })

df.set_at [0], (df.at(0) + 10)
df

This is probably the correct way to write it, but I feel like I am being told, "Do not refer to a vector by its column number."