SciRuby / daru

Data Analysis in RUby
BSD 2-Clause "Simplified" License
1.03k stars 139 forks source link

Loss of DF multi-index when joining a DF without multi-index #429

Open info-rchitect opened 6 years ago

info-rchitect commented 6 years ago

Hi,

I am seeing a multi-index DF lose its multi-index (goes to 0,1,2,3...) when doing a left join with a DF without a multi-index. Is that the expected behavior?

[6] pry(#<Nemawashi::Analytics::Transforms::Update>)> @source.index
=> #<Daru::MultiIndex(26x3)>
 2017W27 MYPROD1       A
 2017W28 MYPROD1       A
 2017W29 MYPROD1       A
 2017W30 MYPROD1       A
 2017W31 MYPROD1       A
 2017W32 MYPROD1       A
 2017W33 MYPROD1       A
 2017W34 MYPROD1       A
 2017W35 MYPROD1       A
 2017W36 MYPROD1       A
 2017W37 MYPROD1       A
 2017W38 MYPROD1       A
 2017W39 MYPROD1       A
 2017W40 MYPROD1       A
 2017W41 MYPROD1       A
 2017W42 MYPROD1       A
 2017W43 MYPROD1       A
 2017W44 MYPROD1       A
 2017W45 MYPROD1       A
 2017W46 MYPROD1       A
     ...     ...     ...
[7] pry(#<Nemawashi::Analytics::Transforms::Update>)> dataobj.index
=> #<Daru::Index(30): {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19 ... 29}>
[8] pry(#<Nemawashi::Analytics::Transforms::Update>)> updated_dataframe = @source.data.join(dataobj, how: :left, on: match_columns)
[9] pry(#<Nemawashi::Analytics::Transforms::Update>)> updated_dataframe.index
=> #<Daru::Index(26): {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19 ... 25}>

thx

zverok commented 6 years ago

I believe it is a bug, but can you please provide a self-containing example for easier reproduction?

info-rchitect commented 6 years ago

@zverok sorry for the delay, will get to this today.

info-rchitect commented 6 years ago

OK here we go:

[1] pry(#<Nemawashi::Analytics::Transforms::Update>)> @source.data
=> #<Daru::DataFrame(4x7)>
                unit testA_ft1 testA_ws1 testB_ft1 testB_ws1 testC_ft1 testC_ws1
         1         1       nil       0.0       nil       0.0       nil       1.0
         2         2       0.0       nil       1.0       nil       0.0       nil
         3         3       nil       0.0       nil       0.0       nil       1.0
         4         4       0.0       nil       1.0       nil       0.0       nil
[2] pry(#<Nemawashi::Analytics::Transforms::Update>)> dataobj
=> #<Daru::DataFrame(4x4)>
                unit testA_ws1 testD_ws1 testE_ws1
         1         1       1.0       0.0       1.0
         2         2       0.0       1.0       0.0
         3         3       1.0       1.0       0.0
         4         4       0.0       1.0       0.0
[3] pry(#<Nemawashi::Analytics::Transforms::Update>)> @source.data.index
=> #<Daru::MultiIndex(4x1)>
   1
   2
   3
   4
[4] pry(#<Nemawashi::Analytics::Transforms::Update>)> dataobj.index
=> #<Daru::MultiIndex(4x1)>
   1
   2
   3
   4
[5] pry(#<Nemawashi::Analytics::Transforms::Update>)> updated_dataframe = @source.data.join(dataobj, how: :left, on: match_columns)
=> #<Daru::DataFrame(4x10)>
             testA_ft1 testA_ws1_  testB_ft1  testB_ws1  testC_ft1  testC_ws1       unit testA_ws1_  testD_ws1  testE_ws1
          0        nil        0.0        nil        0.0        nil        1.0          1        1.0        0.0        1.0
          1        0.0        nil        1.0        nil        0.0        nil          2        0.0        1.0        0.0
          2        nil        0.0        nil        0.0        nil        1.0          3        1.0        1.0        0.0
          3        0.0        nil        1.0        nil        0.0        nil          4        0.0        1.0        0.0
[6] pry(#<Nemawashi::Analytics::Transforms::Update>)> updated_dataframe.index
=> #<Daru::Index(4): {0, 1, 2, 3}>

In reality, it seems like any join between a multiindex DF and any type of indexed DF results in a Daru::Index DF. I believe this is related to the merge issue.

zverok commented 6 years ago

Please check if it works on current master should be solved by #431

info-rchitect commented 6 years ago

@zverok I am still seeing the join method turn 2 DFs with multiindex into a single index but my example perhaps is not the best due to the fact that it looks like daru is actually 'optimizing' away the multi-index.

[1] pry(#<Nemawashi::Analytics::Transforms::Update>)> @source.data.index.to_a
=> [[1], [2], [3], [4]]
[2] pry(#<Nemawashi::Analytics::Transforms::Update>)> dataobj.index.to_a
=> [[1], [2], [3], [4]]
[3] pry(#<Nemawashi::Analytics::Transforms::Update>)>
    60:           binding.pry
    61:           updated_dataframe = @source.data.join(dataobj, how: :left, on: match_columns)
 => 62:           binding.pry
[2] pry(#<Nemawashi::Analytics::Transforms::Update>)> updated_dataframe.index.to_a
=> [0, 1, 2, 3]

Is this intended behavior? I will try to find a better example where the pre-join index cannot be flattened to be the same as the post-join index.

zverok commented 6 years ago

Well, in fact, currently:

I have a suspicion that the most reasonable behavior would be "make join columns an index", but not sure it should/would/can work (see "duplicate index discussion" here ;)).

Anyways, I am pretty sure your initial request (preserving index of both dataframes on join) can't be performed. Why do you believe otherwise?

info-rchitect commented 6 years ago

@zverok if two dataframes have the same index shouldn't the index be preserved?

zverok commented 6 years ago

But why it should? Logically, "joining" is merging of different data sources: people (indexed by name) join departments (indexed by id), books (indexed by isbn) join authors (indexed by name)... What is the real-life use case for two dataframes that should be joined, but have the same index?

info-rchitect commented 6 years ago

@zverok so let's say my multi-index is [:work_week, :factory] and i pull yield data from one database and I pull volume forecast data from another database that uses the same multi-index of [:work_week, :factory] and want them to be joined. why should I lose the matching multi-index on the subsequently created dataframe?

zverok commented 6 years ago

By which column you join them, then?

info-rchitect commented 6 years ago

those 2 columns.

zverok commented 6 years ago

So, you join them on index?