cardillo / joinery

Data frames for Java
https://joinery.sh
GNU General Public License v3.0

extending join function #109

Open johnlak opened 2 years ago

johnlak commented 2 years ago

Hi

This is a proposal for one change request and one feature request for the DataFrame join.

  1. Modify joinOn to work with columns that have the same name but sit at different positions in the two data frames

Test code joining on column name "AA":

import java.util.Arrays;
import joinery.DataFrame;

// df1 has the key column "AA" at position 1
DataFrame<String> df1 = new DataFrame<String>();
df1.add("DD","AA","BB");
df1.append(Arrays.asList("d1","a1","b1"));
df1.append(Arrays.asList("d2","a2","b2"));
df1.append(Arrays.asList("d3","a3","b4"));

// df2 has the key column "AA" at position 0
DataFrame<String> df2 = new DataFrame<String>();
df2.add("AA","CC");
df2.append(Arrays.asList("a1","c1"));
df2.append(Arrays.asList("a2","c2"));
df2.append(Arrays.asList("a4","c4"));

System.out.println(df1.joinOn(df2, DataFrame.JoinType.OUTER, "AA").resetIndex().toString());

Output with the released 1.10 version: no rows are matched, because AA is at position 1 in df1 and at position 0 in df2

    DD  AA_left BB  AA_right    CC
 0  d1  a1      b1                
 1  d2  a2      b2                
 2  d3  a3      b4                
 3                  a1          c1
 4                  a2          c2
 5                  a4          c4

output with proposed change:

    DD  AA_left BB  AA_right    CC
 0  d1  a1      b1  a1          c1
 1  d2  a2      b2  a2          c2
 2  d3  a3      b4                
 3                  a4          c4
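
To illustrate the idea behind this change, here is a minimal sketch that resolves the key column by name separately in each frame's header. It uses plain java.util lists rather than joinery's internals; the class KeyPositionSketch and the method positionOf are hypothetical names for illustration, not part of joinery.

```java
import java.util.Arrays;
import java.util.List;

class KeyPositionSketch {
    // Hypothetical helper: resolve a column name to its position
    // within one frame's own list of column names.
    static int positionOf(List<String> columns, String name) {
        int idx = columns.indexOf(name);
        if (idx < 0) {
            throw new IllegalArgumentException("no such column: " + name);
        }
        return idx;
    }

    public static void main(String[] args) {
        List<String> leftCols = Arrays.asList("DD", "AA", "BB");
        List<String> rightCols = Arrays.asList("AA", "CC");
        // The same name resolves to a different position in each frame.
        System.out.println(positionOf(leftCols, "AA"));  // prints 1
        System.out.println(positionOf(rightCols, "AA")); // prints 0
    }
}
```

Looking up the position once per frame, rather than reusing one index for both, is what lets the key values be compared even when the column layouts differ.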

  2. Add a join on columns with different names

Test code joining df3 column "AA1" to df4 column "AA2" (using the proposed change):

DataFrame<String> df3 = new DataFrame<String>();
df3.add("DD","AA1","BB");
df3.append(Arrays.asList("d1","a1","b1"));
df3.append(Arrays.asList("d2","a2","b2"));
df3.append(Arrays.asList("d3","a3","b4"));

DataFrame<String> df4 = new DataFrame<String>();
df4.add("AA2","CC");
df4.append(Arrays.asList("a1","c1"));
df4.append(Arrays.asList("a2","c2"));
df4.append(Arrays.asList("a4","c4"));

System.out.println(df3.joinOn(df4, new String[] {"AA1"}, new String[] {"AA2"}, DataFrame.JoinType.OUTER).resetIndex().toString());

output with proposed change:

    DD  AA1 BB  AA2 CC
 0  d1  a1  b1  a1  c1
 1  d2  a2  b2  a2  c2
 2  d3  a3  b4        
 3              a4  c4
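
The semantics requested here can be sketched outside joinery as a full outer join over plain lists, with a separate key column name for each side. This is only an illustration under stated assumptions: OuterJoinSketch and outerJoin are hypothetical names, missing cells are represented as null, and key values on the right side are assumed to be unique.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

class OuterJoinSketch {
    // Hypothetical helper: full outer join of two row lists on differently named keys.
    static List<List<String>> outerJoin(
            List<String> leftCols, List<List<String>> leftRows, String leftKey,
            List<String> rightCols, List<List<String>> rightRows, String rightKey) {
        int li = leftCols.indexOf(leftKey);
        int ri = rightCols.indexOf(rightKey);
        if (li < 0 || ri < 0) {
            throw new IllegalArgumentException("unknown key column");
        }
        // Index right rows by key value (assumes right keys are unique).
        Map<String, List<String>> rightByKey = new LinkedHashMap<>();
        for (List<String> row : rightRows) {
            rightByKey.put(row.get(ri), row);
        }
        List<List<String>> out = new ArrayList<>();
        Set<String> matched = new HashSet<>();
        // Emit every left row, padded with nulls when no right row matches.
        for (List<String> row : leftRows) {
            List<String> joined = new ArrayList<>(row);
            List<String> match = rightByKey.get(row.get(li));
            if (match != null) {
                joined.addAll(match);
                matched.add(row.get(li));
            } else {
                joined.addAll(Collections.<String>nCopies(rightCols.size(), null));
            }
            out.add(joined);
        }
        // Emit right rows that no left row matched, padded on the left.
        for (Map.Entry<String, List<String>> e : rightByKey.entrySet()) {
            if (!matched.contains(e.getKey())) {
                List<String> joined =
                        new ArrayList<>(Collections.<String>nCopies(leftCols.size(), null));
                joined.addAll(e.getValue());
                out.add(joined);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<List<String>> rows = outerJoin(
                Arrays.asList("DD", "AA1", "BB"),
                Arrays.asList(
                        Arrays.asList("d1", "a1", "b1"),
                        Arrays.asList("d2", "a2", "b2"),
                        Arrays.asList("d3", "a3", "b4")),
                "AA1",
                Arrays.asList("AA2", "CC"),
                Arrays.asList(
                        Arrays.asList("a1", "c1"),
                        Arrays.asList("a2", "c2"),
                        Arrays.asList("a4", "c4")),
                "AA2");
        for (List<String> row : rows) {
            System.out.println(row);
        }
    }
}
```

With the df3 and df4 data above, this yields four rows in the same order as the table: two matched rows, one left-only row and one right-only row.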

Kind regards,
John