cardillo / joinery

Data frames for Java
https://joinery.sh
GNU General Public License v3.0
702 stars, 167 forks

How could I join two data frames by specifying columns from each data frame? #79

Open DareUrDream opened 5 years ago

DareUrDream commented 5 years ago

Hi,

I have two data frames with no common columns. How can I join them? I want to join them on each column in turn, one after another, to produce various outputs.

Is it possible to join two DFs by specifying the mapping column(s) from each DF?

Cheers, DareUrDream

cardillo commented 5 years ago

Try using the joinOn method with the column names for which the values match. Alternatively, you can use the joinOn method with a function that computes the join key.
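To make the idea of a join key concrete, here is a minimal sketch in plain Java collections. It is not joinery's implementation or API: the class `KeyJoinSketch`, the helpers `leftJoin` and `row`, and the sample rows are all hypothetical. It shows a left join in which each side supplies its own key function, so the key columns do not need to share a name.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical illustration only: plain Java collections, not the joinery API.
class KeyJoinSketch {

    // Left-joins two lists of rows. Each side computes its join key with its
    // own function, so the key columns may be named differently on each side.
    static List<Map<String, Object>> leftJoin(
            List<Map<String, Object>> left,
            List<Map<String, Object>> right,
            Function<Map<String, Object>, Object> leftKey,
            Function<Map<String, Object>, Object> rightKey) {
        // Index the right side by key. This sketch assumes right-side keys are
        // unique and fails loudly when they are not.
        Map<Object, Map<String, Object>> index = new HashMap<>();
        for (Map<String, Object> row : right) {
            Object key = rightKey.apply(row);
            if (index.put(key, row) != null) {
                throw new IllegalArgumentException("duplicate key on right side: " + key);
            }
        }
        List<Map<String, Object>> joined = new ArrayList<>();
        for (Map<String, Object> row : left) {
            Map<String, Object> out = new LinkedHashMap<>(row);
            Map<String, Object> match = index.get(leftKey.apply(row));
            if (match != null) {
                // Right-side columns overwrite left-side columns of the same name.
                out.putAll(match);
            }
            // Unmatched left rows are kept with their own columns only.
            joined.add(out);
        }
        return joined;
    }

    // Builds an insertion-ordered row from alternating column names and values.
    static Map<String, Object> row(Object... namesAndValues) {
        Map<String, Object> r = new LinkedHashMap<>();
        for (int i = 0; i + 1 < namesAndValues.length; i += 2) {
            r.put(String.valueOf(namesAndValues[i]), namesAndValues[i + 1]);
        }
        return r;
    }

    public static void main(String[] args) {
        // Made-up sample data; the column names are placeholders.
        List<Map<String, Object>> events = Arrays.asList(
                row("agentid", 1, "eventtype", "LOGIN"),
                row("agentid", 9, "eventtype", "READY"));
        List<Map<String, Object>> resources = Arrays.asList(
                row("resourceid", 1, "resourcename", "alice"),
                row("resourceid", 2, "resourcename", "bob"));
        List<Map<String, Object>> joined = leftJoin(
                events, resources, r -> r.get("agentid"), r -> r.get("resourceid"));
        for (Map<String, Object> r : joined) {
            System.out.println(r);
        }
    }
}
```

Passing a key function per side avoids renaming columns up front, at the cost of the caller having to make sure both functions produce comparable values (for example, both `Integer` rather than one `Integer` and one `String`).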

DareUrDream commented 5 years ago

Hi @cardillo,

For the time being I have achieved this by renaming columns in one of the data sets, but I have now hit another bottleneck. The stack trace is below. I am not sure how to prepare a unique key so that the join works.

resource.txt agentstatedetail1m_copy.txt

Stack trace

Exception in thread "main" java.lang.IllegalArgumentException: generated key is not unique: [3]
    at joinery.impl.Combining.join(Combining.java:45)
    at joinery.impl.Combining.joinOn(Combining.java:102)
    at joinery.DataFrame.joinOn(DataFrame.java:730)
    at joinery.DataFrame.joinOn(DataFrame.java:756)
    at com.cisco.evaluate.joinery.JoineryTestMain.startEvaluation(JoineryTestMain.java:37)
    at com.cisco.evaluate.joinery.JoineryTestMain.main(JoineryTestMain.java:18)

Code Below

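`// Load the resource data and keep only the columns of interest.
    DataFrame<Object> rsrcDf = DataFrame.readCsv(ClassLoader.getSystemResourceAsStream("resource.csv"))
            .retain("resourceid", "resourceloginid", "resourcename", "resourcegroupid", "extension",
                    "resourceskillmapid", "assignedteamid", "resourcefirstname", "resourcelastname");

    // Load the agent state detail data and keep only the agent id and event type.
    DataFrame<Object> asdDf = DataFrame.readCsv(ClassLoader.getSystemResourceAsStream("agentstatedetail1m_copy.csv"))
            .retain("agentid", "eventtype");

    // Rename the key column so that both frames share the name "resourceid".
    asdDf = asdDf.rename("agentid", "resourceid");

    // This is the call that throws the IllegalArgumentException shown above.
    DataFrame<Object> joinedDf = asdDf.joinOn(rsrcDf, JoinType.LEFT, "resourceid");
    System.out.println("Final row count: " + joinedDf.length());`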

Cheers, DareUrDream
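Taken at face value, the message "generated key is not unique: [3]" in the stack trace above suggests that the key value 3 occurs in more than one row of one of the frames being joined. As a diagnostic aid, here is a minimal plain-Java sketch that lists which values of a key column occur more than once. It does not use joinery; `DuplicateKeySketch`, `duplicateKeys`, and the sample values are hypothetical, and the key values would need to be extracted from the real column first.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical diagnostic helper: plain Java, not part of joinery.
class DuplicateKeySketch {

    // Returns each key that occurs more than once, mapped to its number of
    // occurrences, in order of first appearance.
    static Map<Object, Integer> duplicateKeys(List<?> keys) {
        Map<Object, Integer> counts = new LinkedHashMap<>();
        for (Object key : keys) {
            counts.merge(key, 1, Integer::sum);
        }
        // Drop keys that appear exactly once; only duplicates remain.
        counts.values().removeIf(n -> n < 2);
        return counts;
    }

    public static void main(String[] args) {
        // Made-up key values standing in for a join column.
        List<Object> agentIds = Arrays.asList(1, 3, 2, 3, 3);
        System.out.println("Duplicate keys: " + duplicateKeys(agentIds));
        // prints: Duplicate keys: {3=3}
    }
}
```

If the result is non-empty for the join column, the rows sharing a key would have to be aggregated, de-duplicated, or keyed on a combination of columns before a join that requires unique keys can succeed.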