Gmousse / dataframe-js

No Maintenance Intended
https://gmousse.gitbooks.io/dataframe-js/
MIT License

[BUG] Inconsistent left join results #123

Open earlmedina opened 3 years ago

earlmedina commented 3 years ago

Describe the bug

I've been porting some logic from pandas to dataframe-js and came across an inconsistency in left joins (I suspect right joins are affected as well). dataframe-js does not produce consistent left joins: it sometimes introduces duplicate rows that should not be in the join result. In my case, the join adds approximately 50 duplicate rows, which makes the result unusable for my purposes unless I drop duplicates.

To Reproduce

Steps to reproduce the behavior: run the sample code below. The join result should have 7 rows, but I get 8 rows. In this example, the row with A: 1 is duplicated.

Note that column "A" is the join key. dfA has 7 records and dfB has 4 records, and there are no duplicate A values in dfB.

      // Assumes dataframe-js is installed (npm install dataframe-js).
      const { DataFrame } = require('dataframe-js');

      // Left side: 7 rows, A values 1 through 7, no duplicates.
      const jsonDataA = [{ A: 1, B: 4.28283, C: -1.509, D: -1.1352 },
                  { A: 2, B: -0.22863, C: -3.39059, D: 1.1632 },
                  { A: 3, B: -0.82863, C: -1.5059, D: 2.1352 },
                  { A: 4, B: -1.28863, C: 4.5059, D: 4.1632 },
                  { A: 5, B: -1.28863, C: 4.5059, D: 4.1632 },
                  { A: 6, B: -1.28863, C: 4.5059, D: 4.1632 },
                  { A: 7, B: -1.28863, C: 4.5059, D: 4.1632 }];

      // Right side: 4 rows, A values 1 through 4, no duplicates.
      const jsonDataB = [{ A: 1, xb: 4.28283, B: null, C: -1.509, D: -1.1352 },
                  { A: 2, xb: null, B: -0.22863, C: -3.39059, D: 1.1632 },
                  { A: 3, xb: null, B: -0.82863, C: -1.5059, D: 2.1352 },
                  { A: 4, xb: null, B: -1.28863, C: 4.5059, D: 4.1632 }];

      const dfA = new DataFrame(jsonDataA);
      const dfB = new DataFrame(jsonDataB);

      // Left join on column "A": 7 rows expected, 8 rows observed.
      const dfC = dfA.join(
        dfB,
        "A",
        "left"
      );
      console.log('TEST', dfC);
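
To confirm which key values are duplicated, a small check along these lines may help. It is only a sketch in plain JavaScript: `findDuplicateKeys` is a hypothetical helper, not part of dataframe-js, and the `reportedKeys` rows only mimic the shape of the result described above (8 rows, A: 1 twice); their order is illustrative.

```javascript
// Hypothetical helper (not part of dataframe-js): returns the key values
// that occur more than once in an array of plain row objects.
function findDuplicateKeys(rows, key) {
  const counts = new Map();
  for (const row of rows) {
    counts.set(row[key], (counts.get(row[key]) || 0) + 1);
  }
  // Keep only the key values seen more than once.
  return [...counts].filter(([, n]) => n > 1).map(([value]) => value);
}

// Key values of the right-hand side from the sample above.
const rightKeys = [1, 2, 3, 4].map((A) => ({ A }));
const rightDuplicates = findDuplicateKeys(rightKeys, 'A');
console.log(rightDuplicates);

// Rows shaped like the reported join result: 8 rows, with A: 1 twice.
const reportedKeys = [1, 1, 2, 3, 4, 5, 6, 7].map((A) => ({ A }));
const reportedDuplicates = findDuplicateKeys(reportedKeys, 'A');
console.log(reportedDuplicates);
```

Assuming `dfC.toCollection()` returns the joined rows as plain objects, the same helper could be applied to it directly.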


Expected behavior

The left join should produce a DataFrame with 7 rows, one for each row of dfA, but the result contains a duplicate row.
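
For reference, the row count I expect can be sketched with a plain JavaScript left join. This is not how dataframe-js implements `join`; `leftJoin` is a hypothetical helper, and the data is reduced to the key column plus one column per side so that overlapping column names (B, C, D in the real sample) do not come into play.

```javascript
// Hypothetical reference left join over arrays of plain objects.
// Not dataframe-js code; it only illustrates the expected row count.
function leftJoin(leftRows, rightRows, key) {
  // Index the right rows by key value; a key may map to several rows.
  const index = new Map();
  for (const row of rightRows) {
    if (!index.has(row[key])) index.set(row[key], []);
    index.get(row[key]).push(row);
  }

  const result = [];
  for (const left of leftRows) {
    const matches = index.get(left[key]);
    if (!matches) {
      // No match: keep the left row as-is (right columns stay absent).
      result.push({ ...left });
    } else {
      // One output row per matching right row.
      for (const right of matches) result.push({ ...left, ...right });
    }
  }
  return result;
}

// Reduced version of the sample: 7 left rows, 4 right rows, unique keys.
const leftSample = [1, 2, 3, 4, 5, 6, 7].map((A) => ({ A, B: A * 10 }));
const rightSample = [1, 2, 3, 4].map((A) => ({ A, xb: A * 100 }));

const joined = leftJoin(leftSample, rightSample, 'A');
console.log(joined.length);
```

Because every key in the right-hand side is unique, each left row matches at most one right row, so the result has exactly as many rows as the left-hand side.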

Desktop (please complete the following information):