Texera / texera

Collaborative Machine-Learning-Centric Data Analytics Using Workflows
https://texera.github.io
Apache License 2.0

HashJoin Anomaly when joining outputs of two Python UDFs #2402

Open bobbai00 opened 6 months ago

bobbai00 commented 6 months ago

I have two Python UDFs. The output columns of the first one are:

Column: total_mortality_RR_2020  state   state_num
Type:   double                   string  number

The output columns of the second one are:

Column: total_mortality_RR_2019  state   state_num
Type:   double                   string  number

Before I connect them to the HashJoin, their outputs are:

Screenshot 2024-02-22 at 2 38 41 PM

After I connect them directly to the HashJoin (inner join on state_num, of integer type):

Screenshot 2024-02-22 at 2 36 49 PM

The tuple mapping for the HashJoin operator is 0 -> 0, and the tuple mapping of one of the Python UDFs becomes 50 -> 25 instead of 50 -> 50.
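For reference, here is a minimal sketch of what I would expect an inner hash join on state_num to produce. It is plain Python, not Texera's HashJoin implementation; the function `inner_hash_join` and the sample rows (50 rows per side, one per state_num, with placeholder values) are hypothetical stand-ins for the two UDF outputs. The last part shows one possible thing to rule out, not a diagnosis: if the two sides carried the key with different types, no keys would match.

```python
from collections import defaultdict


def inner_hash_join(build_rows, probe_rows, key):
    """Inner equi-join of two lists of dicts on `key` using a hash table."""
    # Build phase: index the build side by join-key value.
    table = defaultdict(list)
    for row in build_rows:
        table[row[key]].append(row)
    # Probe phase: emit one merged row per matching pair.
    joined = []
    for row in probe_rows:
        for match in table.get(row[key], []):
            joined.append({**match, **row})
    return joined


# Hypothetical stand-ins for the two UDF outputs (placeholder values).
rows_2020 = [
    {"total_mortality_RR_2020": 1.0, "state": f"S{i}", "state_num": i}
    for i in range(1, 51)
]
rows_2019 = [
    {"total_mortality_RR_2019": 1.0, "state": f"S{i}", "state_num": i}
    for i in range(1, 51)
]

result = inner_hash_join(rows_2020, rows_2019, "state_num")
print(len(result))  # 50: every state_num finds exactly one partner

# Assumption for illustration only: one side carries the key as a string.
rows_2019_str = [{**r, "state_num": str(r["state_num"])} for r in rows_2019]
mismatched = inner_hash_join(rows_2020, rows_2019_str, "state_num")
print(len(mismatched))  # 0: an int key never equals a str key
```

So with 50 input tuples per side and identical key sets, I would expect 50 joined tuples from the inner join, not 0.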

If I introduce two type-cast operators that convert the state_num column from integer to string:

Screenshot 2024-02-22 at 2 43 55 PM

the HashJoin has the pair 50 -> 0.

When I change the join type from inner join to full outer join, the pair becomes 50 -> 50, but the result is not correct:

Screenshot 2024-02-22 at 2 48 18 PM Screenshot 2024-02-22 at 2 49 21 PM
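As a sanity check on what a correct full outer join should look like, here is a small sketch in plain Python. Again, this is not Texera's code; `full_outer_hash_join` and the sample rows are hypothetical. When both sides have the same 50 keys, the full outer join should equal the inner join: 50 rows, with no null-padded columns. Null padding and a row count above 50 would only appear if some keys failed to match.

```python
def full_outer_hash_join(left_rows, right_rows, key, left_only_cols, right_only_cols):
    """Full outer equi-join of two lists of dicts on `key`."""
    # Index the left side by join-key value.
    table = {}
    for row in left_rows:
        table.setdefault(row[key], []).append(row)

    matched_left = set()  # ids of left rows that found a partner
    out = []
    for r in right_rows:
        partners = table.get(r[key], [])
        if partners:
            for l in partners:
                matched_left.add(id(l))
                out.append({**l, **r})
        else:
            # Unmatched right row: pad the left-only columns with None.
            out.append({**{c: None for c in left_only_cols}, **r})
    # Unmatched left rows: pad the right-only columns with None.
    for l in left_rows:
        if id(l) not in matched_left:
            out.append({**l, **{c: None for c in right_only_cols}})
    return out


# Hypothetical stand-ins for the two UDF outputs (placeholder values).
left = [
    {"total_mortality_RR_2020": 1.0, "state": f"S{i}", "state_num": i}
    for i in range(1, 51)
]
right = [
    {"total_mortality_RR_2019": 1.0, "state": f"S{i}", "state_num": i}
    for i in range(1, 51)
]

full = full_outer_hash_join(
    left, right, "state_num",
    left_only_cols=["total_mortality_RR_2020"],
    right_only_cols=["total_mortality_RR_2019"],
)
print(len(full))  # 50
print(any(v is None for row in full for v in row.values()))  # False
```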

The workflow link is: https://texera.ics.uci.edu/workflow/1642

Yicong-Huang commented 6 months ago

@bobbai00 can you test it again on the current master, to see if we still have this issue?