[Bug]: "Unique rows (HashSet)" drops records that appear only once

gertwieland commented 4 weeks ago

Apache Hop version?

2.8

Java version?

openjdk version "11.0.21" 2023-10-17

Operating system

Windows

What happened?

"Unique rows (HashSet)" seems to drop records even if they appear only once. Steps to reproduce the error:

Generate 60k records, then add a sequence and one column with random fake data.

Then calculate a SHA256 checksum over it. Since it includes the sequence number from 1 to 60k, those checksums must all be unique.

But still, the "Unique rows (HashSet)" seems to consider one row a duplicate, and only returns 59,999 records.
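As a sanity check of the expectation above, here is a minimal Java sketch that mimics the steps outside of Hop: it builds 60k rows from a sequence number plus some random filler, computes a SHA-256 checksum per row, and counts the distinct checksums. The class name, the `|` separator, and the `name-N` filler are made up for illustration and are not taken from the attached pipeline.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashSet;
import java.util.Random;
import java.util.Set;

// Hypothetical stand-in for the pipeline: sequence + fake data -> SHA-256 -> distinct count.
class ChecksumUniquenessCheck {

  // Returns the SHA-256 digest of the input as a lowercase hex string.
  static String sha256Hex(String input) {
    try {
      MessageDigest md = MessageDigest.getInstance("SHA-256");
      byte[] digest = md.digest(input.getBytes(StandardCharsets.UTF_8));
      StringBuilder hex = new StringBuilder(digest.length * 2);
      for (byte b : digest) {
        hex.append(Character.forDigit((b >> 4) & 0xF, 16));
        hex.append(Character.forDigit(b & 0xF, 16));
      }
      return hex.toString();
    } catch (NoSuchAlgorithmException e) {
      // SHA-256 is required to be available on every Java platform.
      throw new IllegalStateException(e);
    }
  }

  // Builds `rows` records and counts how many distinct checksums they produce.
  static int countDistinctChecksums(int rows) {
    Random random = new Random();
    Set<String> checksums = new HashSet<>();
    for (int seq = 1; seq <= rows; seq++) {
      String fakeData = "name-" + random.nextInt(1000); // assumed filler column
      checksums.add(sha256Hex(seq + "|" + fakeData));
    }
    return checksums.size();
  }

  public static void main(String[] args) {
    // Every input contains a different sequence number, so this should print 60000.
    System.out.println(countDistinctChecksums(60_000));
  }
}
```

Because the full checksum strings are stored in the set, this count is 60,000 on every run, which is the result the transform is expected to return as well.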

Test pipeline attached: Unique_Hash_Faulty.zip

Issue Priority

Priority: 3

Issue Component

Component: Hop Gui

DAJGIT commented 3 weeks ago

I could reproduce this case after several runs. While trying to catch the duplicate record, I found the option "Compare using stored row values". Enabling it seems to solve this case.
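One possible explanation, not confirmed in this thread, is that without "Compare using stored row values" the transform remembers only a hash code per row, so two different rows whose hash codes collide look like duplicates. The Java sketch below illustrates that general effect with plain `String.hashCode()`; it is not Hop's actual implementation, and the class and method names are hypothetical. It also estimates, under the assumption of a uniformly distributed 32-bit hash, how likely at least one collision is among 60,000 rows.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustration only: hash-only deduplication versus deduplication on stored values.
class HashOnlyDedupSketch {

  // Remembers only the 32-bit hash of each key, so colliding keys count as duplicates.
  static int countUniqueByHashOnly(List<String> keys) {
    Set<Integer> seenHashes = new HashSet<>();
    int unique = 0;
    for (String key : keys) {
      if (seenHashes.add(key.hashCode())) {
        unique++;
      }
    }
    return unique;
  }

  // Remembers the keys themselves, so equality is checked on the real values.
  static int countUniqueByStoredValues(List<String> keys) {
    Set<String> seenValues = new HashSet<>();
    int unique = 0;
    for (String key : keys) {
      if (seenValues.add(key)) {
        unique++;
      }
    }
    return unique;
  }

  // Birthday approximation: chance of at least one collision among n uniform hashes.
  static double collisionProbability(long n, int hashBits) {
    double pairs = n * (n - 1) / 2.0;
    return 1.0 - Math.exp(-pairs / Math.pow(2, hashBits));
  }

  public static void main(String[] args) {
    // "Aa" and "BB" are different strings that share the hash code 2112.
    List<String> keys = List.of("Aa", "BB", "Cc");
    System.out.println(countUniqueByHashOnly(keys));     // prints 2
    System.out.println(countUniqueByStoredValues(keys)); // prints 3
    System.out.printf("%.2f%n", collisionProbability(60_000, 32)); // about 0.34
  }
}
```

If the hash-only assumption holds, a collision in roughly one run out of three would fit the observation that the problem shows up only after several runs, and it would explain why comparing stored row values avoids it at the cost of more memory.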

DAJGIT commented 3 weeks ago

I kept digging for a reproduction path, and here it is:

Unique_Hash_Faulty_Sample.zip

hansva commented 3 weeks ago

.take-issue
