abhishek0007 / duke

Automatically exported from code.google.com/p/duke

Limited result when I use --noreindex in RecordLinkageMode #114

Closed GoogleCodeExporter closed 8 years ago

GoogleCodeExporter commented 8 years ago
What steps will reproduce the problem?
1. Set up two data sources. In my case the first has 600,000 items and the 
second has 2 items.
2. Without the --noreindex flag I see 2 matches. With it I see 60 matches.

What is the expected output? What do you see instead?

In all cases I expect 60 matches.

What version of the product are you using? On what operating system?
1.0

Please provide any additional information below.

Original issue reported on code.google.com by vsv2711@gmail.com on 8 Apr 2013 at 5:15

GoogleCodeExporter commented 8 years ago
In the second step I meant the reverse: I get 2 matches with --noreindex and 60 without it.

Original comment by vsv2711@gmail.com on 8 Apr 2013 at 5:21

GoogleCodeExporter commented 8 years ago
I have reproduced this now. Running the countries example I get 203 matches. 
Running it again with --noreindex I get 195 matches. It's 100% consistent every 
time I run it.

Original comment by lar...@gmail.com on 12 Apr 2013 at 6:47

GoogleCodeExporter commented 8 years ago
It seems like the difference is caused by the "matchall" parameter to 
Processor.linkRecords.

Basically, if you don't use --noreindex each record can match more than one 
record. If you do use --noreindex, each record can match only one record.

That seems to be consistent with your tests: With --noreindex you get one match 
for each of your two records. Without it you get on average 30 matches for each.

The solution here is clearly to let you control which behaviour you want, and 
to stop --noreindex interfering with that choice.
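
To make the two behaviours concrete, here is a minimal, self-contained sketch. It is not Duke's actual code: the class MatchModeSketch, the method link, and the score matrix are all hypothetical and exist only to illustrate what a "matchall"-style switch does to the number of links reported.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; none of these names come from Duke.
public class MatchModeSketch {

    // One reported link: query record index, candidate record index, similarity.
    static final class Link {
        final int query;
        final int candidate;
        final double score;

        Link(int query, int candidate, double score) {
            this.query = query;
            this.candidate = candidate;
            this.score = score;
        }
    }

    // scores[q][c] is an assumed, precomputed similarity between query record q
    // and candidate record c. With matchAll every candidate at or above the
    // threshold is reported; without it only the best one per query record is.
    static List<Link> link(double[][] scores, double threshold, boolean matchAll) {
        List<Link> links = new ArrayList<>();
        for (int q = 0; q < scores.length; q++) {
            Link best = null;
            for (int c = 0; c < scores[q].length; c++) {
                double s = scores[q][c];
                if (s < threshold) {
                    continue; // below threshold: never a match in either mode
                }
                Link candidate = new Link(q, c, s);
                if (matchAll) {
                    links.add(candidate);
                } else if (best == null || s > best.score) {
                    best = candidate;
                }
            }
            if (!matchAll && best != null) {
                links.add(best);
            }
        }
        return links;
    }

    public static void main(String[] args) {
        // Two query records, three candidates each (made-up scores).
        double[][] scores = {
            {0.95, 0.90, 0.40},
            {0.20, 0.85, 0.99}
        };
        System.out.println("match all:    " + link(scores, 0.8, true).size());
        System.out.println("single match: " + link(scores, 0.8, false).size());
    }
}
```

With these made-up scores the first mode reports every candidate above the threshold for each of the two records, while the second reports exactly one link per record, which is the same pattern as the 60 versus 2 matches in the report.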

Original comment by lar...@gmail.com on 12 Apr 2013 at 6:52

GoogleCodeExporter commented 8 years ago
This issue was closed by revision e8f147f9d19c.

Original comment by lar...@gmail.com on 12 Apr 2013 at 7:05

GoogleCodeExporter commented 8 years ago
If you pull from Mercurial and rebuild you'll see 60 matches when running with 
--noreindex, and when running without. If you add --singlematch you'll see 2, 
regardless of whether you use --noreindex or not.

Original comment by lar...@gmail.com on 12 Apr 2013 at 7:08

GoogleCodeExporter commented 8 years ago
Thanks Lars!
This is quite a major bug. Maybe you should bump the version?

Original comment by vsv2711@gmail.com on 15 Apr 2013 at 8:34

GoogleCodeExporter commented 8 years ago
And can you tell me how I can build it?

Original comment by vsv2711@gmail.com on 15 Apr 2013 at 8:35

GoogleCodeExporter commented 8 years ago
It is a pretty major bug. We'll see what we do about versions.

Meanwhile, to build, do "mvn package", and you'll find the result in the 
"target/" directory. If you want, I can send you a .jar file instead.

Original comment by lar...@gmail.com on 16 Apr 2013 at 8:16