alexmy21 / jwpl

Automatically exported from code.google.com/p/jwpl

[DataMachine] Problems with title hashing in the DataMachine #91

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
Reported on the user list: 
https://groups.google.com/forum/?fromgroups#!topic/jwpl/TDD05oG7Xho

When I used the DataMachine to convert a Wikipedia dump to txt, I found that the 
redirect page "Mass Transit Railway" is incorrectly linked to "N63". I traced the 
code and found that the redirect target of "Mass Transit Railway" is "MTR". The 
DataMachine hashes this string to a hash code (76683), which collides with the 
hash code of "N63". Does anyone have a solution to this problem?

Original issue reported on code.google.com by oliver.ferschke on 3 May 2012 at 9:51
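
The collision described in the report can be checked directly. The reported value 76683 matches what java.lang.String.hashCode() returns for "MTR", so the sketch below assumes the DataMachine's int keys come from that method (the "JDKIntKey" factory name suggests it, but the thread does not confirm it). The class name TitleHashCollision and the helper polynomialHash are made up for illustration.

```java
public class TitleHashCollision {
    // Recomputes the polynomial documented for java.lang.String.hashCode():
    // s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1], using int arithmetic.
    public static int polynomialHash(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = 31 * h + s.charAt(i);
        }
        return h;
    }

    public static void main(String[] args) {
        // "MTR": 77*961 + 84*31 + 82 = 76683
        // "N63": 78*961 + 54*31 + 51 = 76683
        System.out.println("MTR -> " + "MTR".hashCode());
        System.out.println("N63 -> " + "N63".hashCode());
        System.out.println("collision: " + ("MTR".hashCode() == "N63".hashCode()));
    }
}
```

Both titles map to 76683, so any structure keyed only by this int cannot tell them apart.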

GoogleCodeExporter commented 9 years ago
This issue was closed by revision r576.

Original comment by oliver.ferschke on 3 May 2012 at 9:53

GoogleCodeExporter commented 9 years ago
We should not hash the titles.

Solution:
In applicationContext.xml, change
<bean id="dumpVersionFactory" class="de.tudarmstadt.ukp.wikipedia.datamachine.dump.version.SingleDumpVersionJDKIntKeyFactory" scope="singleton" />

to

<bean id="dumpVersionFactory" class="de.tudarmstadt.ukp.wikipedia.datamachine.dump.version.SingleDumpVersionJDKStringKeyFactory" scope="singleton" />

This should fix the issue for all databases created in the future.
There is no fix for existing databases.

This has been fixed in r576.

Original comment by oliver.ferschke on 3 May 2012 at 9:55
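
Why string keys avoid the problem can be sketched with two plain maps. This is not the DataMachine's actual implementation: the class TitleKeyDemo, its methods, and the page ids 1 and 2 are hypothetical, and the only assumption carried over from the thread is that int keys are derived from the title's hash code.

```java
import java.util.HashMap;
import java.util.Map;

public class TitleKeyDemo {
    // Hypothetical index keyed by the title's hash code (int key variant).
    public static Map<Integer, Integer> indexByHash(String[] titles, int[] ids) {
        Map<Integer, Integer> index = new HashMap<>();
        for (int i = 0; i < titles.length; i++) {
            // A later title with the same hash silently overwrites the earlier one.
            index.put(titles[i].hashCode(), ids[i]);
        }
        return index;
    }

    // Hypothetical index keyed by the title itself (string key variant).
    public static Map<String, Integer> indexByTitle(String[] titles, int[] ids) {
        Map<String, Integer> index = new HashMap<>();
        for (int i = 0; i < titles.length; i++) {
            index.put(titles[i], ids[i]);
        }
        return index;
    }

    public static void main(String[] args) {
        String[] titles = {"MTR", "N63"};
        int[] ids = {1, 2}; // made-up page ids

        Map<Integer, Integer> byHash = indexByHash(titles, ids);
        Map<String, Integer> byTitle = indexByTitle(titles, ids);

        // Resolving the redirect target "MTR" through the hash-keyed index
        // returns the id stored for "N63", because both share one key.
        System.out.println("by hash:  MTR -> " + byHash.get("MTR".hashCode()));
        System.out.println("by title: MTR -> " + byTitle.get("MTR"));
    }
}
```

With the hash-keyed index the lookup for "MTR" yields 2 (the id stored for "N63"), while the title-keyed index yields 1. HashMap still hashes string keys internally, but it compares them with equals(), so colliding titles remain separate entries.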

GoogleCodeExporter commented 9 years ago

Original comment by oliver.ferschke on 15 Aug 2012 at 9:21