Open liehe opened 7 years ago
The ARGMAX results represent the "Local Mention" prior, and they should be much higher (cf. Table 3 in our paper). Which p(e|m) indexes do you use? Are you using the ones that we provided, which can be found here: https://polybox.ethz.ch/index.php/s/IOWjGrU3mjyzDSV/authenticate
It seems there is a big overlap between PBOH and LocalMention (the "common Loopy - ARGMAX" part). I will try to re-run it tonight on a fresh machine if you still cannot solve this issue. Can you please send me your full output log file by e-mail?
Hi, thanks for your fast response.
I have not changed the method or the index itself. All I have changed is the index paths in the code, the file encoding (I added UTF-8 when using Source.fromFile()), and the AIDA dataset name. The name given in AIDA.scala is "testa_testb_aggregate"; I didn't find a file with this name, so I used the output file from "aida-yago2-dataset.jar". Also, I only ran AIDA test A and ignored all the other datasets to save time.
I am going to run it again to see if the result is the same. If so, I will send you the output log.
I am not sure what the output file from "aida-yago2-dataset.jar" is, but your testa_testb_aggregate should contain the AIDA-A and AIDA-B datasets and be generated as described on the MPI website. It should look as follows (sorry, it has a license from MPI and I cannot upload the full file myself). One word per line, with tab-separated annotations when the word is part of a mention:
-DOCSTART- (947testa CRICKET)
CRICKET
-
LEICESTERSHIRE B LEICESTERSHIRE Leicestershire_County_Cricket_Club http://en.wikipedia.org/wiki/Leicestershire_County_Cricket_Club 1622318 /m/05hf4j
TAKE
OVER
AT
TOP
AFTER
INNINGS
VICTORY
.
LONDON B LONDON London http://en.wikipedia.org/wiki/London 17867 /m/04jpl
1996-08-30
West B West Indian West_Indies_cricket_team http://en.wikipedia.org/wiki/West_Indies_cricket_team 3379941 /m/098knd
Indian I West Indian West_Indies_cricket_team http://en.wikipedia.org/wiki/West_Indies_cricket_team 3379941 /m/098knd
all-rounder
Phil B Phil Simmons Phil_Simmons http://en.wikipedia.org/wiki/Phil_Simmons 2518836 /m/07kgj4
Simmons I Phil Simmons Phil_Simmons http://en.wikipedia.org/wiki/Phil_Simmons 2518836 /m/07kgj4
took
four
for
38
on
Friday
as
Leicestershire B Leicestershire Leicestershire_County_Cricket_Club http://en.wikipedia.org/wiki/Leicestershire_County_Cricket_Club 1622318 /m/05hf4j
beat
Somerset B Somerset Somerset_County_Cricket_Club http://en.wikipedia.org/wiki/Somerset_County_Cricket_Club 1622178 /m/05hdty
by
an
innings
and
39
runs
in
two
days
to
take
over
at
the
head
of
the
county
championship
.
Their
stay
on
top
,
though
,
may
be
short-lived
as
title
rivals
Essex B Essex Essex_County_Cricket_Club http://en.wikipedia.org/wiki/Essex_County_Cricket_Club 1622252 /m/05hdzj
,
Derbyshire B Derbyshire Derbyshire_County_Cricket_Club http://en.wikipedia.org/wiki/Derbyshire_County_Cricket_Club 1829984 /m/05_blf
and
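To check that a local file has this layout, a minimal Python sketch along the following lines could list the mentions it contains. The column meanings (token, B/I tag, mention, entity title, Wikipedia URL, numeric id, Freebase id) are read off the sample above and are an assumption, and `parse_aida` is a hypothetical helper, not part of PBoH.

```python
def parse_aida(lines):
    """Yield (doc_id, mention, entity_title, numeric_id) once per mention.

    Assumes tab-separated columns as in the sample: token, B/I tag, mention,
    entity title, Wikipedia URL, numeric id, Freebase id. Lines with fewer
    than six columns (plain tokens, unlinked mentions) are skipped.
    """
    doc_id = None
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("-DOCSTART-"):
            # "-DOCSTART- (947testa CRICKET)" -> "947testa CRICKET"
            doc_id = line[len("-DOCSTART-"):].strip().strip("()")
            continue
        fields = line.split("\t")
        # Only the first token of a mention carries the "B" tag.
        if len(fields) >= 6 and fields[1] == "B":
            yield doc_id, fields[2], fields[3], fields[5]
```

Calling `list(parse_aida(open(path, encoding="utf-8")))` on the dataset file should return one tuple per linked mention; an empty list would suggest the file is not in the expected tab-separated format.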
My dataset does have these lines, so the dataset should be fine.
Something is clearly wrong with the p(e|m) index that you use. "perc missing mentions from index : 14.97" is the percentage of mentions m that are not found in the dictionary, while "perc missing entities from mention index : 17.02" is the percentage of gold entities that do not appear in the respective mention entry. Together, these should be less than 5%. Looking at your log file, I see that even common names like "kurdish", "tunisia" or "boston" are missing. Can you please check whether they appear in your p(e|m) file (called mek-top-freq-crosswikis-plus-wikipedia-lowercase-top64.txt, which should be constructed as a concatenation of the 2 files mek-top-freq-crosswikis-plus-wikipedia-lowercase-top64.txt.part_a*)?
This should give a non-empty output:
cat mek-top-freq-crosswikis-plus-wikipedia-lowercase-top64.txt.part_a* | grep -P '^boston\t' | more
namely:
boston 10 10 113041 24437894,85112 167665,4306 65194,3140 43376,1974 69523,691 4339,596 23017869,451 882398,417 201767,383 182265,347 730207,324 126401,318 18346514,285 2323878,278 83622,264 5503022,263 2004519,247 513495,225 1080900,217 211579,207 10128235,199 23876058,190 5637547,179 4608353,159 876997,158 6721569,150 26372818,150 2593807,142 1692165,139 82254,138 2338329,127 110372,127 28800877,127 1423832,127 24239512,118 1843613,117 8871435,115 4319938,105 12195659,104 112680,102 206780,100 61114,100 621979,98 3387737,92 13436708,91 13832944,90 117682,88 21441922,86 4584803,86 5889873,82 230105,68 8305575,67 30055728,60 213128,60 1277059,60 148265,59 1294905,56 23911424,56 206779,53 298047,51 11773015,50 10982304,50 12202877,45 25409421,45
You need to create a new file containing the contents of both mek-top-freq-crosswikis-plus-wikipedia-lowercase-top64.txt.part_a* files, and update its path here: https://github.com/dalab/pboh-entity-linking/blob/master/src/main/scala/index/AllIndexesBox.scala#L19 . Similarly, you need to update the paths of all the other index files listed in the same Scala file. Let me know if it works.
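If it helps, the concatenation and the lookup can also be sketched in Python. This is only an illustration: `concat_parts` and `find_mention` are hypothetical helper names, and the only assumption about the file format is that each line starts with the mention followed by a tab, as in the grep pattern above.

```python
import shutil


def concat_parts(part_paths, out_path):
    """Concatenate the part files, in sorted name order, into out_path."""
    with open(out_path, "wb") as out:
        for part in sorted(part_paths):
            with open(part, "rb") as src:
                shutil.copyfileobj(src, out)


def find_mention(index_path, mention):
    """Return the first line whose leading tab-separated field is `mention`.

    Returns None when no such line exists, which is the symptom described
    for "boston" in the log.
    """
    prefix = mention + "\t"
    with open(index_path, encoding="utf-8") as f:
        for line in f:
            if line.startswith(prefix):
                return line.rstrip("\n")
    return None
```

For example, `concat_parts(glob.glob("mek-top-freq-crosswikis-plus-wikipedia-lowercase-top64.txt.part_a*"), "mek-top-freq-crosswikis-plus-wikipedia-lowercase-top64.txt")` followed by `find_mention(..., "boston")` should return a non-empty line if the index is complete. Sorting the part names mirrors the order in which the shell expands `part_a*`.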
Thank you so much. It is hard for me to target the problem. I will check the indexes.
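As a rough aid for checking the indexes, the two percentages quoted from the log can be recomputed outside PBoH with a sketch like the one below. It is not the PBoH implementation: `missing_stats` is a hypothetical helper, and it assumes that both percentages are taken over all gold (mention, entity) pairs and that mentions are lowercased before lookup, as the "lowercase" in the index file name suggests.

```python
def missing_stats(gold_pairs, index):
    """Return (pct_missing_mentions, pct_missing_entities).

    gold_pairs: iterable of (mention, gold_entity_id) tuples.
    index: dict mapping a lowercased mention to a set of candidate entity ids.
    Assumption: both percentages use the total number of gold pairs as the
    denominator; the actual log may compute them differently.
    """
    total = missing_mentions = missing_entities = 0
    for mention, entity in gold_pairs:
        total += 1
        candidates = index.get(mention.lower())
        if candidates is None:
            # The mention has no entry in the p(e|m) dictionary at all.
            missing_mentions += 1
        elif entity not in candidates:
            # The mention is present, but the gold entity is not a candidate.
            missing_entities += 1
    if total == 0:
        return 0.0, 0.0
    return 100.0 * missing_mentions / total, 100.0 * missing_entities / total
```

With a complete index, the two returned values should add up to a small number; large values for frequent mentions such as "boston" point to a truncated or wrongly assembled index file.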
According to the paper, the performance of PBoH on AIDA test A is 86.63/85.48. Due to the upgrade of Gerbil, the performance of PBoH given here is 75.19/73.3.
However, when I try to reproduce the result, I get the following (64.84/64.32).
I changed from
val file = "/media/hofmann-scratch/Octavian/entity_linking/marinah/AIDA/testa_testb_aggregate"
to "AIDA-YAGO2-dataset.tsv", which is generated from files downloaded from MPI-info.
I use
java -Xmx90g -cp target/PBoH-1.0-SNAPSHOT-jar-with-dependencies.jar el.EL_LBP_Spark testPBOHOnAllDatasets max-product
to run the code because the command
scala -J-Xmx90g target/PBoH-1.0-SNAPSHOT-jar-with-dependencies.jar testPBOHOnAllDatasets max-product
will generate an UnsatisfiedLinkError when it tries to use leveldbjni.
Did I make any mistakes in the process? How can I reproduce the result in Gerbil?
Thanks.