Closed wwfan91 closed 8 years ago
Hi,
these problems are bugs that come with the original MSNBC and IITB datasets. We programmed our adapters in a way that they are able to detect these positioning problems. However, since we want to offer reproducibility, we cannot fix these problems. Otherwise the new evaluation results wouldn't be comparable to the old results.
By the way, you might encounter problems like this in other datasets as well.
Cheers, Michael
Hi Michael, the problem is that because there are so many mismatches, the annotator gives poor results when I benchmark locally. E.g., I tried to run xLisa with the MSNBC dataset and got an F1 of 0.1919, while the result with the online GERBIL (http://gerbil.aksw.org/gerbil/) was 0.5537. I used GERBIL 1.2.4 (Jul-1). I'm not sure whether any modification was made to the dataset adapter?
Now I understand your concerns. That is a huge difference that shouldn't be there. However, I would like to make sure that the problem does not arise from the (still new) xLisa adapter. Could you please try the following: check out the current master (it is still 1.2.4 but with some hotfixes) and rerun your experiment. You may have to reduce the amount of time results are cached, which can be done by reducing the value in this line: https://github.com/AKSW/gerbil/blob/master/src/main/properties/gerbil.properties#L38 Does the difference remain?
You can also try to benchmark a different system with the MSNBC and IITB datasets. Is there a difference between the results of your local system and the online instance?
The problem still remains. The worse results are also observed with other annotators like AIDA, Babelfy, DBpedia Spotlight, Dexter, and WAT on the MSNBC dataset (IITB is too big to wait for, but I think it will show the same bug).
I just checked it with DBpedia Spotlight and the available cache files and found no difference.
Do you still have URI sameAs retrieval and entity checking disabled? (I think you wrote something like that in #163)
Yes. I downloaded and replaced the cache files mentioned in #163. For the DBpedia Spotlight annotator + MSNBC dataset: the local result is 0.1154 (Micro F1), while it's 0.418 with the online GERBIL. I use Win10 + Cygwin, but I don't think that makes the difference.
You don't need to download and use the cache files if you have deactivated the sameAs retrieval and the entity checking.
This difference in configuration could already explain the differences between your local GERBIL instance and the online instance. If I benchmark DBpedia Spotlight locally, I get a Micro F1 score of 0.4566. It does not matter whether I download the cache files from the server or let my machine retrieve the information from the web; this value was the same across all of my test runs. However, if I deactivate the entity checking and sameAs retrieval, the Micro F1 score drops to 0.354.
From my point of view, you cannot compare your results with the results of the online instance as long as you are not using the same configuration.
@wwfan91 Did you test it again with entity checking and sameAs retrieval enabled? Does the problem still occur, or can we close the issue?
I tried it twice with the newest repository (only replacing the cache files from #163). The thing is that many mismatch warnings are reported even with the small MSNBC dataset, and I think this causes the low performance. Here are the console logs: https://www.dropbox.com/s/01tsjexl6ut7wvv/gerbil_log.txt?dl=0
Sorry, I misunderstood your first post, as I thought that you only had 18 warnings. However, the log file you uploaded contains 619 of these warnings, which is far too many. This also explains the low results on MSNBC, since the dataset contains only 650 entities.
Where did you get the MSNBC files from? For some reason there seems to be a mismatch between the text files and the files containing the entity linking.
How do we want to proceed? Should I upload the MSNBC data we are using?
How many warnings do you get for IITB? What is the default character encoding your machine / JVM uses?
(Yes, there are many mismatch warnings for both datasets.) The datasets were downloaded by the start.sh script. However, they come zipped as gerbil_data.zip; I had to extract it manually and put it into the GERBIL dataset folder. I also think it's about a mismatch between reading the raw text and the annotation file. For other datasets, raw text and annotations are in a single file, so they don't have this problem. Can you upload the MSNBC and IITB datasets you are using so I can double-check?
The JVM default encoding information is:
Default Charset=windows-1252 file.encoding=Latin-1 Default Charset=windows-1252 Default Charset in Use=Cp1252
The problem is caused by the usage of the default encoding during the loading of the datasets inside GERBIL.
Hotfix solution: please add -Dfile.encoding=UTF-8 to the command with which you are executing GERBIL.
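To check which encoding a given JVM will actually use (and to verify that the flag took effect), here is a minimal standalone diagnostic sketch; it is not part of GERBIL, just an illustration:

```java
import java.nio.charset.Charset;

public class CharsetCheck {
    // The charset the JVM falls back to whenever no encoding is given
    // explicitly (e.g. by FileReader or String.getBytes()).
    public static String defaultCharsetName() {
        return Charset.defaultCharset().name();
    }

    public static void main(String[] args) {
        // Without -Dfile.encoding=UTF-8 this typically reports windows-1252
        // on a Western-locale Windows machine (note: since Java 18, the
        // default is UTF-8 regardless of the platform).
        System.out.println("Default Charset=" + defaultCharsetName());
        System.out.println("file.encoding=" + System.getProperty("file.encoding"));
    }
}
```

If this prints windows-1252 even after adding the flag, the property is not reaching the JVM that actually runs GERBIL (e.g. it is being swallowed by a wrapper script).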
@TortugaAttack can you please go through the places in which data is read (search for FileInputStreams or FileReaders - especially in the dataset adapters) and make sure that we are using UTF-8
for the transformation of read bytes to Strings? Note that a major difference between InputStreams and Reader classes is the usage of charsets. The InputStream only handles bytes and will never interpret them while the Reader class automatically uses the default encoding to transform the read data into Strings. That means that for our use case, InputStreams are good, Readers are bad (and should be avoided).
So instead of changing a Reader to an InputStream, it could also be done by explicitly telling the Reader to use UTF-8, right?
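A sketch of that option (the helper class and method names are illustrative, not actual GERBIL code): pinning the Reader to UTF-8 makes the decoding independent of the platform default.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class Utf8Read {
    // Reads a whole text file, decoding its bytes explicitly as UTF-8
    // instead of relying on the JVM's platform default (the root cause
    // of the positioning mismatches discussed above).
    public static String readUtf8(Path file) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader in = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            int c;
            while ((c = in.read()) != -1) {
                sb.append((char) c);
            }
        }
        return sb.toString();
    }
}
```

The same effect can be had with `new InputStreamReader(new FileInputStream(file), StandardCharsets.UTF_8)` wherever a plain FileReader is currently used.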
Btw, neither IITB nor MSNBC uses either of them. I will change that too.
Both datasets use the SAX XML parser and call it with an InputStream
https://github.com/AKSW/gerbil/blob/version1.2.5/src/main/java/org/aksw/gerbil/dataset/impl/msnbc/MSNBC_XMLParser.java#L59
https://github.com/AKSW/gerbil/blob/version1.2.5/src/main/java/org/aksw/gerbil/dataset/impl/iitb/IITB_XMLParser.java#L61
Please have a look at whether it is possible to configure the encoding of the parser. Otherwise, we can try to use a Reader that is configured to use UTF-8, as you suggested.
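For the SAX route, the standard way is to wrap the InputStream in an org.xml.sax.InputSource and set the encoding there before handing it to the parser. A minimal sketch (the class and method names are illustrative, not the actual GERBIL parser classes):

```java
import java.io.InputStream;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class SaxUtf8 {
    // Parses XML from a raw InputStream, forcing UTF-8 through the SAX
    // InputSource instead of letting the parser fall back to auto-detection,
    // and collects all character data from the document.
    public static String parseText(InputStream xml) throws Exception {
        final StringBuilder text = new StringBuilder();
        InputSource source = new InputSource(xml);
        source.setEncoding("UTF-8"); // explicit external encoding information
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        parser.parse(source, new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        });
        return text.toString();
    }
}
```

If a particular parser ignores the hint, passing a character stream instead (`new InputSource(new InputStreamReader(xml, StandardCharsets.UTF_8))`) forces the issue, since SAX never re-decodes a character stream.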
Strange, my Eclipse search does not find it. But good to know.
I will have a look at whether the parser is configurable.
I will change the other Readers to use explicit UTF-8.
Done with commit 55d110b. I will test it this evening; if everything works fine, I will close the issue.
Note that you might have to add -Dfile.encoding=Latin-1 to the execution for testing. Otherwise you won't encounter the problem, even without changing the code :wink:
Seems to work; the debugging was a success. Furthermore, MSNBC has the same results locally as online.
I did use -Dfile.encoding=Latin-1.
I got this log when trying to benchmark entity linking with the MSNBC dataset.
A similar issue happens while loading IITB.
How can I solve the mismatch problem?