Error loading annotations in IITB and MSNBC datasets caused by wrong encoding

wwfan91 commented 8 years ago

I got the following log when trying to benchmark entity linking with the MSNBC dataset:

2016-11-07 17:18:15,510 [pool-1-thread-6] WARN [org.aksw.gerbil.dataset.impl.msnbc.MSNBCDataset] - <In document http://MSNBC/Wor16447201.txt, the expected surface form of the named entity (657, 9, "Allahabad", [http://en.wikipedia.org/wiki/Allahabad]) does not fit the surface form derived from the text " in Allah".>
2016-11-07 17:18:15,510 [pool-1-thread-6] WARN [org.aksw.gerbil.dataset.impl.msnbc.MSNBCDataset] - <In document http://MSNBC/Wor16447201.txt, the expected surface form of the named entity (808, 16, "Hindu scriptures", [http://en.wikipedia.org/wiki/Hindu_scripture]) does not fit the surface form derived from the text "rom Hindu script".>
2016-11-07 17:18:15,510 [pool-1-thread-6] WARN [org.aksw.gerbil.dataset.impl.msnbc.MSNBCDataset] - <In document http://MSNBC/Wor16447201.txt, the expected surface form of the named entity (992, 15, "Maha Kumbh Mela", [http://en.wikipedia.org/wiki/Kumbh_Mela]) does not fit the surface form derived from the text "he “Maha Kumb".>
2016-11-07 17:18:15,510 [pool-1-thread-6] WARN [org.aksw.gerbil.dataset.impl.msnbc.MSNBCDataset] - <In document http://MSNBC/Wor16447201.txt, the expected surface form of the named entity (1016, 22, "Great Pitcher Festival", [http://en.wikipedia.org/wiki/Great_Pitcher_Festival]) does not fit the surface form derived from the text " or the Great Pitcher ".>
2016-11-07 17:18:15,510 [pool-1-thread-6] WARN [org.aksw.gerbil.dataset.impl.msnbc.MSNBCDataset] - <In document http://MSNBC/Wor16447201.txt, the expected surface form of the named entity (1110, 6, "Ganges", [http://en.wikipedia.org/wiki/Ganges_River]) does not fit the surface form derived from the text " in th".>
2016-11-07 17:18:15,510 [pool-1-thread-6] WARN [org.aksw.gerbil.dataset.impl.msnbc.MSNBCDataset] - <In document http://MSNBC/Wor16447201.txt, the expected surface form of the named entity (1505, 10, "Kumbh Mela", [http://en.wikipedia.org/wiki/Kumbh_Mela]) does not fit the surface form derived from the text "uring the ".>
2016-11-07 17:18:15,510 [pool-1-thread-6] WARN [org.aksw.gerbil.dataset.impl.msnbc.MSNBCDataset] - <In document http://MSNBC/Wor16447201.txt, the expected surface form of the named entity (1523, 16, "Naba Kumar Ghosh", [http://en.wikipedia.org/wiki/Naba_Kumar_Ghosh]) does not fit the surface form derived from the text "la,▒? said Naba".>
2016-11-07 17:18:15,511 [pool-1-thread-6] WARN [org.aksw.gerbil.dataset.impl.msnbc.MSNBCDataset] - <In document http://MSNBC/Wor16447201.txt, the expected surface form of the named entity (1597, 11, "West Bengal", [http://en.wikipedia.org/wiki/West_Bengal]) does not fit the surface form derived from the text "an state of".>
2016-11-07 17:18:15,511 [pool-1-thread-6] WARN [org.aksw.gerbil.dataset.impl.msnbc.MSNBCDataset] - <In document http://MSNBC/Wor16447201.txt, the expected surface form of the named entity (1699, 10, "Shakuntala", [*null*]) does not fit the surface form derived from the text "ner self.▒".>
2016-11-07 17:18:15,511 [pool-1-thread-6] WARN [org.aksw.gerbil.dataset.impl.msnbc.MSNBCDataset] - <In document http://MSNBC/Wor16447201.txt, the expected surface form of the named entity (1816, 14, "Madhya Pradesh", [http://en.wikipedia.org/wiki/Madhya_Pradesh]) does not fit the surface form derived from the text "Indian state o".>
2016-11-07 17:18:15,511 [pool-1-thread-6] WARN [org.aksw.gerbil.dataset.impl.msnbc.MSNBCDataset] - <In document http://MSNBC/Wor16447201.txt, the expected surface form of the named entity (1847, 6, "Ganges", [http://en.wikipedia.org/wiki/Ganges_River]) does not fit the surface form derived from the text "to bat".>
2016-11-07 17:18:15,511 [pool-1-thread-6] WARN [org.aksw.gerbil.dataset.impl.msnbc.MSNBCDataset] - <In document http://MSNBC/Wor16447201.txt, the expected surface form of the named entity (1885, 10, "Kumbh Mela", [http://en.wikipedia.org/wiki/Kumbh_Mela]) does not fit the surface form derived from the text "s done at ".>
2016-11-07 17:18:15,511 [pool-1-thread-6] WARN [org.aksw.gerbil.dataset.impl.msnbc.MSNBCDataset] - <In document http://MSNBC/Wor16447201.txt, the expected surface form of the named entity (2055, 5, "Kumbh", [http://en.wikipedia.org/wiki/Kumbh_Mela]) does not fit the surface form derived from the text "be he".>
2016-11-07 17:18:15,511 [pool-1-thread-6] WARN [org.aksw.gerbil.dataset.impl.msnbc.MSNBCDataset] - <In document http://MSNBC/Wor16447201.txt, the expected surface form of the named entity (2071, 9, "Rama Devi", [http://en.wikipedia.org/wiki/Rama_Devi]) does not fit the surface form derived from the text "next 'Kum".>
2016-11-07 17:18:15,511 [pool-1-thread-6] WARN [org.aksw.gerbil.dataset.impl.msnbc.MSNBCDataset] - <In document http://MSNBC/Wor16447201.txt, the expected surface form of the named entity (2100, 9, "Allahabad", [http://en.wikipedia.org/wiki/Allahabad]) does not fit the surface form derived from the text "Devi, an ".>
2016-11-07 17:18:15,511 [pool-1-thread-6] WARN [org.aksw.gerbil.dataset.impl.msnbc.MSNBCDataset] - <In document http://MSNBC/Wor16447201.txt, the expected surface form of the named entity (2166, 5, "Kumbh", [http://en.wikipedia.org/wiki/Kumbh_Mela]) does not fit the surface form derived from the text "has n".>
2016-11-07 17:18:15,512 [pool-1-thread-6] WARN [org.aksw.gerbil.dataset.impl.msnbc.MSNBCDataset] - <In document http://MSNBC/Wor16447201.txt, the expected surface form of the named entity (2314, 11, "Indian army", [http://en.wikipedia.org/wiki/Indian_army]) does not fit the surface form derived from the text "d son, a so".>
2016-11-07 17:18:15,512 [pool-1-thread-6] WARN [org.aksw.gerbil.dataset.impl.msnbc.MSNBCDataset] - <In document http://MSNBC/Wor16447201.txt, the expected surface form of the named entity (2422, 9, "Allahabad", [http://en.wikipedia.org/wiki/Allahabad]) does not fit the surface form derived from the text "ers.

A similar issue occurs while loading IITB:


2016-11-07 16:51:23,830 [pool-1-thread-12] WARN [org.aksw.gerbil.dataset.impl.iitb.IITBDataset] - <In document http://IITB/13Oct08AmitHealth12.txt, the named entity "tnessed d" has an alphabetic character in front of it ("i").>
2016-11-07 16:51:23,830 [pool-1-thread-12] WARN [org.aksw.gerbil.dataset.impl.iitb.IITBDataset] - <In document http://IITB/13Oct08AmitHealth12.txt, the named entity "tnessed d" has an alphabetic character directly behind it ("i").>
2016-11-07 16:51:23,830 [pool-1-thread-12] WARN [org.aksw.gerbil.dataset.impl.iitb.IITBDataset] - <In document http://IITB/13Oct08AmitHealth12.txt, the named entity " fat t" has an alphabetic character in front of it ("f").>
2016-11-07 16:51:23,830 [pool-1-thread-12] WARN [org.aksw.gerbil.dataset.impl.iitb.IITBDataset] - <In document http://IITB/13Oct08AmitHealth12.txt, the named entity " fat t" starts with a whitespace.>
2016-11-07 16:51:23,830 [pool-1-thread-12] WARN [org.aksw.gerbil.dataset.impl.iitb.IITBDataset] - <In document http://IITB/13Oct08AmitHealth12.txt, the named entity " fat t" has an alphabetic character directly behind it ("o").>
2016-11-07 16:51:23,830 [pool-1-thread-12] WARN [org.aksw.gerbil.dataset.impl.iitb.IITBDataset] - <In document http://IITB/13Oct08AmitHealth12.txt, the named entity "ean prod" has an alphabetic character in front of it ("b").>
2016-11-07 16:51:23,830 [pool-1-thread-12] WARN [org.aksw.gerbil.dataset.impl.iitb.IITBDataset] - <In document http://IITB/13Oct08AmitHealth12.txt, the named entity "ean prod" has an alphabetic character directly behind it ("u").>
2016-11-07 16:51:23,830 [pool-1-thread-12] WARN [org.aksw.gerbil.dataset.impl.iitb.IITBDataset] - <In document http://IITB/13Oct08AmitHealth12.txt, the named entity "roble" has an alphabetic character in front of it ("p").>
2016-11-07 16:51:23,830 [pool-1-thread-12] WARN [org.aksw.gerbil.dataset.impl.iitb.IITBDataset] - <In document http://IITB/13Oct08AmitHealth12.txt, the named entity "roble" has an alphabetic character directly behind it ("m").>
2016-11-07 16:51:23,830 [pool-1-thread-12] WARN [org.aksw.gerbil.dataset.impl.iitb.IITBDataset] - <In document http://IITB/13Oct08AmitHealth12.txt, the named entity "the fat " has an alphabetic character directly behind it ("r").>
2016-11-07 16:51:23,830 [pool-1-thread-12] WARN [org.aksw.gerbil.dataset.impl.iitb.IITBDataset] - <In document http://IITB/13Oct08AmitHealth12.txt, the named entity "the fat " ends with a whitespace.>
2016-11-07 16:51:23,830 [pool-1-thread-12] WARN [org.aksw.gerbil.dataset.impl.iitb.IITBDataset] - <In document http://IITB/13Oct08AmitHealth12.txt, the named entity " to t" has an alphabetic character in front of it ("e").>
2016-11-07 16:51:23,830 [pool-1-thread-12] WARN [org.aksw.gerbil.dataset.impl.iitb.IITBDataset] - <In document http://IITB/13Oct08AmitHealth12.txt, the named entity " to t" starts with a whitespace.>
2016-11-07 16:51:23,830 [pool-1-thread-12] WARN [org.aksw.gerbil.dataset.impl.iitb.IITBDataset] - <In document http://IITB/13Oct08AmitHealth12.txt, the named entity " to t" has an alphabetic character directly behind it ("h").>

How can I solve the mismatch problem?

MichaelRoeder commented 8 years ago

Hi,

these problems are bugs that come with the original MSNBC and IITB datasets. We programmed our adapters so that they are able to detect these positioning problems. However, since we want to offer reproducibility, we cannot fix these problems. Otherwise, the new evaluation results wouldn't be comparable to the old results.
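To illustrate what such a positioning check looks like, here is a minimal sketch. It is not GERBIL's adapter code: the class `SurfaceFormCheck`, the method `findMismatch`, and the sample sentence are hypothetical and only mirror the kind of comparison the warnings above describe (start offset, length, expected surface form).

```java
import java.util.Optional;

/** Hypothetical sketch of an offset sanity check; not GERBIL's actual adapter code. */
final class SurfaceFormCheck {

    /**
     * Returns the text found at [start, start + length) if it differs from the
     * expected surface form, a placeholder if the span lies outside the text,
     * or an empty Optional if the annotation fits.
     */
    static Optional<String> findMismatch(String text, int start, int length, String expected) {
        int end = start + length;
        if (start < 0 || length < 0 || end > text.length()) {
            return Optional.of("<span out of range>");
        }
        String actual = text.substring(start, end);
        return actual.equals(expected) ? Optional.empty() : Optional.of(actual);
    }

    public static void main(String[] args) {
        // Made-up sample sentence, chosen so that "Allahabad" starts at offset 21.
        String text = "Pilgrims gathered in Allahabad for the festival.";

        // Correct offset: nothing is reported.
        System.out.println(findMismatch(text, 21, 9, "Allahabad").isPresent());

        // Offset shifted by four characters: the derived text no longer fits.
        findMismatch(text, 17, 9, "Allahabad").ifPresent(actual ->
                System.out.println("expected \"Allahabad\" but found \"" + actual + "\""));
    }
}
```

A check like this only detects that the offsets and the text disagree; it cannot tell whether the annotation file or the loaded text is at fault.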

By the way, you might encounter problems like this in other datasets as well.

Cheers, Michael

wwfan91 commented 8 years ago

Hi Michael, the problem is that there are so many mismatches that the annotator gives poor results when I benchmark locally. For example, I ran xLisa with the MSNBC dataset and got an F1 of 0.1919, while the result with the online GERBIL (http://gerbil.aksw.org/gerbil/) was 0.5537. I used GERBIL 1.2.4 (Jul-1). Has any modification been made to the dataset adapter?

MichaelRoeder commented 8 years ago

Now I understand your concerns. That is a huge difference that shouldn't be there. However, I would like to make sure that the problem does not arise from the (still new) xLisa adapter. Could you please try the following: check out the current master (it is still 1.2.4 but with some hotfixes) and rerun your experiment. You may have to reduce the amount of time results are cached, which can be done by reducing the time in this line: https://github.com/AKSW/gerbil/blob/master/src/main/properties/gerbil.properties#L38 Does the difference remain?

You can also try to benchmark a different system with the MSNBC and IITB datasets. Is there a difference between the results of your local system and the online instance?

wwfan91 commented 8 years ago

The problem still remains. Worse results are also observed with other annotators such as AIDA, Babelfy, DBpedia Spotlight, Dexter, and WAT on the MSNBC dataset (IITB is too big to wait for, but I think it will have the same bug).

MichaelRoeder commented 8 years ago

I just checked it with DBpedia Spotlight and the available cache files and found no difference.

Do you still have URI sameAs retrieval and entity checking disabled? (I think you wrote something like that in #163)

wwfan91 commented 8 years ago

Yes. I downloaded and replaced the cache files mentioned in #163. For the DBpedia Spotlight annotator with the MSNBC dataset, the local result is 0.1154 (Micro F1), while it is 0.418 with the online GERBIL. I use Win10 + Cygwin, but I don't think that makes a difference.

MichaelRoeder commented 8 years ago

You don't need to download and use the cache files if you have deactivated the sameAs retrieval and the entity checking.

This difference in configuration could already explain the differences between your local GERBIL instance and the online instance. If I benchmark DBpedia Spotlight locally, I get a Micro F1 score of 0.4566. It does not matter whether I download the cache files from the server or let my machine retrieve the information from the web; this value was the same through all of my test runs. However, if I deactivate the entity checking and sameAs retrieval, the Micro F1 score drops to 0.354.

From my point of view, you cannot compare your results with the results of the online instance as long as you are not using the same configuration.

MichaelRoeder commented 8 years ago

@wwfan91 Did you test it again with entity checking and sameAs retrieval enabled? Does the problem still occur, or can we close the issue?

wwfan91 commented 8 years ago

I tried it twice with the newest repository (I only replaced the cache files from #163). The thing is that many mismatch warnings are reported even with a small dataset like MSNBC, and I think this causes the low performance. Here are the console logs: https://www.dropbox.com/s/01tsjexl6ut7wvv/gerbil_log.txt?dl=0

MichaelRoeder commented 8 years ago

Sorry, I misunderstood your first post, as I thought that you only had 18 warnings. However, the log file you uploaded contains 619 of these warnings, which is far too many. This also explains the low results on MSNBC, since the dataset contains only 650 entities.

Where did you get the MSNBC files from? For some reason there seems to be a mismatch between the text files and the files containing the entity linking.

How do we want to proceed? Should I upload the MSNBC data we are using?

How many warnings do you get for IITB? What is the default character encoding your machine / JVM uses?

wwfan91 commented 8 years ago

(Yes, there are many mismatch warnings for both datasets.) The datasets were downloaded by the start.sh script. However, they arrive zipped as gerbil_data.zip, so I had to extract the archive manually and put its contents into the GERBIL dataset folder. I also think the mismatch arises when the raw text and the annotation file are read. For other datasets, raw text and annotations are in a single file, so they don't have this problem. Can you upload the MSNBC and IITB datasets you are using so that I can double-check?

The JVM default encoding information is:

Default Charset=windows-1252
file.encoding=Latin-1
Default Charset=windows-1252
Default Charset in Use=Cp1252
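For anyone who wants to check the same values on their own machine, a small program along the following lines reports them. This is only a sketch: the class name `ShowDefaultCharset` is made up, and it is not necessarily the program used to produce the values quoted above.

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStreamWriter;
import java.nio.charset.Charset;

/** Hypothetical helper that reports which charset the JVM falls back to. */
final class ShowDefaultCharset {

    static String describe() {
        // Charset the JVM uses whenever no charset is passed explicitly.
        String defaultCharset = Charset.defaultCharset().name();
        // Raw system property; it can differ from the effective default charset.
        String fileEncoding = System.getProperty("file.encoding");
        // Encoding a Writer picks when it is constructed without a charset
        // (getEncoding() may return a historical name such as "Cp1252").
        String inUse = new OutputStreamWriter(new ByteArrayOutputStream()).getEncoding();

        return "Default Charset=" + defaultCharset + System.lineSeparator()
                + "file.encoding=" + fileEncoding + System.lineSeparator()
                + "Default Charset in Use=" + inUse;
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```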

MichaelRoeder commented 8 years ago

The problem is caused by the usage of the default encoding during the loading of the datasets inside GERBIL.
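A minimal sketch of why the wrong charset shifts the annotation offsets, assuming the dataset text is stored as UTF-8 and contains characters such as curly quotes. The class `EncodingShiftDemo` and the sample string are illustrative only, and ISO-8859-1 stands in for a single-byte default charset such as windows-1252.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Illustrative only: shows how decoding UTF-8 bytes with a single-byte charset moves offsets. */
final class EncodingShiftDemo {

    /** Decodes the bytes with the given charset and returns the offset of the needle. */
    static int offsetOf(byte[] bytes, Charset charset, String needle) {
        return new String(bytes, charset).indexOf(needle);
    }

    public static void main(String[] args) {
        // Sample text with curly quotes (U+201C, U+201D), which take three bytes each in UTF-8.
        String original = "the \u201CMaha Kumbh Mela\u201D";
        byte[] utf8Bytes = original.getBytes(StandardCharsets.UTF_8);

        int correct = offsetOf(utf8Bytes, StandardCharsets.UTF_8, "Maha");
        // A single-byte charset turns every byte into one char, so each curly
        // quote becomes three chars and everything after it moves to the right.
        int shifted = offsetOf(utf8Bytes, StandardCharsets.ISO_8859_1, "Maha");

        System.out.println("offset when decoded as UTF-8:      " + correct);
        System.out.println("offset when decoded as ISO-8859-1: " + shifted);
    }
}
```

Because the shift accumulates with every multi-byte character, annotations later in a document drift further from their expected positions than earlier ones.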

Hotfix solution: please add -Dfile.encoding=UTF-8 to the command with which you are executing GERBIL.

@TortugaAttack can you please go through the places in which data is read (search for FileInputStreams or FileReaders - especially in the dataset adapters) and make sure that we are using UTF-8 for the transformation of read bytes to Strings? Note that a major difference between InputStreams and Reader classes is the usage of charsets. The InputStream only handles bytes and will never interpret them while the Reader class automatically uses the default encoding to transform the read data into Strings. That means that for our use case, InputStreams are good, Readers are bad (and should be avoided).
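As a sketch of the pattern described here, the following helper keeps the data as bytes in an InputStream and names the charset at the single point where bytes become a String. The class `Utf8TextLoader` and the method `readUtf8` are hypothetical names, not part of GERBIL.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper: decodes a byte stream as UTF-8 regardless of the JVM default. */
final class Utf8TextLoader {

    static String readUtf8(InputStream in) {
        // The charset is passed explicitly, so -Dfile.encoding has no influence here.
        try (Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8)) {
            StringBuilder text = new StringBuilder();
            char[] buffer = new char[4096];
            int read;
            while ((read = reader.read(buffer)) != -1) {
                text.append(buffer, 0, read);
            }
            return text.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

By contrast, `new FileReader(file)` and `new InputStreamReader(in)` without a charset argument fall back to the default encoding, which is what makes the result depend on the machine.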

TortugaAttack commented 8 years ago

So instead of changing Reader to InputStream, it could also be done by explicitly telling the Reader to use UTF-8, right?

By the way, neither IITB nor MSNBC uses either of them. I will change that too.

MichaelRoeder commented 8 years ago

Both datasets use the SAX XML parser and call it with an InputStream:
https://github.com/AKSW/gerbil/blob/version1.2.5/src/main/java/org/aksw/gerbil/dataset/impl/msnbc/MSNBC_XMLParser.java#L59
https://github.com/AKSW/gerbil/blob/version1.2.5/src/main/java/org/aksw/gerbil/dataset/impl/iitb/IITB_XMLParser.java#L61

Please have a look whether it is possible to configure the encoding of the parser. Otherwise, we can try to use a Reader that is configured to use UTF-8 as you suggested.
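One way the encoding of a SAX parse can be fixed is to wrap the byte stream in an `org.xml.sax.InputSource` and call `setEncoding` on it. The sketch below assumes the XML files are UTF-8; the class `Utf8SaxExample` and its handler are illustrative and not taken from the GERBIL parsers linked above. Without `setEncoding`, a SAX parser given a byte stream normally autodetects the encoding from the XML declaration or byte order mark, so this call mainly makes the choice explicit.

```java
import java.io.InputStream;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Illustrative sketch: parse XML from a byte stream with an explicitly declared encoding. */
final class Utf8SaxExample {

    /** Collects all character data of the document, decoding the bytes as UTF-8. */
    static String parseText(InputStream in) {
        StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        };
        try {
            InputSource source = new InputSource(in);
            // Tell the parser how to decode the byte stream instead of relying on detection.
            source.setEncoding("UTF-8");
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            parser.parse(source, handler);
        } catch (Exception e) {
            throw new IllegalStateException("XML parsing failed", e);
        }
        return text.toString();
    }
}
```

The alternative mentioned above, passing a Reader that was constructed with UTF-8, works through `new InputSource(reader)`; in that case the parser receives characters and no longer decodes bytes itself.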

TortugaAttack commented 8 years ago

Strange, my Eclipse search does not find it. But good to know. I will have a look at whether the parser is configurable. I will change the other Readers to use UTF-8 explicitly.

TortugaAttack commented 8 years ago

Done with commit 55d110b. I will test it this evening; if everything works fine, I will close the issue.

MichaelRoeder commented 8 years ago

Note that you might have to add -Dfile.encoding=Latin-1 to the execution. Otherwise you wouldn't encounter the problem even if you don't change the code :wink:

TortugaAttack commented 8 years ago

Seems to work. Debugging was a success. Furthermore, MSNBC has the same results locally as online.

I did use -Dfile.encoding=Latin-1.

dice-group / gerbil

Error loading annotations in IITB and MSNBC datasets caused by wrong encoding #165