dkpro / dkpro-jwktl

Java Wiktionary Library
http://dkpro.org/dkpro-jwktl/
Apache License 2.0

The number of parsed senses seems very small #40

Closed: rabravo closed this issue 7 years ago

rabravo commented 7 years ago

Hi dkpro-jwktl team,

I cloned the dkpro-jwktl project and was able to parse the following Wiktionary dump, enwiki-20170301-pages-articles.xml.bz2, without a problem after adding two instructions to the private SAXParserFactory getParserFactory() method of XMLDumpParser. These instructions lift the parser's processing limits so that more entries can be parsed. Before I added them, the library threw an Exception after parsing 650,000 entries. Here are the additional instructions that resolve this problem (I found this solution in another thread):

// Original instruction:
// return SAXParserFactory.newInstance("com.sun.org.apache.xerces.internal.jaxp.SAXParserFactoryImpl", null);

// Raise the JDK limit on the total size of all entities in the document.
System.setProperty("jdk.xml.totalEntitySizeLimit", "1500000000");
SAXParserFactory spf = SAXParserFactory.newInstance("com.sun.org.apache.xerces.internal.jaxp.SAXParserFactoryImpl", null);
// Turn off secure processing for this factory.
spf.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, false);
return spf;

After I modified the code, the dump was parsed with no Exceptions or Errors. However, when I ran one of the examples, Example3_IterateEntries.java, the output showed the following results:

Pages: 10117574
Entries: 3776520
Senses: 986

The number of senses seems far too small: I presume it should be the largest of the three numbers, or at least equal to the number of pages or entries. I also tried the Word Senses examples suggested in

https://dkpro.github.io/dkpro-jwktl/documentation/architecture/

with the word "Boat" (certainly many more instances) and got an IndexOutOfBoundsException. Do you have any idea why the library is not capturing enough senses? Finally, when I used "boat" instead, I got a NullPointerException. Thank you in advance for any help.
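The plausibility argument in this comment can be written down as a quick check. This is only a minimal sketch and not part of JWKTL: the class DumpStats and its methods are hypothetical names, and the rule that senses should be at least as numerous as entries is the poster's assumption, not a documented guarantee of the library.

```java
// Hypothetical helper, not part of JWKTL: holds the three counters
// printed by Example3_IterateEntries and applies a rough sanity rule.
final class DumpStats {
    final long pages;
    final long entries;
    final long senses;

    DumpStats(long pages, long entries, long senses) {
        this.pages = pages;
        this.entries = entries;
        this.senses = senses;
    }

    // Average number of senses per entry; 0.0 when there are no entries.
    double sensesPerEntry() {
        return entries == 0 ? 0.0 : (double) senses / entries;
    }

    // Assumption (from the comment above): a correctly parsed Wiktionary
    // dump should yield at least as many senses as entries.
    boolean looksPlausible() {
        return entries > 0 && senses >= entries;
    }

    public static void main(String[] args) {
        // Numbers reported in this issue.
        DumpStats reported = new DumpStats(10117574L, 3776520L, 986L);
        System.out.println("senses per entry: " + reported.sensesPerEntry());
        System.out.println("plausible: " + reported.looksPlausible());
    }
}
```

A check like this only flags that something is off; it does not say whether the cause is the parser or the input file.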

chmeyer commented 7 years ago

I tried an English dump from Feb 1, 2017 just recently and got

database.pages=5085081
database.entries=5721239
database.sense=13319857

I also tried parsing the most recent "boat" article page, which worked fine. Given that your number of pages is a lot higher than expected (10 million vs. 5 million), I assume that there is a problem with your dump file. Assuming you didn't change the file name of your dump, the problem seems to be that you are trying to parse a WikiPEDIA dump file with the JWKTL WikTIONARY library. Depending on your goal, please download the enwiktionary-... dump or take a look at the https://dkpro.github.io/dkpro-jwpl/ library. Please reopen this issue if it is likely that the problem is somewhere else.
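The file-name reasoning above could be automated before starting a long parse. The following is a rough sketch under stated assumptions, not something JWKTL provides: the class DumpNameCheck and the method guessProject are hypothetical, and the rule relies only on the convention that dump names start with a project prefix such as enwiki or enwiktionary followed by a dash. A renamed file would defeat it.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of JWKTL: guesses the Wikimedia project
// from the prefix of a dump file name.
final class DumpNameCheck {

    // Returns "wiktionary", "wikipedia", or "unknown".
    // Heuristic only: a prefix ending in "wiki" is treated as Wikipedia.
    static String guessProject(String dumpPath) {
        Path fileName = Paths.get(dumpPath).getFileName();
        if (fileName == null) {
            return "unknown";
        }
        String name = fileName.toString();
        int dash = name.indexOf('-');
        String prefix = dash < 0 ? name : name.substring(0, dash);
        if (prefix.endsWith("wiktionary")) {
            return "wiktionary";
        }
        if (prefix.endsWith("wiki")) {
            return "wikipedia";
        }
        return "unknown";
    }

    public static void main(String[] args) {
        // File name taken from this issue.
        String dump = "enwiki-20170301-pages-articles.xml.bz2";
        if (!"wiktionary".equals(guessProject(dump))) {
            System.out.println("Warning: " + dump + " does not look like a Wiktionary dump");
        }
    }
}
```

Checking the name is cheap, but it is only a first filter; the header of the dump itself is the more reliable source for which site it came from.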

rabravo commented 7 years ago

@chmeyer, your inference was correct. The dump I was using is the enwiki... one, which is a Wikipedia dump. This solves the mystery. Everything seems to work as it should. Thank you for taking the time to address my questions.