dbpedia / extraction-framework

The software used to extract structured data from Wikipedia

Anchors starting with # (fragment-only) in documents cause document parsing to be aborted #433

Closed praveenbalaji-blippar closed 2 years ago

praveenbalaji-blippar commented 8 years ago

When I run the DBpedia extractor (master), I see messages like this:

WARNING: error processing page 'title=Cola;ns=0/Main/;language:wiki=en,locale=en': org.dbpedia.extraction.wikiparser.WikiParserException: page name must not be empty

This causes the page on "Cola" to be completely ignored. This is the offending line in the document:

{{See also|#Brands of Cola|label 1=Cola brands (shown below)|Category:Cola brands}}

Turns out WikiTitle dislikes #Brands of Cola.

Also, line 136 in WikiTitle.scala

var parts = decoded.split(":", -1)

uses split, which returns a non-empty array (a single empty-string element) even when the input string is empty. I suspect this is unintended (see the comment: //Check if this is an interlanguage link (beginning with ':')).
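For illustration, here is a minimal sketch of the split behaviour described above. The object and method names are made up for this example, and it assumes (not confirmed from the source) that the fragment has already been stripped by the time the split runs, so that "#Brands of Cola" reaches it as an empty string:

```scala
object SplitDemo {
  // Same call as the quoted line: a negative limit keeps all empty strings.
  def parts(decoded: String): Array[String] = decoded.split(":", -1)

  def main(args: Array[String]): Unit = {
    // An empty input still yields one element, the empty string itself,
    // so a check like parts.length > 0 does not detect an empty title.
    println(parts("").length) // 1

    // A leading ':' produces an empty first element.
    println(parts(":en:Cola").toList)

    // A normal namespaced title splits into namespace and page name.
    println(parts("Category:Cola brands").toList)
  }
}
```

So an empty first element can mean either "the link started with ':'" or "the whole string was empty", and the two cases probably need to be told apart before the page name is checked.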

Maybe fragment-only links like this should resolve against the title of the page currently being parsed? If someone can comment on this, I can try to propose a fix.
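A rough sketch of what I mean, purely as an assumption about how it could work. `FragmentLinks` and `resolveTarget` are hypothetical names and not part of the extraction framework; the real code would need the current page passed in as context:

```scala
object FragmentLinks {
  // Hypothetical helper: if the link target is fragment-only ("#Section"),
  // prefix it with the title of the page being parsed; otherwise leave it as is.
  def resolveTarget(rawLink: String, currentPageTitle: String): String = {
    val trimmed = rawLink.trim
    if (trimmed.startsWith("#")) currentPageTitle + trimmed
    else trimmed
  }

  def main(args: Array[String]): Unit = {
    // The offending link from the "Cola" page.
    println(resolveTarget("#Brands of Cola", "Cola"))
    // Ordinary targets pass through unchanged.
    println(resolveTarget("Category:Cola brands", "Cola"))
  }
}
```

With this, the title parser would never see an empty page name for same-page anchors.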

Thanks Praveen

praveenbalaji-blippar commented 8 years ago

Here are some more:

WARNING: error processing page 'title=Babylon 5;ns=0/Main/;language:wiki=en,locale=en': org.dbpedia.extraction.wikiparser.WikiParserException: page name must not be empty
WARNING: error processing page 'title=Cola;ns=0/Main/;language:wiki=en,locale=en': org.dbpedia.extraction.wikiparser.WikiParserException: page name must not be empty
WARNING: error processing page 'title=Delaware;ns=0/Main/;language:wiki=en,locale=en': org.dbpedia.extraction.wikiparser.WikiParserException: page name must not be empty
WARNING: error processing page 'title=Electronic music;ns=0/Main/;language:wiki=en,locale=en': org.dbpedia.extraction.wikiparser.WikiParserException: page name must not be empty
WARNING: error processing page 'title=E (mathematical constant);ns=0/Main/;language:wiki=en,locale=en': org.dbpedia.extraction.wikiparser.WikiParserException: page name must not be empty
WARNING: error processing page 'title=Exoplanet;ns=0/Main/;language:wiki=en,locale=en': org.dbpedia.extraction.wikiparser.WikiParserException: page name must not be empty
WARNING: error processing page 'title=Liberation Tigers of Tamil Eelam;ns=0/Main/;language:wiki=en,locale=en': org.dbpedia.extraction.wikiparser.WikiParserException: page name must not be empty
WARNING: error processing page 'title=Malta;ns=0/Main/;language:wiki=en,locale=en': org.dbpedia.extraction.wikiparser.WikiParserException: page name must not be empty
WARNING: error processing page 'title=Reincarnation;ns=0/Main/;language:wiki=en,locale=en': org.dbpedia.extraction.wikiparser.WikiParserException: page name must not be empty
WARNING: error processing page 'title=Tobin tax;ns=0/Main/;language:wiki=en,locale=en': org.dbpedia.extraction.wikiparser.WikiParserException: page name must not be empty
WARNING: error processing page 'title=Velar consonant;ns=0/Main/;language:wiki=en,locale=en': org.dbpedia.extraction.wikiparser.impl.simple.TooManyErrorsException: Too many errors at '| {{IPA|[}}{{IPA bold dark red|kʼ}}{{IPA|an]}}' (line: 95)
WARNING: error processing page 'title=White dwarf;ns=0/Main/;language:wiki=en,locale=en': org.dbpedia.extraction.wikiparser.WikiParserException: page name must not be empty
WARNING: error processing page 'title=2010s;ns=0/Main/;language:wiki=en,locale=en': org.dbpedia.extraction.wikiparser.WikiParserException: page name must not be empty
WARNING: error processing page 'title=Thaana;ns=0/Main/;language:wiki=en,locale=en': org.dbpedia.extraction.wikiparser.impl.simple.TooManyErrorsException: Too many errors at '|}' (line: 146)
WARNING: error processing page 'title=Meteoroid;ns=0/Main/;language:wiki=en,locale=en': org.dbpedia.extraction.wikiparser.WikiParserException: page name must not be empty
WARNING: error processing page 'title=Dental consonant;ns=0/Main/;language:wiki=en,locale=en': org.dbpedia.extraction.wikiparser.impl.simple.TooManyErrorsException: Too many errors at '|{{IPA|[ukʼúk}}<span style="color:#700000">'''{{IPA|ǀ}}'''</span>{{IPA|ola]}}' (line: 122)
WARNING: error processing page 'title=Ainu language;ns=0/Main/;language:wiki=en,locale=en': org.dbpedia.extraction.wikiparser.impl.simple.TooManyErrorsException: Too many errors at '! {{IPA|[k ...38 chars omitted... }} !! {{IPA|[kaʊ]}} !! {{IPA|[kiʊ]}} !! {{IPA|[keʊ]}} !! {{IPA|[koʊ]}} !! {{IPA|[keɪ]}}' (line: 306)
WARNING: error processing page 'title=Dynamic random-access memory;ns=0/Main/;language:wiki=en,locale=en': org.dbpedia.extraction.wikiparser.WikiParserException: page name must not be empty
WARNING: error processing page 'title=Same-sex marriage;ns=0/Main/;language:wiki=en,locale=en': org.dbpedia.extraction.wikiparser.WikiParserException: page name must not be empty
WARNING: error processing page 'title=Exponentiation;ns=0/Main/;language:wiki=en,locale=en': org.dbpedia.extraction.wikiparser.WikiParserException: page name must not be empty
WARNING: error processing page 'title=Music sequencer;ns=0/Main/;language:wiki=en,locale=en': org.dbpedia.extraction.wikiparser.WikiParserException: page name must not be empty
WARNING: error processing page 'title=Malayalam script;ns=0/Main/;language:wiki=en,locale=en': org.dbpedia.extraction.wikiparser.WikiParserException: page name must not be empty
WARNING: error processing page 'title=National Park Service;ns=0/Main/;language:wiki=en,locale=en': org.dbpedia.extraction.wikiparser.WikiParserException: page name must not be empty
WARNING: error processing page 'title=Al Sharpton;ns=0/Main/;language:wiki=en,locale=en': org.dbpedia.extraction.wikiparser.WikiParserException: page name must not be empty
WARNING: error processing page 'title=Large numbers;ns=0/Main/;language:wiki=en,locale=en': org.dbpedia.extraction.wikiparser.WikiParserException: page name must not be empty
WARNING: error processing page 'title=List of Laos-related topics;ns=0/Main/;language:wiki=en,locale=en': org.dbpedia.extraction.wikiparser.WikiParserException: page name must not be empty
WARNING: error processing page 'title=Morchella;ns=0/Main/;language:wiki=en,locale=en': org.dbpedia.extraction.wikiparser.WikiParserException: page name must not be empty
WARNING: error processing page 'title=Non-native pronunciations of English;ns=0/Main/;language:wiki=en,locale=en': org.dbpedia.extraction.wikiparser.impl.simple.TooManyErrorsException: Too many errors at '* In Spani ...211 chars omitted... ounced {{IPA|[e̞sˈto̞mp]}} rather than {{IPA|[stɒmp]}}.<ref name="Goldstein 2005 203"/>' (line: 138)
WARNING: error processing page 'title=Confidence interval;ns=0/Main/;language:wiki=en,locale=en': org.dbpedia.extraction.wikiparser.WikiParserException: page name must not be empty
WARNING: error processing page 'title=Eleftherios Venizelos;ns=0/Main/;language:wiki=en,locale=en': org.dbpedia.extraction.wikiparser.WikiParserException: page name must not be empty
WARNING: error processing page 'title=Israeli West Bank barrier;ns=0/Main/;language:wiki=en,locale=en': org.dbpedia.extraction.wikiparser.WikiParserException: page name must not be empty
WARNING: error processing page 'title=List of ethnic, regional, and folk dances by origin;ns=0/Main/;language:wiki=en,locale=en': org.dbpedia.extraction.wikiparser.WikiParserException: page name must not be empty
WARNING: error processing page 'title=Open access;ns=0/Main/;language:wiki=en,locale=en': org.dbpedia.extraction.wikiparser.WikiParserException: page name must not be empty
WARNING: error processing page 'title=List of prime numbers;ns=0/Main/;language:wiki=en,locale=en': org.dbpedia.extraction.wikiparser.WikiParserException: page name must not be empty

jimkont commented 8 years ago

Thanks for catching this, @praveenbalaji-blippar. I temporarily fixed this at a higher level than the WikiTitle parser. To go that low we would need a way to pass the current page as context, which would require a lot of changes.

In the next full extraction we run, we will see the extent of the problem from the logs and decide how to proceed. If you do your own extraction, can you re-run it and share the error logs?
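To make the idea of a guard above the title parser concrete, here is one way it could look. This is only a sketch under stated assumptions and not necessarily what the actual fix does: `LinkGuard`, `parseTitle` and `parseLinks` are hypothetical names, and `parseTitle` is a stand-in that merely imitates the "page name must not be empty" failure from the logs.

```scala
import scala.util.Try

object LinkGuard {
  // Stand-in for the real title parser (assumption): drops the fragment
  // and rejects an empty page name by throwing an exception.
  def parseTitle(raw: String): String = {
    val pageName = raw.takeWhile(_ != '#').trim
    require(pageName.nonEmpty, "page name must not be empty")
    pageName
  }

  // Parse every link target on its own, so a single bad target is skipped
  // instead of aborting the whole page.
  def parseLinks(rawLinks: Seq[String]): Seq[String] =
    rawLinks.flatMap(raw => Try(parseTitle(raw)).toOption)

  def main(args: Array[String]): Unit = {
    // Targets taken from the {{See also}} line reported for "Cola".
    println(parseLinks(Seq("#Brands of Cola", "Category:Cola brands")))
  }
}
```

The trade-off is that fragment-only links are silently dropped rather than resolved, which keeps the rest of the page but loses the same-page reference.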

praveenbalaji-blippar commented 8 years ago

Thanks @jimkont. I'll rerun the extraction and post logs.

Cheers Praveen
