dbpedia / extraction-framework

The software used to extract structured data from Wikipedia

Failed extraction due to IntermediateNodeMapping #568

Open Floghi opened 5 years ago

Floghi commented 5 years ago

Hello, first thanks for the great job you are doing with extraction-framework!

Since commit e0ce38ee8cadfba1d579963adc95cc9af42bb60c (15 Nov 2018), and more precisely the change at line 46 of core/src/main/scala/org/dbpedia/extraction/mappings/IntermediateNodeMapping.scala:

 //      if(valueNodes.forall(_.size <= 1))
 //        context.recorder[TemplateNode].record(new RecordEntry[TemplateNode](node, node.title.encoded, RecordSeverity.Info, context.language, "IntermediateNodeMapping for multiple properties have multiple values in: " + subjectUri))
 +      if(valueNodes.forall(_.size <= 1))
 +        context.recorder[TemplateNode].record(new RecordEntry[TemplateNode](node, node.title.encoded, RecordSeverity.Info, context.language, "IntermediateNodeMapping for multiple properties have multiple values in: " + subjectUri))

I get errors like the following for every country and place in the French dump:

 Exception; fr; Main Extraction at 00:00.084s for 16 datasets; Main Extraction failed for instance http://fr.dbpedia.org/resource/Autriche: null

Attached are a minimal dump (containing 2 pages from the French wiki dump: "Autriche", which is extracted incorrectly, and "Antoine Meillet", which is extracted correctly) and my properties file, so the problem is easy to reproduce.

Commenting those lines out again restores normal extraction. info.zip

Floghi commented 5 years ago

By debugging with the full stack trace (line 284 of ExtractionRecorder.scala), I found that in the IntermediateNodeMapping case, datasets has size 1 and contains only a null element. An easy fix is to change the line as follows, but I don't know whether it hides a deeper problem:

-    val datasetss = if(datasets.nonEmpty && datasets.size <= 3)
+    val datasetss = if(datasets.nonEmpty && datasets.size <= 3 && !datasets.exists(p => p == null))

JJ-Author commented 5 years ago

@Termilion can you please have a look at this? I also fixed something related for the importer in #567, where the default dataset was overridden to null by some Spark modification. @Floghi thanks for diving into that, but unfortunately I cannot really help with it. By chance: is it possible that something similar to #567 (wrong construction of the Recorder) causes this?

Floghi commented 5 years ago

@JJ-Author Thanks for the insight, the cause turned out to be close to your fix in #567.

The real fix must be done in ConfigLoader.scala

--- a/dump/src/main/scala/org/dbpedia/extraction/dump/extract/ConfigLoader.scala
+++ b/dump/src/main/scala/org/dbpedia/extraction/dump/extract/ConfigLoader.scala
@@ -49,7 +49,8 @@ class ConfigLoader(config: Config)
     extractionRecorder.get(classTag[T]) match{
       case Some(s) => s.get(lang) match {
         case None =>
-          s(lang) = config.getDefaultExtractionRecorder[T](lang, 2000, null, null,  ListBuffer(dataset), extractionMonitor)
+          val datasetsParam = if (dataset == null) ListBuffer[Dataset]() else ListBuffer(dataset)
+          s(lang) = config.getDefaultExtractionRecorder[T](lang, 2000, null, null, datasetsParam, extractionMonitor)
           s(lang).asInstanceOf[ExtractionRecorder[T]]
         case Some(er) =>
           if(dataset != null) if(!er.datasets.contains(dataset)) er.datasets += dataset

When getExtractionRecorder was called without a dataset (the default was null) or with an explicit null, getDefaultExtractionRecorder was simply passed ListBuffer(dataset), i.e. a one-element buffer containing null, which caused the failure later on.
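To illustrate the root cause in isolation, here is a minimal sketch. It assumes nothing about the framework itself: Dataset is a hypothetical stand-in for the real class, and wrapUnsafe / wrapSafe are illustrative names that only mirror the old and the patched call site.

```scala
import scala.collection.mutable.ListBuffer

object NullDatasetDemo {
  // Hypothetical stand-in for the framework's Dataset class.
  final case class Dataset(name: String)

  // Mirrors the old call site: the argument is wrapped unconditionally,
  // so a null dataset produces a one-element buffer holding null.
  def wrapUnsafe(dataset: Dataset): ListBuffer[Dataset] =
    ListBuffer(dataset)

  // Mirrors the patched call site: a null dataset produces an empty buffer.
  def wrapSafe(dataset: Dataset): ListBuffer[Dataset] =
    if (dataset == null) ListBuffer[Dataset]() else ListBuffer(dataset)

  def main(args: Array[String]): Unit = {
    val unsafe = wrapUnsafe(null)
    // The buffer passes a check like `nonEmpty && size <= 3` although it holds only null.
    println(s"unsafe: size=${unsafe.size}, nonEmpty=${unsafe.nonEmpty}, containsNull=${unsafe.contains(null)}")

    val safe = wrapSafe(null)
    println(s"safe: size=${safe.size}, nonEmpty=${safe.nonEmpty}")
  }
}
```

This is why the guard in ExtractionRecorder.scala (datasets.nonEmpty && datasets.size <= 3) let the null element through: the buffer really is non-empty. Building an empty buffer at the source removes the need for the extra null check downstream.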

If it's confirmed by @Termilion you can probably fix it on master

JJ-Author commented 5 years ago

Thanks @Floghi. @Termilion can you confirm that this is the fix?

kurzum commented 5 years ago

@Floghi hi, we needed some time to improve the testing architecture so that we can fix bugs such as yours. I just added your pages to the mini-frwiki.xml.bz2 dump here: https://github.com/dbpedia/extraction-framework/tree/master/dump/src/test/resources. There is some documentation here: https://forum.dbpedia.org/t/new-ci-tests-on-dbpedia-releases/77/3

Could you tell us which extractors you used or send us your config file?

Floghi commented 5 years ago

Hey @kurzum, thanks, I will check your links.

My config file was in the zip attached to my first post, i.e.

log-dir=/model-quickstarter/wdir/log
base-dir=/model-quickstarter/wdir/frFR
wiki=fr
locale=fr
source=dump.xml
require-download-complete=false
languages=fr
ontology=../ontology.xml
mappings=../mappings
uri-policy.uri=uri:en; generic:en; xml-safe-predicates:*
format.nt.gz=n-triples;uri-policy.uri
extractors=.RedirectExtractor,.DisambiguationExtractor,.MappingExtractor

Vehnem commented 3 years ago

http://dief.tools.dbpedia.org/server/extraction/fr/extract?title=Autriche&revid=&format=trix&extractors=mappings works here