Open Floghi opened 5 years ago
By debugging from the full stack trace (line 284 of ExtractionRecorder.scala), I found that in the IntermediateNodeMapping case `datasets` has size 1 and contains only a null element. An easy fix is to change the line as follows, but I don't know whether it hides a deeper problem:
- val datasetss = if(datasets.nonEmpty && datasets.size <= 3)
+ val datasetss = if(datasets.nonEmpty && datasets.size <= 3 && !datasets.exists(p => p == null))
@Termilion can you please have a look at it? I also fixed something related for the importer in #567, where the default dataset was overridden to null by some Spark modification. @Floghi thanks for diving into that, but unfortunately I cannot really help with it. By chance: is it possible that something similar to #567 (wrong construction of the Recorder) causes this?
@JJ-Author Thanks for the insight, it wasn't far from your fix in #567.
The real fix must be done in ConfigLoader.scala:
--- a/dump/src/main/scala/org/dbpedia/extraction/dump/extract/ConfigLoader.scala
+++ b/dump/src/main/scala/org/dbpedia/extraction/dump/extract/ConfigLoader.scala
@@ -49,7 +49,8 @@ class ConfigLoader(config: Config)
     extractionRecorder.get(classTag[T]) match{
       case Some(s) => s.get(lang) match {
         case None =>
-          s(lang) = config.getDefaultExtractionRecorder[T](lang, 2000, null, null, ListBuffer(dataset), extractionMonitor)
+          val datasetsParam = if (dataset == null) ListBuffer[Dataset]() else ListBuffer(dataset)
+          s(lang) = config.getDefaultExtractionRecorder[T](lang, 2000, null, null, datasetsParam, extractionMonitor)
           s(lang).asInstanceOf[ExtractionRecorder[T]]
         case Some(er) =>
           if(dataset != null) if(!er.datasets.contains(dataset)) er.datasets += dataset
When getExtractionRecorder was called without a dataset (the default was null) or with null explicitly, getDefaultExtractionRecorder was simply given ListBuffer(dataset), i.e. a one-element buffer holding null, which led to the issue later on.
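To make the explanation above concrete, here is a minimal standalone sketch of why `ListBuffer(dataset)` with a null `dataset` passes a `nonEmpty` / `size <= 3` check, and how the guard from the patch avoids it. The `Dataset` case class and the `NullDatasetDemo` object are hypothetical stand-ins for illustration, not the framework's actual classes.

```scala
import scala.collection.mutable.ListBuffer

// Hypothetical stand-in for the framework's Dataset class (assumption).
final case class Dataset(name: String)

object NullDatasetDemo {
  // Same guard as in the ConfigLoader patch: a null dataset yields an empty buffer.
  def datasetsParam(dataset: Dataset): ListBuffer[Dataset] =
    if (dataset == null) ListBuffer[Dataset]() else ListBuffer(dataset)

  // Same shape as the condition checked in ExtractionRecorder before the workaround.
  def passesSizeCheck(datasets: ListBuffer[Dataset]): Boolean =
    datasets.nonEmpty && datasets.size <= 3

  def main(args: Array[String]): Unit = {
    val missing: Dataset = null

    // Unguarded construction: a buffer with exactly one element, which is null.
    val unguarded = ListBuffer(missing)
    println(passesSizeCheck(unguarded))      // the null slips through the check
    println(unguarded.exists(_ == null))     // what the workaround tests for

    // Guarded construction: an empty buffer, so the check fails cleanly.
    println(passesSizeCheck(datasetsParam(missing)))
  }
}
```

Building the buffer correctly at the call site means the recorder never sees a null element, so the extra `exists(p => p == null)` check becomes a safety net rather than the fix itself.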
If it's confirmed by @Termilion, you can probably fix it on master.
Thanks @Floghi. @Termilion can you confirm that this is the fix?
@Floghi hi, we needed some time to improve the testing architecture to fix bugs such as yours. I just added your pages to the mini-frwiki.xml.bz2 dump here: https://github.com/dbpedia/extraction-framework/tree/master/dump/src/test/resources Some documentation is here: https://forum.dbpedia.org/t/new-ci-tests-on-dbpedia-releases/77/3
Could you tell us which extractors you used or send us your config file?
Hey @kurzum, thanks, I will check your links.
My config file was in the zip attached to my first post, i.e.:
log-dir=/model-quickstarter/wdir/log
base-dir=/model-quickstarter/wdir/frFR
wiki=fr
locale=fr
source=dump.xml
require-download-complete=false
languages=fr
ontology=../ontology.xml
mappings=../mappings
uri-policy.uri=uri:en; generic:en; xml-safe-predicates:*
format.nt.gz=n-triples;uri-policy.uri
extractors=.RedirectExtractor,.DisambiguationExtractor,.MappingExtractor
Hello, first of all, thanks for the great job you are doing with extraction-framework!
Since commit e0ce38ee8cadfba1d579963adc95cc9af42bb60c (15 Nov 2018), and more precisely the modification at line 46 of core/src/main/scala/org/dbpedia/extraction/mappings/IntermediateNodeMapping.scala, I get errors like this for each country and place in the French dump:
Exception; fr; Main Extraction at 00:00.084s for 16 datasets; Main Extraction failed for instance http://fr.dbpedia.org/resource/Autriche: null
Attached are a minimal dump (containing 2 pages from the French wiki dump: "Autriche", badly extracted, and "Antoine Meillet", correctly extracted) and my properties file, so the problem is easy to reproduce.
When I comment those lines out, the extraction is back to normal. info.zip