dbpedia / distributed-extraction-framework

DBpedia Distributed Extraction Framework: Extract structured data from Wikipedia in a parallel, distributed manner

Add progress logging at the end of the job to print node-wise statistics #29

Open nilesh-c opened 10 years ago

nilesh-c commented 10 years ago

Currently the final lines of the extraction output look like:

Jul 09, 2014 5:19:20 AM org.dbpedia.extraction.mappings.DistRedirects$ load
INFO: Will extract redirects from source for li wiki, could not load cache file '/home/nilesh/gsoc14/out10/liwiki/20140410/liwiki-20140410-template-redirects.obj': java.io.FileNotFoundException: File /home/nilesh/gsoc14/out10/liwiki/20140410/liwiki-20140410-template-redirects.obj does not exist
Jul 09, 2014 5:19:20 AM org.dbpedia.extraction.mappings.DistRedirects$ loadFromRDD
INFO: Loading redirects from source (li)
14/07/09 05:19:20 INFO DBpediaJobProgressListener: Started job #0
14/07/09 05:19:20 INFO DBpediaJobProgressListener: Stage #0: Starting stage collectAsMap at DistRedirects.scala:149 with 8 tasks at 00:00.000s
14/07/09 05:19:23 INFO DBpediaJobProgressListener: Stage #0: Started task #0 on host archangel-lapi7, executor 1 at 00:05.815s. Total tasks submitted: 1
14/07/09 05:19:23 INFO DBpediaJobProgressListener: Stage #0: Started task #1 on host archangel-lapi7, executor 1 at 00:05.833s. Total tasks submitted: 2
14/07/09 05:19:23 INFO DBpediaJobProgressListener: Stage #0: Started task #2 on host archangel-lapi7, executor 1 at 00:05.834s. Total tasks submitted: 3
14/07/09 05:19:23 INFO DBpediaJobProgressListener: Stage #0: Started task #3 on host archangel-lapi7, executor 1 at 00:05.834s. Total tasks submitted: 4
14/07/09 05:19:23 INFO DBpediaJobProgressListener: Stage #0: Started task #4 on host archangel-lapi7, executor 1 at 00:05.835s. Total tasks submitted: 5
14/07/09 05:19:23 INFO DBpediaJobProgressListener: Stage #0: Started task #5 on host archangel-lapi7, executor 1 at 00:05.836s. Total tasks submitted: 6
14/07/09 05:19:23 INFO DBpediaJobProgressListener: Stage #0: Started task #6 on host archangel-lapi7, executor 1 at 00:05.837s. Total tasks submitted: 7
14/07/09 05:19:23 INFO DBpediaJobProgressListener: Stage #0: Started task #7 on host archangel-lapi7, executor 1 at 00:05.838s. Total tasks submitted: 8
14/07/09 05:19:31 INFO DBpediaJobProgressListener: Stage #0: Finished task #0 at 00:13.276s. Completed: 1/8
14/07/09 05:19:31 INFO DBpediaJobProgressListener: Stage #0: Finished task #6 at 00:13.416s. Completed: 2/8
14/07/09 05:19:31 INFO DBpediaJobProgressListener: Stage #0: Finished task #1 at 00:13.430s. Completed: 3/8
14/07/09 05:19:31 INFO DBpediaJobProgressListener: Stage #0: Finished task #5 at 00:13.556s. Completed: 4/8
14/07/09 05:19:31 INFO DBpediaJobProgressListener: Stage #0: Finished task #3 at 00:13.567s. Completed: 5/8
14/07/09 05:19:31 INFO DBpediaJobProgressListener: Stage #0: Finished task #7 at 00:13.665s. Completed: 6/8
14/07/09 05:19:31 INFO DBpediaJobProgressListener: Stage #0: Finished task #4 at 00:13.683s. Completed: 7/8
14/07/09 05:19:31 INFO DBpediaJobProgressListener: Stage #0: Finished task #2 at 00:13.791s. Completed: 8/8
14/07/09 05:19:31 INFO DBpediaJobProgressListener: Stage #0: Finished stage collectAsMap at DistRedirects.scala:149 at 00:13.800s
14/07/09 05:19:31 INFO DBpediaJobProgressListener: Finished job #0
Jul 09, 2014 5:19:31 AM org.dbpedia.extraction.mappings.DistRedirects$ loadFromRDD
INFO: Redirects loaded from source (li)
Jul 09, 2014 5:19:31 AM org.dbpedia.extraction.mappings.DistRedirects$ load
INFO: 101 redirects written to cache file /home/nilesh/gsoc14/out10/liwiki/20140410/liwiki-20140410-template-redirects.obj
Jul 09, 2014 5:19:32 AM org.dbpedia.extraction.dump.extract.DistExtractionJob run
INFO: li: 14 extractors (ArticleCategoriesExtractor,ArticleTemplatesExtractor,CategoryLabelExtractor,ExternalLinksExtractor,GeoExtractor,InterLanguageLinksExtractor,LabelExtractor,PageIdExtractor,PageLinksExtractor,RedirectExtractor,RevisionIdExtractor,ProvenanceExtractor,SkosCategoriesExtractor,ArticlePageExtractor), 14 datasets (page_links,revision_ids,page_ids,revision_uris,article_categories,skos_categories,labels,wikipedia_links,external_links,redirects,geo_coordinates,article_templates,category_labels,interlanguage_links) started
Jul 09, 2014 5:19:32 AM org.dbpedia.extraction.dump.extract.DistExtractionJob run
INFO: Writing outputs to destination...
14/07/09 05:19:32 INFO DBpediaJobProgressListener: Started job #1
14/07/09 05:19:32 INFO DBpediaJobProgressListener: Stage #1: Starting stage saveAsNewAPIHadoopFile at DistDeduplicatingWriterDestination.scala:35 with 8 tasks at 00:00.000s
14/07/09 05:19:32 INFO DBpediaJobProgressListener: Stage #1: Started task #8 on host archangel-lapi7, executor 1 at 00:14.756s. Total tasks submitted: 1
14/07/09 05:19:32 INFO DBpediaJobProgressListener: Stage #1: Started task #9 on host archangel-lapi7, executor 1 at 00:14.757s. Total tasks submitted: 2
14/07/09 05:19:32 INFO DBpediaJobProgressListener: Stage #1: Started task #10 on host archangel-lapi7, executor 1 at 00:14.758s. Total tasks submitted: 3
14/07/09 05:19:32 INFO DBpediaJobProgressListener: Stage #1: Started task #11 on host archangel-lapi7, executor 1 at 00:14.759s. Total tasks submitted: 4
14/07/09 05:19:32 INFO DBpediaJobProgressListener: Stage #1: Started task #12 on host archangel-lapi7, executor 1 at 00:14.760s. Total tasks submitted: 5
14/07/09 05:19:32 INFO DBpediaJobProgressListener: Stage #1: Started task #13 on host archangel-lapi7, executor 1 at 00:14.761s. Total tasks submitted: 6
14/07/09 05:19:32 INFO DBpediaJobProgressListener: Stage #1: Started task #14 on host archangel-lapi7, executor 1 at 00:14.762s. Total tasks submitted: 7
14/07/09 05:19:32 INFO DBpediaJobProgressListener: Stage #1: Started task #15 on host archangel-lapi7, executor 1 at 00:14.763s. Total tasks submitted: 8
14/07/09 05:19:39 INFO DBpediaJobProgressListener: Stage #1: Finished task #15 at 00:21.974s. Completed: 1/8
14/07/09 05:19:40 INFO DBpediaJobProgressListener: Stage #1: Finished task #12 at 00:22.097s. Completed: 2/8
14/07/09 05:19:40 INFO DBpediaJobProgressListener: Stage #1: Finished task #11 at 00:22.235s. Completed: 3/8
14/07/09 05:19:40 INFO DBpediaJobProgressListener: Stage #1: Finished task #8 at 00:22.296s. Completed: 4/8
14/07/09 05:19:40 INFO DBpediaJobProgressListener: Stage #1: Finished task #9 at 00:22.774s. Completed: 5/8
14/07/09 05:19:40 INFO DBpediaJobProgressListener: Stage #1: Finished task #14 at 00:22.855s. Completed: 6/8
14/07/09 05:19:40 INFO DBpediaJobProgressListener: Stage #1: Finished task #10 at 00:22.862s. Completed: 7/8
14/07/09 05:19:41 INFO DBpediaJobProgressListener: Stage #1: Finished task #13 at 00:23.052s. Completed: 8/8
14/07/09 05:19:41 INFO DBpediaJobProgressListener: Stage #1: Finished stage saveAsNewAPIHadoopFile at DistDeduplicatingWriterDestination.scala:35 at 00:23.058s
14/07/09 05:19:41 INFO DBpediaJobProgressListener: Finished job #1
li: extracted 16556 pages in 00:08.601s (per page: 0.519510 ms; failed pages: 0).
Jul 09, 2014 5:19:41 AM org.dbpedia.extraction.dump.extract.DistExtractionJob run
INFO: li: 14 extractors (ArticleCategoriesExtractor,ArticleTemplatesExtractor,CategoryLabelExtractor,ExternalLinksExtractor,GeoExtractor,InterLanguageLinksExtractor,LabelExtractor,PageIdExtractor,PageLinksExtractor,RedirectExtractor,RevisionIdExtractor,ProvenanceExtractor,SkosCategoriesExtractor,ArticlePageExtractor), 14 datasets (page_links,revision_ids,page_ids,revision_uris,article_categories,skos_categories,labels,wikipedia_links,external_links,redirects,geo_coordinates,article_templates,category_labels,interlanguage_links) finished

It would be good to also print node-wise statistics at the end of the job, e.g. lines like "node X: Y pages written".
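A minimal sketch of the aggregation this would need, in plain Scala with no Spark dependency. Everything here is hypothetical and not part of the framework: `TaskFinished` stands in for whatever the progress listener receives when a task ends, and `NodeStats` is an illustrative name. It assumes the listener can see both the host (the log above already prints it per task) and a per-task count of pages written; where that count would come from is left open.

```scala
// Hypothetical event: one finished task, the host it ran on, and how many
// pages it wrote. In the real job these values would have to be supplied by
// the listener; this type does not exist in the framework.
final case class TaskFinished(host: String, pagesWritten: Long)

// Accumulates pages written per host and renders the summary lines.
final class NodeStats {
  private val pagesByHost = scala.collection.mutable.Map.empty[String, Long]

  // Listener callbacks may arrive from a different thread, so updates are synchronized.
  def record(event: TaskFinished): Unit = synchronized {
    val soFar = pagesByHost.getOrElse(event.host, 0L)
    pagesByHost.update(event.host, soFar + event.pagesWritten)
  }

  // One line per host, sorted by host name so the output order is stable.
  def summaryLines: Seq[String] = synchronized {
    pagesByHost.toSeq.sortBy(_._1).map { case (host, pages) =>
      s"node $host: $pages pages written"
    }
  }
}

object NodeStatsDemo {
  def main(args: Array[String]): Unit = {
    val stats = new NodeStats
    // Made-up counts for illustration; "worker-2" is a hypothetical second host.
    Seq(
      TaskFinished("archangel-lapi7", 1200L),
      TaskFinished("worker-2", 500L),
      TaskFinished("archangel-lapi7", 800L)
    ).foreach(stats.record)

    stats.summaryLines.foreach(println)
    // prints:
    // node archangel-lapi7: 2000 pages written
    // node worker-2: 500 pages written
  }
}
```

The summary would be printed once, after the "Finished job" line, so it does not interleave with the per-task progress messages.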

nilesh-c commented 10 years ago

@jimkont Once the 3 currently pending PRs are merged to master, it'd be great if you could test the framework yourself and let me know whether the logging is satisfactory and whether we can close this. I'll update the README right now so that the instructions are all in there.