Homepage: http://dbpedia.org
Documentation: http://dev.dbpedia.org/Extraction
Get in touch with DBpedia: https://wiki.dbpedia.org/join/get-in-touch
Slack: join the #dev-team slack channel within the the DBpedia Slack workspace - the main point for developement updates and discussions
DBpedia is a crowd-sourced community effort to extract structured information from Wikipedia and make this information available on the Web. DBpedia allows you to ask sophisticated queries against Wikipedia, and to link the different data sets on the Web to Wikipedia data. We hope that this work will make it easier for the huge amount of information in Wikipedia to be used in some new interesting ways. Furthermore, it might inspire new mechanisms for navigating, linking, and improving the encyclopedia itself.
To check out the projects of DBpedia, visit the official DBpedia website.
Running the extraction framework is a relatively complex task which is in details documented in the advanced QuickStart guide. To run the extraction process same as the DBpedia core team does, you can do using the MARVIN release bot. The MARVIN bot automates the overall extraction process, from downloading the ontology, mappings and Wikipedia dumps, to extraction and post-processing the data.
git clone https://git.informatik.uni-leipzig.de/dbpedia-assoc/marvin-config
cd marvin-config
./setup-or-reset-dief.sh
# test run Romanian extraction, very small
./marvin_extraction_run.sh test
# around 4-7 days
./marvin_extraction_run.sh generic
If you plan to work on improving the codebase of the framework you would need to run the extraction framework alone as described in the QuickStart guide. This is highly recommended, since during this process you will learn a lot about the extraction framework.
Extractors represent the core of the extraction framework. So far, many extractors have been developed for extraction of particular information from different Wikimedia projects. To learn more, check the New Extractors guide, which explains the process of writing new extractor.
Check the Debugging Guide and learn how to debug the extraction framework.
In order to speed up the extraction process, the extraction framework has been adopted to run on Apache Spark. Currently, more than half of the extractors can be executed using Spark. The extraction process using Spark is a slightly different process and requires different Execution. Check the QuickStart guide on how to run the extraction using Apache Spark.
Note: if possible, new extractors should be implemented using Apache Spark. To learn more, check the New Extractors guide, which explains the process of writing new extractor.
The DBpedia community uses a flexible and extensible framework to extract different kinds of structured information from Wikipedia. The DBpedia extraction framework is written using Scala 2.8. The framework is available from the DBpedia Github repository (GNU GPL License). The change log may reveal more recent developments. More recent configuration options can be found here: https://github.com/dbpedia/extraction-framework/wiki
The DBpedia extraction framework is structured into different modules
Components
In addition to the core components, a number of utility packages offers essential functionality to be used by the extraction code:
More recent configuration options can be found here: https://github.com/dbpedia/extraction-framework/wiki/Extraction-Instructions.
To know more about the extraction framework, click here
If you want to work on one of the issues, assign yourself to it or at least leave a comment that you are working on it and how.
If you have an idea for a new feature, make an issue first, assign yourself to it, then start working.
Please make sure you have read the Developer's Certificate of Origin, further down on this page!
git clone <your_repo_url_on_github>
).dev
branch (git checkout dev
).git checkout dev -b fixRestApiParams
).rebase
your branch onto extraction-framework/dev
(git pull --rebase git://github.com/dbpedia/extraction-framework.git
). Resolve possible conflicts and commit.git push origin fixRestApiParams
).extraction-framework/dev
via GitHub.
extraction-framework/dev
, finally the dev
together with your improvements will be merged with the master
branch.Please keep in mind:
rebase
from extraction-framework/master
). Only rebase your branch onto the dev branch, if and only if nobody already pulled from the development branch!--force
to the push command.More tips:
By sending a pull request to the extraction-framework repository on GitHub, you implicitly accept the Developer's Certificate of Origin 1.1
The source code is under the terms of the GNU General Public License, version 2.