Important Links: GitBook Documentation, Join Discord, Databus used in the DBpedia Project
The DBpedia Databus is a transformative platform for agile data integration, collaboration, and automation built on a structured metadata Knowledge Graph. It brings well-known Software Engineering concepts such as agile rapid prototyping, build automation and test-driven development to Data Engineering, and connects data via a loosely-coupled bus system and a common but extensible metadata format, using state-of-the-art semantic technologies such as ontologies, SPARQL, SHACL and Linked Data.
🔥 Hot Fact: The DBpedia Databus condenses 15 years of expertise from the DBpedia Community into an accessible open-source tool.
Communities, public organisations, researchers and data enthusiasts can deploy their own DBpedia Databus as a powerful productivity tool to foster collaboration on Open Data. Such participatory models, with community feedback and upstream contributions, are drivers for Democratized Knowledge, Accelerated Innovation and Transparent Governance. Contrary to conventional publisher-centric data “publishing” platforms, the growing Databus Network is optimised for data consumers by innovating discovery, findability and access. Data flows to where it is needed and appreciated - the people who build amazing things with it.
💡 Spotlight: The Databus Network places the power in the hands of data consumers, streamlining discovery and access. Data flows to where it is valued and ignites creativity.
The DBpedia Databus addresses a significant gap in tooling: the “pre- and post-processing” that wastes months of precious time in every data-intensive project. DBpedia’s tech stack comprises complex extraction software producing data releases, an online database and several web services. We fully automated the pipeline and application deployment with the Databus and saved 90% of the effort (from 2 full-time engineers to 1 part-time engineer), while increasing productivity 5-fold by shortening the release cycle from 17 to 3 months with improved quality, i.e. automated data validation tests over 14 billion facts. Benefit now from our power tool, which tackles the pain points in Data Engineering: efficiency, automation, scalability and data quality. Databus provides an efficient environment for the whole pipeline: from initial identification and acquisition of data from existing Databuses, through low-level tasks such as conversion and normalization, to data-quality control, debugging pipelines, and loading the data into the final application.
💪 Strength: The DBpedia Databus addresses industry pain points head-on – efficiency, automation, scalability, and data quality – ensuring your data projects are set for success.
Databus is designed as a lightweight and agile solution and fits seamlessly into existing environments.
We identified these deployment levels with our partners:

- 🚀 Try our quickstart guide for downloading data from DBpedia (no registration necessary). Currently, 380,000 files are available on ~30 servers via https://databus.dbpedia.org alone, serving ~200,000 requests per day.
- Deploy your own bus for your own data with this GitHub repo.
The DBpedia Databus, maintained by the Institute for Applied Informatics (InfAI), Leipzig, is not just a tool but a catalyst for data innovation. Our team is eager to connect, collaborate, and form strategic partnerships to shape the future of data management.
If you're interested in exploring collaborations, encountering issues, or just have a question, we'd love to hear from you:
Your interest and involvement will greatly contribute to the Databus community. Let's shape the future of data management together.
Get in contact via the informal dev channel on Discord or reach out to the Databus Management Team to explore partnership opportunities. Your data journey transformation begins here.
Currently, we are migrating databus.dbpedia.org and energy.databus.dbpedia.org to 2.1.0-rc8; further beta testing will follow.
Development of the Databus started in 2018 as a means to manage the DBpedia Knowledge Graph extraction more efficiently. During the first 5 years, we battle-tested the Databus in the public beta at databus.dbpedia.org and refined the metadata model. Since the first public release, version 2.1.0, the core model (the Databus Ontology) has been stable.
ℹ️ Learn how to do a full roundtrip: prepare data, deploy a Databus, upload data, query, and download. Alternatively, you can start with just the download guide on existing Databuses.
Databus does not store the data itself, only metadata. Therefore, before running the server, we need to publish our data on the internet and make it publicly available, normally via HTTPS. This step requires one or more URIs resolving to the actual data files for download.
As an example, we can publish a single file here, e.g. this README.md. Our URI is (note that we use a permalink to a particular commit, because published files must be static; see more in our Publishing Guide):

https://raw.githubusercontent.com/dbpedia/databus/68f976e29e2db15472f1b664a6fd5807b88d1370/README.md
ℹ️ Explanation and variants: the URI must resolve publicly via HTTP(S). Variants that do not work include URLs with embedded credentials (https://user:pass@example.com), local IPs ('127.0.0.1' or '192.168.x.x'), and file:// URIs.
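These constraints can be pre-checked before publishing. The following is a small illustrative helper (not part of Databus), sketched under the assumptions above:

```python
from urllib.parse import urlparse
import ipaddress

def is_publishable(uri: str) -> bool:
    """Pre-check a download URI: public http(s), no embedded
    credentials, no local/private addresses. Illustrative only."""
    p = urlparse(uri)
    if p.scheme not in ("http", "https"):
        return False                # e.g. file:// URIs
    if p.username or p.password:
        return False                # https://user:pass@example.com
    if p.hostname is None:
        return False                # no host at all
    try:
        addr = ipaddress.ip_address(p.hostname)
        if addr.is_private or addr.is_loopback:
            return False            # 127.0.0.1, 192.168.x.x, ...
    except ValueError:
        pass                        # hostname is a domain name, not an IP
    return True

print(is_publishable("https://raw.githubusercontent.com/dbpedia/databus/68f976e29e2db15472f1b664a6fd5807b88d1370/README.md"))  # True
print(is_publishable("file:///tmp/data.ttl"))  # False
```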
In order to run the Databus on-premise you will need `docker` and `docker-compose` installed on your machine:

- `docker`: 20.10.2 or higher
- `docker-compose`: 1.25.0 or higher

Clone the repository or download the `docker-compose.yml` and `.env` files. Both files need to exist in the same directory. Navigate to the directory with the files (root of the repo) and run:

    docker-compose up

The Databus should be available at http://localhost:3000.
ℹ️ Further notes: for a production deployment, serve the Databus under your own hostname (e.g. databus.example.org) instead of localhost.
To publish an artifact you need to create a Databus account. After creating an account, log in, click on your account's icon and then Publish Data. Fill in the form for publishing and submit. For simplicity, you can enter any name for group, artifact and version. Use the URI of the file we prepared for publishing (https://raw.githubusercontent.com/dbpedia/databus/68f976e29e2db15472f1b664a6fd5807b88d1370/README.md) in the Files section. After publishing, the data should be visible under account icon -> My Account -> Data tab.
After files are published, we can perform queries. Databus offers two mechanisms for that: a SPARQL endpoint and Collections.
Collections (user-created data catalogues) allow you to flexibly combine files and artifacts and to share collection links. Collections also provide a tool to build, store and share SPARQL queries.
Read more here.
The SPARQL endpoint at localhost:3000/sparql lets you run queries directly. Use the following query to retrieve all download links available on a Databus. The link you uploaded in the previous step should appear in the result. See more example SPARQL queries in examples.
    PREFIX dcat: <http://www.w3.org/ns/dcat#>
    SELECT ?file WHERE {
      ?distribution dcat:downloadURL ?file .
    }
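The endpoint returns standard SPARQL 1.1 JSON results. As a sketch, the download URLs can be extracted from such a response like this (the payload below is a hand-made example, not real Databus output):

```python
import json

# Hand-made example of a SPARQL 1.1 JSON results payload, in the shape
# a standards-compliant endpoint such as <databus>/sparql returns.
payload = json.loads("""
{
  "head": {"vars": ["file"]},
  "results": {"bindings": [
    {"file": {"type": "uri", "value": "https://example.org/data/file1.ttl"}},
    {"file": {"type": "uri", "value": "https://example.org/data/file2.ttl"}}
  ]}
}
""")

def download_urls(results: dict) -> list[str]:
    """Collect the ?file bindings from a SPARQL JSON result set."""
    return [b["file"]["value"] for b in results["results"]["bindings"]]

print(download_urls(payload))
```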
ℹ️ SPARQL supports the SERVICE keyword, which allows federated querying over several Databuses.
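As a sketch, a federated query could pull download links from a second Databus; the remote endpoint URL below is a hypothetical example:

```sparql
PREFIX dcat: <http://www.w3.org/ns/dcat#>
SELECT ?file WHERE {
  # Delegate this pattern to a remote Databus SPARQL endpoint (hypothetical URL)
  SERVICE <https://databus.example.org/sparql> {
    ?distribution dcat:downloadURL ?file .
  }
}
```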
Databus offers metadata extensions using Mods. You can read about them in more detail here.
Instead of using the GUI, you can automate your publishing and data retrieval process using our HTTP API. Refer to it here.
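For scripted retrieval, queries can also be sent to the SPARQL endpoint programmatically. The sketch below only constructs the request (assuming the local quickstart deployment at localhost:3000), so it runs without a server; pass the result to urllib.request.urlopen() against a live Databus:

```python
from urllib.parse import urlencode
from urllib.request import Request

# Assumed endpoint of the local quickstart deployment; adjust as needed.
DATABUS = "http://localhost:3000"

def sparql_request(query: str) -> Request:
    """Build a GET request for the Databus SPARQL endpoint."""
    url = f"{DATABUS}/sparql?" + urlencode({"query": query})
    return Request(url, headers={"Accept": "application/sparql-results+json"})

req = sparql_request(
    "PREFIX dcat: <http://www.w3.org/ns/dcat#> "
    "SELECT ?file WHERE { ?distribution dcat:downloadURL ?file }"
)
print(req.full_url)
```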
We use milestones that are roughly 3 months long, see here. Issues are sorted into these milestones as a rough orientation for when they will be tackled. Of course, if they are delayed or prove too difficult, we push them back to the next milestone. Issues that are clear candidates to be pushed back are labeled stretch task. Milestone 2.x.x is the backlog and might be picked from if no other issues are more urgent or important. Note that we have a soft voting mechanism: adding a 👍 reaction to an issue (under the post) helps us prioritize.
Please report issues in our GitHub repository.
If you would like to submit a non-trivial patch or pull request, we will need you to sign the Contributor License Agreement; we will send it to you in that case.
The source code of this repo is published under the Apache License, Version 2.0.
Databus is configured so that the default license of all metadata is CC0, which applies to all data of the model, i.e. who published which data, when, and under which license.
The individual datasets are referenced via links (dcat:downloadURL) and can have any license.
This work was partially supported by grants from the German Federal Ministry for Economic Affairs and Climate Action (BMWK) to the projects LOD-GEOSS (03EI1005E) and PLASS (01MD19003D).