ad-freiburg / qlever

Very fast SPARQL Engine, which can handle very large knowledge graphs like the complete Wikidata, offers context-sensitive autocompletion for SPARQL queries, and allows combination with text search. It's faster than engines like Blazegraph or Virtuoso, especially for queries involving large result sets.
Apache License 2.0

Feature Request: Support for Offline Triplestore Dump to RDF Formats #1291

Open arcangelo7 opened 8 months ago

arcangelo7 commented 8 months ago

Hello QLever Team,

I've been exploring the capabilities of QLever and its control script, qlever-control, for managing SPARQL queries and datasets. To the best of my knowledge, I couldn't find a feature that allows for dumping the entire triplestore to an RDF file. This functionality is crucial for handling very large datasets efficiently.

For large triplestores, the approach of using SPARQL queries with OFFSET and LIMIT to paginate through results for dumping data becomes impractical due to time constraints. Similarly, attempting a single massive CONSTRUCT query to dump the entire dataset is not feasible due to memory limitations.
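To make the scale concrete, here is a minimal sketch of the LIMIT/OFFSET pagination approach described above. The function and page size are hypothetical illustrations, not part of QLever or qlever-control:

```python
def paginated_construct_queries(page_size, total_triples):
    """Yield CONSTRUCT queries that page through all triples via LIMIT/OFFSET."""
    offset = 0
    while offset < total_triples:
        yield (
            "CONSTRUCT { ?s ?p ?o } WHERE { ?s ?p ?o } "
            f"LIMIT {page_size} OFFSET {offset}"
        )
        offset += page_size

# With ~4.2 billion triples and pages of one million triples, more than
# 4,000 queries are needed, and each later page forces the engine to
# skip ever more solutions before producing its result.
queries = list(paginated_construct_queries(1_000_000, 4_236_287_432))
```

Even ignoring per-query overhead, the cost of the OFFSET skip grows with each page, which is what makes this approach impractical at billions of triples.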

In comparison, Blazegraph offers a solution for this issue with its com.bigdata.rdf.sail.ExportKB class, enabling offline dumps of the triplestore in various formats such as N-Quads, JSON-LD, etc. This feature significantly simplifies managing and archiving large datasets.
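For reference, a Blazegraph offline export is typically invoked along these lines. The jar path, properties file, and option names below are illustrative assumptions and should be checked against the Blazegraph documentation; the command is only echoed here, since running it requires an actual Blazegraph journal:

```shell
# Illustrative only: paths and option names are assumptions, not verified flags.
BLAZEGRAPH_JAR="blazegraph.jar"     # hypothetical path to the Blazegraph jar
PROPERTIES="RWStore.properties"     # hypothetical journal properties file
OUTDIR="dump"                       # output directory for the exported files

CMD="java -cp ${BLAZEGRAPH_JAR} com.bigdata.rdf.sail.ExportKB -outdir ${OUTDIR} -format N-Quads ${PROPERTIES}"
echo "${CMD}"
```

The key point is that the export reads the journal directly, offline, rather than going through the SPARQL endpoint.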

My use case involves working with OpenCitations Meta, which comprises 4,236,287,432 triples for data and an additional 5,540,033,781 triples for provenance. Being able to dump our data from the triplestore into RDF formats is essential for our operations, and a similar feature in QLever would greatly benefit us and likely many others in the community.

Could you consider adding such a feature to QLever or qlever-control? An offline dump feature for the triplestore that supports multiple RDF formats would be a tremendous asset, especially for those of us dealing with extensive datasets.

Thank you for considering this request. Your efforts in developing and maintaining QLever are greatly appreciated.

hannahbast commented 8 months ago

@arcangelo7 Two questions:

  1. Can you briefly explain the advantage of dumping the complete dataset from a SPARQL endpoint vs. just downloading the dataset from which the SPARQL endpoint was constructed?

  2. What exactly is impractical about multiple queries using OFFSET and LIMIT? QLever does not support OFFSET for ?s ?p ?o queries yet, but that would be a relatively easy fix.