ad-freiburg / qlever

Very fast SPARQL Engine, which can handle very large knowledge graphs like the complete Wikidata, offers context-sensitive autocompletion for SPARQL queries, and allows combination with text search. It's faster than engines like Blazegraph or Virtuoso, especially for queries involving large result sets.
Apache License 2.0

Skip parse error #1583

Open debayan opened 1 week ago

debayan commented 1 week ago

Is there an option in QLever to skip lines that produce parse errors during indexing?

hannahbast commented 3 days ago

@debayan Can you give an example of the kind of line you would like to skip?

Roughly speaking, there are two sorts of errors in RDF input:

  1. Errors that, strictly speaking, violate the standard, but are kind of OK to accept. For example, an IRI that contains a space. QLever currently outputs a WARNing for those, but accepts them.

  2. Errors that should really be fixed by the producers of the dataset because they point to a deeper problem. For example, an N-Triples file containing a line where the object is missing, or where the closing `"` of a literal is missing.
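To make the two categories concrete, here are hypothetical N-Triples lines (not from any real dump) illustrating each:

```
# Category 1: IRI containing a space — violates the grammar, but the intended triple is clear
<http://example.org/a b> <http://example.org/p> <http://example.org/o> .

# Category 2: object missing — there is no sensible way to recover the triple
<http://example.org/s> <http://example.org/p> .
```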

debayan commented 3 days ago

@hannahbast

My log shows something like:

```
INFO: By default, integers that cannot be represented by QLever will throw an exception
INFO: Parsing input triples and creating partial vocabularies, one per batch ...
ERROR: Parse error at byte position 6388106517: Parse error at byte position 6388106517: Value 400.000 could not be parsed as an integer value
```

I get this when parsing the Turtle (.ttl) files from https://downloads.dbpedia.org/repo/lts/wikidata/. This is not the first such error; I have already fixed several (not just integer errors), and I do not know how many more exist in this data dump. Since this is a large amount of data, I would rather just skip all such errors and add the dump to my DB.

hannahbast commented 3 days ago

@debayan Have you validated the files? Here is a command to do that. Does it only produce warnings or also errors?

```shell
docker run -i --rm -v $(pwd):/data stain/jena riot --validate /data/filename.ttl
```

And may I ask why you want to use DBpedia? It's old, not well maintained anymore, and of really doubtful quality. I don't think there is anything useful in DBpedia that is not contained in one of the more modern knowledge graphs, notably Wikidata.

debayan commented 3 days ago

@hannahbast I have not validated the files, and I know they have erroneous lines. I am using DBpedia because we are working on a task where queries from one KG need to be translated into queries that work on another KG. The only dataset of a reasonable size we could find is LC-QuAD 2.0, which has queries against both KGs for a given question.