Open debayan opened 1 week ago
@debayan Can you give an example of the kind of line you would like to skip?
Roughly speaking, there are two sorts of errors in RDF input:
1. Errors that, strictly speaking, violate the standard but are kind of OK to accept. For example, an IRI that contains a space. QLever currently outputs a WARNing for those, but accepts them.
2. Errors that should really be fixed by the producers of the dataset because they point to a deeper problem. For example, an N-Triples file containing a line where the object is missing or where the closing " of a literal is missing.
@hannahbast
My log shows something like:
INFO: By default, integers that cannot be represented by QLever will throw an exception
INFO: Parsing input triples and creating partial vocabularies, one per batch ...
ERROR: Parse error at byte position 6388106517: Parse error at byte position 6388106517: Value 400.000 could not be parsed as an integer value
I get this when parsing ttl files from https://downloads.dbpedia.org/repo/lts/wikidata/. This is not the first error I got. I fixed several such errors already (not just integer errors), and I do not know how many other errors exist in this data dump. Since this is a large amount of data, I would rather just skip all such errors and add the dump to my DB.
@debayan Have you validated the files? Here is a command to do that. Does it only produce warnings or also errors?
docker run -i --rm -v $(pwd):/data stain/jena riot --validate /data/filename.ttl
And may I ask why you want to use DBpedia? It's old, not well maintained anymore, and of really doubtful quality. I don't think there is anything useful in DBpedia that is not contained in one of the more modern knowledge graphs, notably Wikidata.
@hannahbast I have not validated the files, but I know they contain erroneous lines. I am using DBpedia because we are working on a task where queries written for one KG need to be translated into queries that work on another KG. The only dataset of a reasonable size we could find is LC-QuAD 2.0, which has queries for both KGs for a given question.
Is there an option in QLever to skip lines that produce parse errors during indexing?
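Independent of what QLever itself offers, one possible workaround for the integer error quoted above is to pre-filter the dump before indexing. The following is a minimal sketch, not a QLever feature. It assumes the files contain one triple per line (N-Triples style) and that the datatype is written as a full IRI. The function names `has_bad_integer`, `filter_lines`, and `filter_file` are hypothetical. It only catches malformed `xsd:integer` literals such as `"400.000"`, not other kinds of parse errors.

```python
import re

# Matches a literal typed as xsd:integer (full IRI form) and captures its lexical form.
XSD_INTEGER = re.compile(
    r'"([^"\\]*(?:\\.[^"\\]*)*)"\^\^<http://www\.w3\.org/2001/XMLSchema#integer>'
)
# Lexical form of an integer: optional sign followed by digits.
VALID_INTEGER = re.compile(r"[+-]?[0-9]+")


def has_bad_integer(line):
    """Return True if the line has an xsd:integer literal that is not an integer."""
    return any(
        VALID_INTEGER.fullmatch(match.group(1)) is None
        for match in XSD_INTEGER.finditer(line)
    )


def filter_lines(lines, rejected=None):
    """Yield lines without malformed xsd:integer literals.

    Skipped lines are appended to `rejected` as (line number, line) pairs.
    """
    for number, line in enumerate(lines, start=1):
        if has_bad_integer(line):
            if rejected is not None:
                rejected.append((number, line))
            continue
        yield line


def filter_file(src_path, dst_path):
    """Copy src_path to dst_path, dropping offending lines; return the dropped lines."""
    rejected = []
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        dst.writelines(filter_lines(src, rejected))
    return rejected
```

Calling `filter_file("filename.ttl", "filename.filtered.ttl")` would write a cleaned copy and return the skipped lines, so they can be inspected or reported to the dataset producers. Dropping lines silently loses data, so keeping the list of rejected lines is a deliberate choice here.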