ad-freiburg / qlever

Very fast SPARQL Engine, which can handle very large knowledge graphs like the complete Wikidata, offers context-sensitive autocompletion for SPARQL queries, and allows combination with text search. It's faster than engines like Blazegraph or Virtuoso, especially for queries involving large result sets.
Apache License 2.0
355 stars, 42 forks

ERROR: This code should be unreachable. In file "/app/src/parser/RdfEscaping.cpp " at line 145 #1362

Closed fbelleau closed 3 months ago

fbelleau commented 3 months ago

I get a parsing error with my N-Triples input file of more than 200 million triples.

Here is the message:

echo '{ "ascii-prefixes-only": false, "num-triples-per-batch": 10000 }' > flymine-object.settings.json
docker run --rm -u $(id -u):$(id -g) -v /etc/localtime:/etc/localtime:ro -v $(pwd):/index -w /index --init --entrypoint bash --name qlever.index.flymine-object docker.io/adfreiburg/qlever:latest -c 'zcat flymine-object.nt.gz | IndexBuilderMain -F ttl -f - -i flymine-object -s flymine-object.settings.json --stxxl-memory 5G | tee flymine-object.index-log.txt'

2024-06-06 04:51:45.817 - INFO: QLever IndexBuilder, compiled on Tue Apr  2 19:02:03 UTC 2024 using git hash 25449d
2024-06-06 04:51:45.820 - INFO: You specified the input format: TTL
2024-06-06 04:51:45.820 - INFO: Processing input triples from /dev/stdin ...
2024-06-06 04:51:45.822 - INFO: Locale was not specified in settings file, default is en_US
2024-06-06 04:51:45.822 - INFO: You specified "locale = en_US" and "ignore-punctuation = 0"
2024-06-06 04:51:45.823 - INFO: You specified "parallel-parsing = true", which enables faster parsing for TTL files that don't include multiline literals with unescaped newline characters and that have newline characters after the end of triples.
2024-06-06 04:51:45.823 - INFO: You specified "num-triples-per-batch = 10,000", choose a lower value if the index builder runs out of memory
2024-06-06 04:51:45.823 - INFO: Integers that cannot be represented by QLever will throw an exception (this is the default behavior)
2024-06-06 04:57:11.944 - INFO: Input triples processed: 100,000,000
2024-06-06 04:58:20.381 - ERROR: This code should be unreachable. In file "/app/src/parser/RdfEscaping.cpp " at line 145

Is there a log option that reports which input line caused the problem? These input format errors are hard to diagnose without a proper message identifying the line involved.

The N-Triples file is available here for you to test:

https://huggingface.co/datasets/bio2rdf/flymine_nt/blob/main/flymine-object.nt.gz

tuukka commented 3 months ago

First observation: Your file is not a valid N-Triples file, because the first line contains a literal number which is not in quotes.

Second observation: I tried to validate the file as a Turtle file using Serdi, and after running for some 5 minutes, got this error:

$ zcat "flymine-object.nt.gz?download=true" | serdi -i Turtle - >/dev/null
error: (stdin):122031261:1113: invalid escape `\.'

tuukka commented 3 months ago

And the syntax error is obvious when looking at the end of the offending line (122031261):

<http://bio2rdf.org/flymine:Comment:17676592> <http://bio2rdf.org/flymine_voc:description> 'Calcium/calmodulin-dependent protein kinase that operates in the calcium-triggered CaMKK-CaMK1 signaling cascade and, upon calcium influx, regulates transcription activators activity, cell cycle, hormone production, cell differentiation, actin filament organization and neurite outgrowth. Recognizes the substrate consensus sequence [MVLIF]-x-R-x(2)-[ST]-x(3)-[MVLIF]. Regulates axonal extension and growth cone motility in hippocampal and cerebellar nerve cells. Upon NMDA receptor-mediated Ca(2+) elevation, promotes dendritic growth in hippocampal neurons and is essential in synapses for full long-term potentiation (LTP) and ERK2-dependent translational activation. Downstream of NMDA receptors, promotes the formation of spines and synapses in hippocampal neurons by phosphorylating ARHGEF7/BETAPIX on \'Ser-516\', which results in the enhancement of ARHGEF7 activity and activation of RAC1. Promotes neuronal differentiation and neurite outgrowth by activation and phosphorylation of MARK2 on \'Ser-91\', \'Ser-92\...' .

Does this help you move forward, or do you need help fixing the file?
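
Once a validator has reported a line number, the line itself can be pulled out of the compressed dump without unpacking it. The following is a minimal sketch, not part of QLever or Serdi; the function name `read_line` is hypothetical, and it assumes the file is UTF-8 and that lines are separated by "\n" only, which may not match how every validator counts lines.

```python
import gzip
from itertools import islice


def read_line(path, line_number, encoding="utf-8"):
    """Return line `line_number` (1-based) of a gzip-compressed text file.

    Returns None if the file has fewer lines. The file is streamed, so
    memory use stays small even for very large dumps.
    """
    # newline="\n" makes "\n" the only line terminator (an assumption about
    # how the reported line numbers were counted).
    with gzip.open(path, "rt", encoding=encoding, newline="\n") as f:
        return next(islice(f, line_number - 1, line_number), None)
```

For the file in this thread, a call such as `read_line("flymine-object.nt.gz", 122031261)` would still have to decompress everything before that line, so it can take a few minutes.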

joka921 commented 3 months ago

@tuukka Thank you for helping with the analysis of the file.

@fbelleau You are right to expect a better error message here. There are two problems: first, this shouldn't surface as an internal assertion failure; second, the error should propagate back to the parser, which should then give you a little more context around the offending line. I hope I'll be able to fix this some time in the future.
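
Until the parser reports the position itself, a line-based pre-check can locate escapes like the `\.` above. The following is a rough sketch, not QLever's or Serdi's logic; the names `find_invalid_escapes` and `scan` are hypothetical. It assumes one triple per line and simply checks every backslash on the line against the escapes the Turtle grammar allows in string literals (`\t \b \n \r \f \" \' \\`, `\uXXXX`, `\UXXXXXXXX`). It does not parse the triple, so it is slightly more permissive than a real validator inside IRIs.

```python
import re

# A backslash starts a valid escape only if it is followed by one of
# t b n r f " ' \  or by u + 4 hex digits or U + 8 hex digits.
VALID_ESCAPE = re.compile(r"""\\(?:[tbnrf"'\\]|u[0-9A-Fa-f]{4}|U[0-9A-Fa-f]{8})""")


def find_invalid_escapes(line):
    """Return the 1-based columns of backslashes that start no valid escape."""
    columns = []
    pos = 0
    while True:
        pos = line.find("\\", pos)
        if pos == -1:
            return columns
        match = VALID_ESCAPE.match(line, pos)
        if match:
            # Skip the whole escape so that "\\" is not read as two escapes.
            pos = match.end()
        else:
            columns.append(pos + 1)
            pos += 1


def scan(lines):
    """Yield (line_number, column) for every invalid escape in `lines`."""
    for line_number, line in enumerate(lines, start=1):
        for column in find_invalid_escapes(line):
            yield line_number, column
```

Passing `sys.stdin` to `scan` and printing the results would let this run at the end of a `zcat ... |` pipeline. Columns are counted in characters, so they may differ from a validator that counts bytes.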

In general:

  1. QLever's parser treats N-Triples files as Turtle files (we have no distinct N-Triples parser, but N-Triples is a subset of Turtle). That is why triples like <a> <b> 24 . work as expected, although they are technically not valid N-Triples.

  2. If you run into problems with our parser, we recommend validating your Turtle/N-Triples files with a dedicated RDF validation tool, such as the validators from Apache Jena (or probably the tool that @tuukka used), to find out whether the problem is in the dataset or in QLever.
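
To illustrate the first point, the sketch below flags lines whose object is a bare number or boolean, which is Turtle shorthand and not allowed in strict N-Triples. It is a heuristic under stated assumptions (one triple per line, terminated by " ."), not a validator, and the name `bare_literal_object` is hypothetical.

```python
import re

# Matches a Turtle-style bare numeric or boolean object at the end of a line,
# e.g. the 24 in "<a> <b> 24 .". Quoted literals and IRIs do not match,
# because the term must be preceded by whitespace and followed by the final dot.
BARE_OBJECT = re.compile(
    r"\s([+-]?(?:[0-9]+(?:\.[0-9]+)?|\.[0-9]+)(?:[eE][+-]?[0-9]+)?|true|false)"
    r"\s*\.\s*$"
)


def bare_literal_object(line):
    """Return the bare literal ending the triple on `line`, or None."""
    match = BARE_OBJECT.search(line)
    return match.group(1) if match else None
```

A line flagged this way is still fine for QLever, since it is parsed as Turtle, but a strict N-Triples validator may reject it.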

fbelleau commented 3 months ago

Thank you for your help. I have fixed the escape problem. serdi is very useful.