Closed migalkin closed 7 years ago
Hi @migalkin, no, it means that the dataset was wrongly encoded. Note that
<http://dbpedia.org/resource/Espen_Skj\\u00C3\\u00B8nberg>
is an invalid URI in Turtle syntax; it should be
<http://dbpedia.org/resource/Espen_Skj\u00C3\u00B8nberg>
My guess is that on the server side, you have used an HDT file to serve LinkedMDB? And that this HDT file was generated with rdf2hdt in.nt out.hdt rather than rdf2hdt -f turtle in.nt out.hdt? The -f turtle option is necessary, because the N-Triples parser is broken.
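To illustrate what "wrongly encoded" can mean for the URI above, here is a minimal Python sketch. It assumes (this is a guess, not something confirmed in this thread) that the two escapes \u00C3 and \u00B8 are the UTF-8 bytes of a single character that were each stored as a separate code point. The helper names decode_uchar and undo_double_encoding are made up for this example.

```python
import re

def decode_uchar(iri: str) -> str:
    """Replace \\uXXXX and \\UXXXXXXXX escapes with the characters they denote."""
    def repl(match):
        return chr(int(match.group(1) or match.group(2), 16))
    return re.sub(r"\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})", repl, iri)

def undo_double_encoding(text: str) -> str:
    """Reinterpret each code point as one byte and decode those bytes as UTF-8.

    Assumption: the text was damaged by reading UTF-8 bytes as Latin-1.
    Raises an exception if that assumption does not hold for the input.
    """
    return text.encode("latin-1").decode("utf-8")

# The valid (single-backslash) form of the URI from this thread.
escaped = r"http://dbpedia.org/resource/Espen_Skj\u00C3\u00B8nberg"

decoded = decode_uchar(escaped)
print(decoded)                        # two separate characters where one was meant
print(undo_double_encoding(decoded))  # the single intended character, under the assumption above
```

If the second line prints a sensible name, the data was most likely double-encoded upstream rather than mangled by the client.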
Thank you @RubenVerborgh
I used the -f turtle option and now the query works fine [and the size of the HDT file is 20 times smaller =) ]
Excellent 😄
Do double check whether all the triples you want are in there though (i.e., hdtInfo out.hdt should show the correct number of total triples). When SERD encounters an error, the conversion process stops (most of the time with an error, sometimes without unfortunately).
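The check described above can be sketched in Python as follows. This is only an illustration: it assumes the hdtInfo output contains a void#triples statement with the count as a quoted number, and that the input is N-Triples with one statement per line. The function names and the tolerance parameter are hypothetical. Since duplicate triples in the input are expected to collapse in the HDT file, a small difference is not necessarily an error.

```python
import re

# Matches the void#triples statement in hdtInfo-style output (assumed format).
VOID_TRIPLES = re.compile(r'<http://rdfs\.org/ns/void#triples>\s+"(\d+)"')

def count_statements(nt_lines):
    """Count non-blank, non-comment lines, i.e. candidate N-Triples statements."""
    return sum(
        1 for line in nt_lines
        if line.strip() and not line.lstrip().startswith("#")
    )

def triples_reported(hdt_info_text):
    """Return the triple count from the header text, or None if it is absent."""
    match = VOID_TRIPLES.search(hdt_info_text)
    return int(match.group(1)) if match else None

def looks_truncated(nt_lines, hdt_info_text, tolerance=0.01):
    """True if the HDT file holds noticeably fewer triples than the input.

    tolerance is a hypothetical allowance for duplicates in the input.
    """
    expected = count_statements(nt_lines)
    actual = triples_reported(hdt_info_text)
    if actual is None:
        return True
    return actual < expected * (1 - tolerance)

# Toy data, not taken from a real dump.
sample_nt = [
    "<http://example.org/a> <http://example.org/p> <http://example.org/b> .",
    "# a comment",
    "",
    "<http://example.org/b> <http://example.org/p> <http://example.org/c> .",
    "<http://example.org/c> <http://example.org/p> <http://example.org/a> .",
]
sample_info = '<file://in.nt> <http://rdfs.org/ns/void#triples> "1" .'
print(looks_truncated(sample_nt, sample_info))
```

In practice one would stream the real .nt file line by line instead of holding it in a list.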
@RubenVerborgh actually you are right: the broken NT parser produced an HDT file with all the triples from the LinkedMDB dump, but
rdf2hdt -f turtle linkedmdb.nt linkedmdb.hdt
results in only 160142 triples.
Here is what I run:
rdf2hdt -f turtle linkedmdb-latest-dump.nt linkedmdb-latest-dump.hdt
RDF format: turtle
invalid IRI character `?' (escape %8B7E)essed.: 0 % / 0 %
invalid IRI character `?'00 K triples processed.: 0 % / 0 %
invalid IRI character `@' (escape %8B7E)ed.: 0 % / 50 %
invalid IRI character `@'K triples processed.: 0 % / 50 %
HDT Successfully generated.
Total processing time: Clock(1 sec 836 ms 366 us) User(1 sec 762 ms 380 us) System(71 ms 897 us)
Then running hdtInfo:
<file://linkedmdb-latest-dump.nt> <http://rdfs.org/ns/void#triples> "160142" .
<file://linkedmdb-latest-dump.nt> <http://rdfs.org/ns/void#properties> "8" .
<file://linkedmdb-latest-dump.nt> <http://rdfs.org/ns/void#distinctSubjects> "149209" .
<file://linkedmdb-latest-dump.nt> <http://rdfs.org/ns/void#distinctObjects> "52182"
The original linkedmdb dump has:
wc -l linkedmdb-latest-dump.nt
6148121 linkedmdb-latest-dump.nt
The problem is that the HDT parser doesn't report any error and says that the file was generated successfully.
Yes, I just fixed that in https://github.com/rdfhdt/hdt-cpp/commit/d3b02a965589a250d5c1ffa7f8ba6d9000d83513
The solution is to ensure that the input file is valid, by passing it through a tool such as SERD first.
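As a rough pre-check before running a real parser, the following Python sketch flags lines whose IRIs contain characters that the N-Triples IRIREF production does not allow. It is not a substitute for SERD: it only looks at IRIs, strips literals with a simple regex, and ignores everything else in the grammar. The function name invalid_iris and the example.org IRIs are made up for illustration.

```python
import re

# Quoted literals, removed first so that "<...>" inside a string is not mistaken for an IRI.
LITERAL = re.compile(r'"(?:[^"\\]|\\.)*"')

# Anything between angle brackets is treated as an IRI candidate.
IRI_CANDIDATE = re.compile(r'<([^>]*)>')

# IRIREF body: any character except controls, space, and <>"{}|^`\ ,
# or a \uXXXX / \UXXXXXXXX escape.
VALID_IRI_BODY = re.compile(
    r'(?:[^\x00-\x20<>"{}|^`\\]|\\u[0-9A-Fa-f]{4}|\\U[0-9A-Fa-f]{8})*'
)

def invalid_iris(line):
    """Return the IRIs on this line that violate the IRIREF character rules."""
    without_literals = LITERAL.sub('""', line)
    return [
        iri for iri in IRI_CANDIDATE.findall(without_literals)
        if not VALID_IRI_BODY.fullmatch(iri)
    ]

# The doubled backslash from the start of this thread is rejected,
# while the single-backslash form passes.
bad = r'<http://dbpedia.org/resource/Espen_Skj\\u00C3\\u00B8nberg> <http://example.org/p> <http://example.org/o> .'
good = r'<http://dbpedia.org/resource/Espen_Skj\u00C3\u00B8nberg> <http://example.org/p> <http://example.org/o> .'
print(invalid_iris(bad))
print(invalid_iris(good))
```

Lines reported by such a scan can then be fixed or dropped before handing the file to the converter.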
@RubenVerborgh I used the regexps we found before to clean the entire LinkedMDB dump while retaining all the triples, so neither SERD nor the HDT parser throws an error and the parsing went fine.
However, when I attach the new HDT file to the server, I get an error while setting it up:
This software cannot open this version of HDT File
I used the new version of the HDT C++ library you updated today.
Server issue?
Not a server issue, but possibly an outdated HDT-Node version. Can you post your HDT file somewhere so I can check?
Never mind, I found a testcase myself. On it.
@migalkin I found the bug and proposed a fix: https://github.com/rdfhdt/hdt-cpp/pull/43
Summary: you built your HDT file using the latest master, which writes an (in my opinion) incorrect version number into the HDT file. The stable branch does not have this problem.
@migalkin This bug is now fixed; the latest version of hdt-cpp now generates compatible HDT files again.
@RubenVerborgh great, thanks for the update
I have a Fedbench query CD4:
which has been rewritten to execute the following triple pattern against the LinkedMDB endpoint on an LDF server:
The Client throws the error:
Does it mean that the LDF Client does not support UTF-8?