UTF-8 is not supported?

migalkin commented 7 years ago

I have a Fedbench query CD4:

SELECT ?actor ?news WHERE {
  ?film purl:title 'Tarzan' .
  ?film linkedMDB:actor ?actor .
  ?actor owl:sameAs ?x.
  ?y owl:sameAs ?x .
  ?y nytimes:topicPage ?news }

which has been rewritten to execute the following triple pattern against the LinkedMDB endpoint on an LDF server:

SELECT ?actor ?x WHERE { ?actor <http://www.w3.org/2002/07/owl#sameAs> ?x} LIMIT 100000 OFFSET 0

The Client throws the error:

WARNING TriplePatternIterator Unexpected "<http://dbpedia.org/resource/Espen_Skj\\u00C3\\u00B8nberg>," on line 47.
      events.js:160
     throw er; // Unhandled 'error' event
     ^

 Error: Unexpected "<http://dbpedia.org/resource/Espen_Skj\\u00C3\\u00B8nberg>," on line 47.
     at N3Lexer._syntaxError (/ldf_rest/node_modules/n3/lib/N3Lexer.js:358:12)
     at reportSyntaxError (/ldf_rest/node_modules/n3/lib/N3Lexer.js:325:54)
     at N3Lexer._tokenizeToEnd (/ldf_rest/node_modules/n3/lib/N3Lexer.js:311:18)
    at TrigFragmentIterator._parseData (/ldf_rest/node_modules/n3/lib/N3Lexer.js:393:16)
    at TrigFragmentIterator.TurtleFragmentIterator._transform (/ldf_rest/node_modules/ldf-client/lib/triple-pattern-fragments/TurtleFragmentIterator.js:47:8)
     at Immediate.readAndTransform (/ldf_rest/node_modules/asynciterator/asynciterator.js:959:12)
     at runCallback (timers.js:643:20)
     at tryOnImmediate (timers.js:610:5)
     at processImmediate [as _immediateCallback] (timers.js:582:5)

Does it mean that LDF Client does not support UTF-8?

RubenVerborgh commented 7 years ago

Hi @migalkin, no, it means that the dataset was wrongly encoded. Note that

<http://dbpedia.org/resource/Espen_Skj\\u00C3\\u00B8nberg>

is an invalid URI in Turtle syntax; it should be

<http://dbpedia.org/resource/Espen_Skj\u00C3\u00B8nberg>

My guess is that on the server side, you have used an HDT file to serve LinkedMDB? And that this HDT file was generated with rdf2hdt in.nt out.hdt rather than rdf2hdt -f turtle in.nt out.hdt? The -f turtle option is necessary, because the N-Triples parser is broken.
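As a rough illustration of the encoding problem described above, the following Python sketch scans text for IRIs that contain a double-escaped sequence (two backslashes before `uXXXX`), which is the form shown in the error message. The function name and the regular expressions are assumptions made for this example; they are not part of the LDF client, N3.js, or hdt-cpp.

```python
import re

# Two literal backslashes followed by "u" and four hex digits:
# the double-escaped form that the Turtle lexer rejects.
DOUBLE_ESCAPED = re.compile(r'\\\\u[0-9A-Fa-f]{4}')

# Very rough IRI matcher: anything between angle brackets.
# Assumption: good enough for spotting suspicious IRIs, not a real parser.
IRI = re.compile(r'<([^<>]*)>')

def find_double_escaped_iris(text):
    """Return the IRIs (without angle brackets) that contain a double-escaped sequence."""
    return [iri for iri in IRI.findall(text) if DOUBLE_ESCAPED.search(iri)]

if __name__ == "__main__":
    sample = (
        r'<http://example.org/actor/1> '
        r'<http://www.w3.org/2002/07/owl#sameAs> '
        r'<http://dbpedia.org/resource/Espen_Skj\\u00C3\\u00B8nberg> .'
    )
    for iri in find_double_escaped_iris(sample):
        print("double-escaped IRI:", iri)
```

Running a check like this over a served fragment or over the source dump may help tell whether the escaping went wrong before or after the HDT conversion.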

migalkin commented 7 years ago

Thank you @RubenVerborgh I used the -f turtle option and now the query works fine [and the size of the hdt file is 20 times less =) ]

RubenVerborgh commented 7 years ago

Excellent 😄

RubenVerborgh commented 7 years ago

=> Do double check whether all the triples you want are in there though (i.e., hdtInfo out.hdt should show the correct number of total triples). When SERD encounters an error, the conversion process stops (most of the time with an error, sometimes without unfortunately).

migalkin commented 7 years ago

@RubenVerborgh actually you are right: the HDT file created with the broken NT parser contained all the triples from the LinkedMDB dump, but rdf2hdt -f turtle linkedmdb.nt linkedmdb.hdt results in only 160142 triples.

So what I do:

rdf2hdt -f turtle linkedmdb-latest-dump.nt linkedmdb-latest-dump.hdt            
RDF format: turtle
invalid IRI character `?' (escape %8B7E)essed.: 0 % / 0 %                      
invalid IRI character `?'00 K triples processed.: 0 % / 0 %                      
invalid IRI character `@' (escape %8B7E)ed.: 0 % / 50 %                      
invalid IRI character `@'K triples processed.: 0 % / 50 %                      
HDT Successfully generated.                                           
Total processing time: Clock(1 sec 836 ms 366 us)  User(1 sec 762 ms 380 us)  System(71 ms 897 us)

Then running hdtInfo:

<file://linkedmdb-latest-dump.nt> <http://rdfs.org/ns/void#triples> "160142" .
<file://linkedmdb-latest-dump.nt> <http://rdfs.org/ns/void#properties> "8" .
<file://linkedmdb-latest-dump.nt> <http://rdfs.org/ns/void#distinctSubjects> "149209" .
<file://linkedmdb-latest-dump.nt> <http://rdfs.org/ns/void#distinctObjects> "52182"

The original linkedmdb dump has:

 wc -l linkedmdb-latest-dump.nt 
6148121 linkedmdb-latest-dump.nt

The problem is that HDT parser doesn't produce any error and writes that the file has been created successfully.
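Since the converter can report success while dropping triples, a comparison like the one done by hand above can be scripted. The Python sketch below extracts the void#triples value from hdtInfo-style output (in the format pasted above) and compares it with the number of statement lines in the N-Triples input. The function names are hypothetical, and the check is only approximate: it assumes one triple per non-blank, non-comment line, and the counts can differ slightly for legitimate reasons if the dump contains duplicate triples.

```python
import re

# Matches the void#triples line in the hdtInfo output format shown in this thread.
VOID_TRIPLES = re.compile(r'<http://rdfs\.org/ns/void#triples>\s+"(\d+)"')

def reported_triples(hdt_info_output):
    """Return the triple count reported in hdtInfo-style output, or None if absent."""
    match = VOID_TRIPLES.search(hdt_info_output)
    return int(match.group(1)) if match else None

def count_statements(nt_lines):
    """Count non-blank, non-comment lines (assumed: one triple per line)."""
    return sum(
        1 for line in nt_lines
        if line.strip() and not line.lstrip().startswith('#')
    )

def missing_triples(hdt_info_output, nt_lines):
    """Return how many input statements are unaccounted for in the HDT file."""
    reported = reported_triples(hdt_info_output)
    if reported is None:
        raise ValueError("no void#triples entry found in the hdtInfo output")
    return count_statements(nt_lines) - reported

if __name__ == "__main__":
    info = '<file://example.nt> <http://rdfs.org/ns/void#triples> "2" .'
    lines = [
        '<http://example.org/s1> <http://example.org/p> <http://example.org/o1> .',
        '# a comment line',
        '<http://example.org/s2> <http://example.org/p> <http://example.org/o2> .',
        '<http://example.org/s3> <http://example.org/p> <http://example.org/o3> .',
    ]
    print("missing triples:", missing_triples(info, lines))
```

For a real dump, the lines could be streamed from the open .nt file rather than held in a list, so that the whole file never has to fit in memory.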

RubenVerborgh commented 7 years ago

> The problem is that HDT parser doesn't produce any error and writes that the file has been created successfully.

Yes, I just fixed that in https://github.com/rdfhdt/hdt-cpp/commit/d3b02a965589a250d5c1ffa7f8ba6d9000d83513

The solution is to ensure that the input file is valid, by passing it through a tool such as SERD first.
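Before running a full validator, a quick pre-check can point at the kind of "invalid IRI character" problems reported above. The Python sketch below flags characters that the N-Triples grammar does not allow inside an IRI (control characters and space, plus `<`, `>`, `"`, `{`, `}`, `|`, `^`, the backtick, and a backslash that is not part of a `\uXXXX` or `\UXXXXXXXX` escape). The function name is hypothetical, and this is not a substitute for SERD or another real parser; it only covers this one rule.

```python
import re

# Numeric escapes that are allowed inside an IRI; these are removed
# before looking for stray backslashes.
UCHAR = re.compile(r'\\u[0-9A-Fa-f]{4}|\\U[0-9A-Fa-f]{8}')

# Characters not allowed inside <...> in N-Triples:
# U+0000 to U+0020, and < > " { } | ^ ` \
FORBIDDEN = re.compile(r'[\x00-\x20<>"{}|^`\\]')

def invalid_iri_chars(iri):
    """Return the sorted list of distinct forbidden characters found in an IRI.

    The argument is the IRI text without its surrounding angle brackets.
    """
    without_escapes = UCHAR.sub('', iri)
    return sorted(set(FORBIDDEN.findall(without_escapes)))

if __name__ == "__main__":
    for candidate in (
        'http://example.org/resource/ok',
        'http://example.org/resource/has space',
        'http://example.org/resource/{braces}',
    ):
        problems = invalid_iri_chars(candidate)
        print(candidate, "->", problems if problems else "no forbidden characters")
```

A check like this only narrows down where to look; the triple count of the resulting HDT file still needs to be verified afterwards.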

migalkin commented 7 years ago

@RubenVerborgh I used those regexps we found before to clean the entire LinkedMDB dump and retain all the triples, so that neither SERD nor the HDT parser throws an error, and the parsing went fine. However, when I attach the new HDT file to the server, I get an error during setup: "This software cannot open this version of HDT File". I used the new version of the HDT C++ library you updated today. Is this a server issue?

RubenVerborgh commented 7 years ago

Not a server issue, but possibly an outdated HDT-Node version. Can you post your HDT file somewhere so I can check?

RubenVerborgh commented 7 years ago

Never mind, I found a testcase myself. On it.

migalkin commented 7 years ago

In case you need it: https://drive.google.com/file/d/0B3uXlknE4eJrZ19hem03M1VKVkk/view?usp=sharing

RubenVerborgh commented 7 years ago

@migalkin I found the bug and proposed a fix: https://github.com/rdfhdt/hdt-cpp/pull/43

Summary: you built your HDT file using the latest master, which writes an (in my opinion) incorrect version number into the HDT file. The stable branch does not have this problem.

RubenVerborgh commented 7 years ago

@migalkin This bug is now fixed; the latest version of hdt-cpp now generates compatible HDT files again.

migalkin commented 7 years ago

@RubenVerborgh great, thanks for the update

LinkedDataFragments / Client.js

UTF-8 is not supported? #26