KonradHoeffner / hdt

Library for the Header Dictionary Triples (HDT) compression file format for RDF data.
https://crates.io/crates/hdt
MIT License

Cyrillic URIs not found? #39

Closed by KonradHoeffner 7 months ago

KonradHoeffner commented 7 months ago

When using the Sophia HDT adapter in RickView, a URI with suffix хобби-N-0 is not found:

[WARN ] No triples found for entry/хобби-N-0. Did you configure the namespace correctly?

However, when using Turtle, it works:

[DEBUG] ruthes:entry/хобби-N-0 HTML 357.8µs

This could also be an issue with Sophia, but that is unlikely because it works with the Sophia FastGraph into which the Turtle is loaded. Alternatively, it could be an error in the Sophia adapter, which is part of HDT.

As a first step, a URI with a suffix like хобби-N-0 should be included in the test HDT file and the test suite.

KonradHoeffner commented 7 months ago

Added a test case with a Cyrillic URI and label in https://github.com/KonradHoeffner/hdt/commit/326979c7be32d14db005dc9c07bdd7a6daeae0b3, but that one already succeeds, so it seems that the bug is caused by the RickView code that handles the RDF HDT variant.

KonradHoeffner commented 7 months ago

After more research, this works in RickView when using the HDT test file extended by the resource in question, so it may be an issue with the HDT library after all, one that only occurs with the complete Ruthes file.

KonradHoeffner commented 7 months ago

Test data at https://drive.google.com/file/d/1k7P5s1cx9AvB6qhoudbJr-wEOsPoQBOJ/view?usp=sharing. One of the problematic subjects is http://lod.ruthes.org/resource/entry/хобби-N-0.

KonradHoeffner commented 7 months ago

The data does exist in the converted file and is found by hdtSearch from hdt-cpp:

# hdtSearch ruthes.hdt
Predicate Bitmap in 156 ms 580 us 14.86 %                                      
Count predicates in 777 ms 478 us 99.755 % / 34.255 %                       
Count Objects in 186 ms 238 us Max was: 1729489.464 %                      
Bitmap in 33 ms 535 us: 90.379 % / 48.202 %                      
Bitmap bits: 15236539 Ones: 4425810
Object references in 1 sec 125 ms 594 us/ 68.012 %                      
Sort lists in 1 sec 595 ms 489 us% / 77.727 %                      
Index generated in 3 sec 875 ms 25 us
>> http://lod.ruthes.org/resource/entry/хобби-N-0 ? ?
http://lod.ruthes.org/resource/entry/хобби-N-0 http://www.lexinfo.net/ontology/3.0/lexinfo#partOfSpeech http://www.lexinfo.net/ontology/3.0/lexinfo#noun
http://lod.ruthes.org/resource/entry/хобби-N-0 http://www.w3.org/1999/02/22-rdf-syntax-ns#type http://www.w3.org/ns/lemon/ontolex#Word
http://lod.ruthes.org/resource/entry/хобби-N-0 http://www.w3.org/2000/01/rdf-schema#label "ХОББИ"
http://lod.ruthes.org/resource/entry/хобби-N-0 http://www.w3.org/ns/lemon/ontolex#canonicalForm http://lod.ruthes.org/resource/form/хобби-Noun-neuter-nominative-singular
http://lod.ruthes.org/resource/entry/хобби-N-0 http://www.w3.org/ns/lemon/ontolex#sense http://lod.ruthes.org/resource/sense/129491-хобби-n-0
http://lod.ruthes.org/resource/entry/хобби-N-0 http://www.w3.org/ns/prov#wasDerivedFrom http://lod.ruthes.org/resource/ruthes-lite
http://lod.ruthes.org/resource/entry/хобби-N-0 http://www.w3.org/ns/prov#wasDerivedFrom http://lod.ruthes.org/resource/zaliznyak-dictionary
7 results in 178 us

KonradHoeffner commented 7 months ago

OK, I finally made some headway on the problem: the entry does exist and is found at ID 136377 when looping through all entries of the shared dictionary and extracting them. The binary search also works correctly. However, for some reason the Russian letter "о" is lost while building up the entry during the search, so the reconstructed strings are broken:

search shared
locating element http://lod.ruthes.org/resource/entry/хобби-N-0 in block 8523 with block_size 16
in block http://lod.ruthes.org/resource/entry/хмурить_ицо-VG-0
in block http://lod.ruthes.org/resource/entry/хмурить_иб-VG-0
in block http://lod.ruthes.org/resource/entry/хмуриться-V-0
in block http://lod.ruthes.org/resource/entry/хмурсть-N-0
in block http://lod.ruthes.org/resource/entry/хмурый-Adj-0
in block http://lod.ruthes.org/resource/entry/ха-N-0
less common characters between http://lod.ruthes.org/resource/entry/ха-N-0 and http://lod.ruthes.org/resource/entry/хобби-N-0, not found
in block http://lod.ruthes.org/resource/entry/хаыкать-V-0
in block http://lod.ruthes.org/resource/entry/хбби-N-0

The last entry is the garbled "hobby".

For comparison, these are the correct entries as generated with the extract function:

http://lod.ruthes.org/resource/entry/хмурить_лицо-VG-0 at id 136370
http://lod.ruthes.org/resource/entry/хмурить_лоб-VG-0 at id 136371
http://lod.ruthes.org/resource/entry/хмуриться-V-0 at id 136372
http://lod.ruthes.org/resource/entry/хмурость-N-0 at id 136373
http://lod.ruthes.org/resource/entry/хмурый-Adj-0 at id 136374
http://lod.ruthes.org/resource/entry/хна-N-0 at id 136375
http://lod.ruthes.org/resource/entry/хныкать-V-0 at id 136376
http://lod.ruthes.org/resource/entry/хобби-N-0 at id 136377
http://lod.ruthes.org/resource/entry/хобот-N-0 at id 136378

The first broken one is "хна" (hair color with the German name "Henna") which gets reduced to "ха".

KonradHoeffner commented 7 months ago

I think the error is somewhere here:

       while (id_in_block < self.block_size) && (pos < self.packed_data.len()) {
           // Decode prefix
           let (delta, vbyte_bytes) = decode_vbyte_delta(&self.packed_data, pos);
           pos += vbyte_bytes;

           //Copy suffix
           let slen = self.strlen(pos);
           temp_string.truncate(temp_string.floor_char_boundary(delta));
           temp_string.push_str(self.pos_str(pos, slen));
           println!("in block {temp_string}");
           if delta >= cshared {
               // Current delta value means that this string has a larger long common prefix than the previous one
               let boundary = temp_string.floor_char_boundary(cshared);
               cshared += Self::longest_common_prefix(
                   temp_string[boundary..].as_bytes(),
                   element[boundary..].as_bytes(),
               );

KonradHoeffner commented 7 months ago

It could also be here:

  fn pos_str(&self, pos: usize, slen: usize) -> &str {
      assert!(
          pos + slen < self.packed_data.len(),
          "Invalid arguments pos_str({pos},{slen}), packed data len {}).",
          self.packed_data.len()
      );
      if let Ok(s) = str::from_utf8(&self.packed_data[pos..pos + slen]) {
          s
      } else {
          error!(
              "invalid UTF8, skipping a byte {}",
              String::from_utf8_lossy(&self.packed_data[pos..pos + slen])
          );
          self.pos_str(pos + 1, slen)
      }
  }

KonradHoeffner commented 7 months ago

The error! macro does trigger a lot, but its output is not shown even with cargo test -- --nocapture.

donpellegrino commented 7 months ago

It is not clear that the issue https://github.com/rdfhdt/hdt-cpp/issues/219 is necessarily relevant. But I wanted to mention it here since it does have to do with string handling in the original C++ implementation. Maybe logic for handling strings exists somewhere during the creation of the test HDT file that is corrupting the inputs to the Rust queries.

KonradHoeffner commented 7 months ago

Thank you, I will look it up! And generally see if I can get some inspiration from the cpp implementation :-)

KonradHoeffner commented 7 months ago

The first byte seems to be invalid UTF-8:

locating element http://lod.ruthes.org/resource/entry/хобби-N-0 in block 8523 with block_size 16
invalid UTF8, skipping a byte �ицо-VG-0
invalid UTF8, skipping a byte �ицо-VG-0
in block http://lod.ruthes.org/resource/entry/хмурить_ицо-VG-0 with new part ицо-VG-0
delta 53 >= cshared 0
invalid UTF8, skipping a byte �б-VG-0
invalid UTF8, skipping a byte �б-VG-0
in block http://lod.ruthes.org/resource/entry/хмурить_иб-VG-0 with new part б-VG-0
delta 55 >= cshared 40
in block http://lod.ruthes.org/resource/entry/хмуриться-V-0 with new part ся-V-0
delta 51 >= cshared 41
invalid UTF8, skipping a byte �сть-N-0
invalid UTF8, skipping a byte �сть-N-0
in block http://lod.ruthes.org/resource/entry/хмурсть-N-0 with new part сть-N-0
delta 46 >= cshared 41
in block http://lod.ruthes.org/resource/entry/хмурый-Adj-0 with new part ый-Adj-0
delta 45 >= cshared 41
invalid UTF8, skipping a byte �а-N-0
invalid UTF8, skipping a byte �а-N-0
in block http://lod.ruthes.org/resource/entry/ха-N-0 with new part а-N-0
less common characters between http://lod.ruthes.org/resource/entry/ха-N-0 and http://lod.ruthes.org/resource/entry/хобби-N-0, not found
in block http://lod.ruthes.org/resource/entry/хаыкать-V-0 with new part ыкать-V-0
delta 41 >= cshared 41
invalid UTF8, skipping a byte �бби-N-0
invalid UTF8, skipping a byte �бби-N-0
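
One plausible reading of this log, sketched here with a shortened prefix: the front-coding delta counts shared bytes, not characters, so for neighbouring Cyrillic strings it can land in the middle of a two-byte character. The stored suffix then starts with a continuation byte and is not valid UTF-8 on its own. `floor_boundary` and `decode_step_str` are hypothetical stand-ins written for this illustration, not functions from the hdt crate; they only mimic the truncate-then-skip behaviour of the snippets quoted earlier.

```rust
use std::str;

/// Stable stand-in (assumption) for `str::floor_char_boundary`:
/// the largest char boundary that is <= `index`.
fn floor_boundary(s: &str, index: usize) -> usize {
    let mut i = index.min(s.len());
    while !s.is_char_boundary(i) {
        i -= 1;
    }
    i
}

/// Mimics one string-based decoding step: truncate the previous string at a
/// char boundary, then append the suffix, skipping leading bytes until the
/// remainder is valid UTF-8.
fn decode_step_str(prev: &str, delta: usize, suffix: &[u8]) -> String {
    let mut s = prev.to_owned();
    s.truncate(floor_boundary(&s, delta));
    let mut start = 0;
    while start < suffix.len() && str::from_utf8(&suffix[start..]).is_err() {
        start += 1;
    }
    s.push_str(str::from_utf8(&suffix[start..]).unwrap_or(""));
    s
}

fn main() {
    // Shortened versions of two neighbouring dictionary entries.
    let prev = "entry/хмурый-Adj-0";
    let next = "entry/хна-N-0";

    // Shared prefix length in bytes: 'м' and 'н' share their first byte,
    // so the count ends inside a character.
    let delta = prev.bytes().zip(next.bytes()).take_while(|(a, b)| a == b).count();
    let suffix = &next.as_bytes()[delta..];

    println!("delta = {delta}");
    println!("suffix is valid UTF-8: {}", str::from_utf8(suffix).is_ok());
    println!("string-based result: {}", decode_step_str(prev, delta, suffix));
}
```

In this sketch the truncation discards the first byte of the split character and the skip discards the second, which would explain why exactly one letter disappears from each affected entry.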

KonradHoeffner commented 7 months ago

OK, found the error: I shouldn't have meddled with UTF-8 in the first place. Instead, I should do everything with byte vectors and only create the String at the end, just like in the extract function.
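
A minimal sketch of that approach, assuming a simplified front-coded block in which each entry is already split into a shared byte count and a suffix. The real HDT format packs these with VByte lengths and terminators, which this sketch omits. `Entry`, `encode`, and `decode` are hypothetical names for illustration, not the crate's API.

```rust
use std::string::FromUtf8Error;

/// Hypothetical simplified front-coded entry: the number of bytes shared
/// with the previous string, followed by the remaining bytes.
struct Entry {
    delta: usize,
    suffix: Vec<u8>,
}

/// Front-code a sorted list of strings at the byte level.
fn encode(strings: &[&str]) -> Vec<Entry> {
    let mut prev: &[u8] = b"";
    let mut out = Vec::with_capacity(strings.len());
    for s in strings {
        let cur = s.as_bytes();
        let delta = prev.iter().zip(cur).take_while(|(a, b)| a == b).count();
        out.push(Entry { delta, suffix: cur[delta..].to_vec() });
        prev = cur;
    }
    out
}

/// Rebuild every string in a byte buffer and convert to `String` only once
/// an entry is complete, so a delta inside a character does no harm.
fn decode(entries: &[Entry]) -> Result<Vec<String>, FromUtf8Error> {
    let mut buf: Vec<u8> = Vec::new();
    let mut out = Vec::with_capacity(entries.len());
    for e in entries {
        buf.truncate(e.delta); // byte offset, may split a character
        buf.extend_from_slice(&e.suffix); // the suffix restores the missing bytes
        out.push(String::from_utf8(buf.clone())?);
    }
    Ok(out)
}

fn main() {
    // Shortened versions of entries from the block discussed in this thread.
    let words = [
        "entry/хмурый-Adj-0",
        "entry/хна-N-0",
        "entry/хныкать-V-0",
        "entry/хобби-N-0",
    ];
    let entries = encode(&words);
    let decoded = decode(&entries).expect("reassembled entries are valid UTF-8");
    assert_eq!(decoded, words);
    for w in &decoded {
        println!("{w}");
    }
}
```

The design choice is that prefix lengths stay plain byte offsets throughout, and UTF-8 validation happens only on complete entries, where every multi-byte character is whole again.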