KonradHoeffner / hdt

Library for the Header Dictionary Triples (HDT) compression file format for RDF data.
https://crates.io/crates/hdt
MIT License

subject IDs off by one #4

Closed KonradHoeffner closed 1 year ago

KonradHoeffner commented 1 year ago
[
    TripleId { subject_id: 0, predicate_id: 90, object_id: 13304 },
    TripleId { subject_id: 0, predicate_id: 101, object_id: 19384 },
    TripleId { subject_id: 0, predicate_id: 111, object_id: 75817 },
    TripleId { subject_id: 1, predicate_id: 90, object_id: 19470 },
    TripleId { subject_id: 1, predicate_id: 101, object_id: 13049 },
    TripleId { subject_id: 1, predicate_id: 104, object_id: 13831 },
    TripleId { subject_id: 1, predicate_id: 111, object_id: 75817 },
    TripleId { subject_id: 2, predicate_id: 90, object_id: 19313 },
]
sample [
    (
        "_:b1",
        "http://www.w3.org/1999/02/22-rdf-syntax-ns#_1",
        "http://data.semanticweb.org/person/barry-norton",
    ),
    (
        "_:b1",
        "http://www.w3.org/1999/02/22-rdf-syntax-ns#_2",
        "http://data.semanticweb.org/person/reto-krummenacher",
    ),
    (
        "_:b1",
        "http://www.w3.org/1999/02/22-rdf-syntax-ns#type",
        "http://www.w3.org/1999/02/22-rdf-syntax-ns#Seq",
    ),
    (
        "_:b1",
        "http://www.w3.org/1999/02/22-rdf-syntax-ns#_1",
        "http://data.semanticweb.org/person/robert-isele",
    ),
    (
        "_:b1",
        "http://www.w3.org/1999/02/22-rdf-syntax-ns#_2",
        "http://data.semanticweb.org/person/anja-jentzsch",
    ),
    (
        "_:b1",
        "http://www.w3.org/1999/02/22-rdf-syntax-ns#_3",
        "http://data.semanticweb.org/person/christian-bizer",
    ),
    (
        "_:b1",
        "http://www.w3.org/1999/02/22-rdf-syntax-ns#type",
        "http://www.w3.org/1999/02/22-rdf-syntax-ns#Seq",
    ),
    (
        "_:b10",
        "http://www.w3.org/1999/02/22-rdf-syntax-ns#_1",
        "http://data.semanticweb.org/person/raphael-troncy",
    ),
]

IDs should start at 1, but subject IDs start at 0. This offsets all subjects by one, except the first, because IDs 0 and 1 both map to the first subject.

KonradHoeffner commented 1 year ago

There were actually two bugs:

  1. The off-by-one error was caused by x starting at 0; it is fixed by starting at 1.
  2. For unknown reasons, there was a second bug that shifted the subject IDs even more at positions further into the file.
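
For illustration, the first bug can be reproduced with a toy model. This is only a sketch and not the crate's actual code: `subject_label` and `subject_ids` are hypothetical helpers, the clamping of ID 0 to the first entry is an assumption about why 0 and 1 resolve to the same subject, and `<third subject>` is a placeholder for a subject that does not appear in the sample above. Only the second bullet's bug is left out, since its cause is unknown.

```rust
/// Hypothetical lookup in a 1-based subject dictionary (not the hdt crate API).
/// Assumption: ID 0 is clamped to the first entry, so 0 and 1 resolve to the same subject.
fn subject_label(subjects: &[&str], id: usize) -> Option<String> {
    subjects.get(id.saturating_sub(1)).map(|s| s.to_string())
}

/// Hypothetical ID assignment: one ID per triple, with the subject counter
/// (called x in the list above) starting at `start`.
fn subject_ids(triples_per_subject: &[usize], start: usize) -> Vec<usize> {
    let mut ids = Vec::new();
    for (i, &n) in triples_per_subject.iter().enumerate() {
        // every triple of the i-th subject gets the same subject ID
        ids.extend(std::iter::repeat(start + i).take(n));
    }
    ids
}

fn main() {
    // "_:b1" and "_:b10" come from the sample; the third name is a placeholder.
    let subjects = ["_:b1", "_:b10", "<third subject>"];
    // Triples per subject, matching the ID runs 0,0,0,1,1,1,1,2 in the debug output.
    let counts = [3, 4, 1];

    for start in [0, 1] {
        let labels: Vec<Option<String>> = subject_ids(&counts, start)
            .iter()
            .map(|&id| subject_label(&subjects, id))
            .collect();
        println!("start = {}: {:?}", start, labels);
    }
}
```

With `start = 0` the model yields seven triples for `_:b1` followed by one for `_:b10`, which is the pattern in the sample; with `start = 1` the four triples in the second run move to `_:b10`.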