apache / jena

Apache Jena
https://jena.apache.org/
Apache License 2.0
1.12k stars 652 forks source link

Equality of RDFDatatypes is by reference only --> Should it stay that way? #2808

Closed arne-bdt closed 3 weeks ago

arne-bdt commented 3 weeks ago

Version

5.3.0-SNAPSHOT

Question

In apache/jena#2795, I discovered that no implementation of RDFDatatype provides an override for Object#equals or Object#hashCode, resulting in all comparisons being performed by reference only.

For instance, these tests fail:

@Test public void baseDataTypeEquality() {
    var rdfDataType1 = new BaseDatatype("urn:x-hp-dt:unknown");
    var rdfDataType2 = new BaseDatatype("urn:x-hp-dt:unknown");

    assertEquals(rdfDataType1, rdfDataType2);
}

@Test public void xsdDoubleEquality() {
    var rdfDataType1 = new XSDDouble("double", Double.class);
    var rdfDataType2 = XSDDatatype.XSDdouble;

    assertEquals(rdfDataType1, rdfDataType2);
}

To further illustrate the inconsistency, here’s a scenario where the first test fails, but the second one succeeds:

@Test
public void testEqualityA() {
    var l1 = LiteralLabelFactory.create("foo", new BaseDatatype("urn:x-hp-dt:unknown"));
    var l2 = LiteralLabelFactory.create("foo", new BaseDatatype("urn:x-hp-dt:unknown"));

    assertEquals(l1, l2);
}

@Test
public void testEqualityB() {
    var m = ModelFactory.createDefaultModel();
    var l1 = m.createTypedLiteral("foo", "urn:x-hp-dt:unknown");
    var l2 = m.createTypedLiteral("foo", "urn:x-hp-dt:unknown");

    assertEquals(l1, l2);
}

I propose implementing Object#equals and Object#hashCode in BaseDatatype to compare instances by their class and URI, as shown below:

@Override
public boolean equals(Object o) {
    if (this == o) return true;
    if (o == null || getClass() != o.getClass()) return false;
    final BaseDatatype that = (BaseDatatype) o;
    return Objects.equals(uri, that.uri);
}

@Override
public int hashCode() {
    return Objects.hashCode(uri);
}

I couldn’t find any subclass of BaseDatatype where uniqueness isn’t defined by the URI, so this approach should cover all cases without needing additional overrides.

afs commented 3 weeks ago

The datatype registry provides the unique references. It's an open-ended enum.

LiteralLabelFactory should not be considered application facing API. It's in impl.

uniqueness isn’t defined by the URI

That's contract for RDF datatypes.

if (... || getClass() != o.getClass()) return false;

I think chaos will result if "same URI, different class". It will be lost on printing and storing in a database.

Maybe define equals and hashCode as final.

arne-bdt commented 3 weeks ago

@afs Sorry, I don´t understand parts of your comment.

The datatype registry provides the unique references. It's an open-ended enum.

--> What I understood: It should stay the way it is. By referece is fine, due to the registry.

LiteralLabelFactory` should not be considered application facing API. It's in impl.

--> What I understood: I did not choose a valid example.

uniqueness isn’t defined by the URI

--> Where does this citation come from?

That's contract for RDF datatypes.

--> The issue is about RDFDatatypes. Does this mean: The contract for RDF datatypes is that it´s uniqueness is not defined by the URI? Or is it? --> Since the URI is the unique key for the datatype registry where due to reference equality, all existing datatypes have to be registered, then all datatypes must have a unique URI. Or am I mistaken?

I think chaos will result if "same URI, different class". It will be lost on printing and storing in a database.

--> I agree, that would be ugly.

Maybe define equals and hashCode as final.

--> Yeah... but with which kind of implementation? By reference or compare by URI only?

afs commented 3 weeks ago

Where does this citation come from?

https://www.w3.org/TR/rdf11-concepts/#section-Datatypes

A datatype consists of a lexical space, a value space and a lexical-to-value mapping, and is denoted by one or more IRIs.

The URI "denotes", i.e. names, the pair of two spaces and the mapping. The spaces (sets) and the mapping (a function, a set of pairs) have value equality.

Same URI, same datatype.

but with which kind of implementation? By reference or compare by URI only?

By URI. It's a value object so while there maybe multiple java objects (it's hard to stop that), equality is by-value. c.f. String.