SwissDataScienceCenter / calamus

A JSON-LD Serialization Library for Python
Apache License 2.0

Fix IRI serialization #83

Closed cmdoret closed 1 year ago

cmdoret commented 1 year ago

In calamus 0.4.0, IRIs are serialized as xsd:string when using add_value_types=True. They should instead be serialized as xsd:anyURI. This PR addresses the issue.

Example:

from dataclasses import dataclass
from typing import Optional

from calamus.schema import JsonLDSchema
from calamus import fields
from rdflib.namespace import Namespace

SDO = Namespace("http://schema.org/")

@dataclass
class Organization:
    """See http://schema.org/Organization"""

    _id: str
    logo: Optional[str] = None

class OrganizationSchema(JsonLDSchema):
    _id = fields.Id()
    logo = fields.IRI(SDO.logo)

    class Meta:
        rdf_type = SDO.Organization
        model = Organization
        add_value_types = True

sdsc = Organization('https://datascience.ch', logo='https://datascience.ch/wp-content/uploads/2019/04/logo_SDSC-300x82.png')

print(OrganizationSchema().dumps(sdsc, indent=2))

Output before the PR:

{
  "@id": "https://datascience.ch",
  "http://schema.org/logo": {
    "@id": {
      "@value": "https://datascience.ch/wp-content/uploads/2019/04/logo_SDSC-300x82.png",
      "@type": "http://www.w3.org/2001/XMLSchema#string"
    }
  },
  "@type": [
    "http://schema.org/Organization"
  ]
}

Output after the PR:

{
  "@id": "https://datascience.ch",
  "http://schema.org/logo": {
    "@value": "https://datascience.ch/wp-content/uploads/2019/04/logo_SDSC-300x82.png",
    "@type": "http://www.w3.org/2001/XMLSchema#anyURI"
  },
  "@type": [
    "http://schema.org/Organization"
  ]
}

Panaetius commented 1 year ago

Hi @cmdoret thank you so much for opening a PR.

The way calamus currently handles it is clearly wrong. However, I'm not sure if the proposed solution is correct, either.

With the xsd datatype and without @id, it's not an IRI reference (reference to a node) anymore, it becomes a string property with type coercion.

I.e. in ttl:

<https://datascience.ch> <http://schema.org/logo> "https://datascience.ch/wp-content/uploads/2019/04/logo_SDSC-300x82.png"^^<http://www.w3.org/2001/XMLSchema#anyURI> .

vs.

<https://datascience.ch> <http://schema.org/logo> <https://datascience.ch/wp-content/uploads/2019/04/logo_SDSC-300x82.png> .

so it loses the <...>.

I think @id is already a sort of native type in JSON-LD. So not adding an XSD type in IRI fields might be more correct? Alternatively having two field types, one for IRI node references, one for IRI properties (which adds the xsd type) might fit better?

I couldn't really find any detail on whether you can keep the @id-ness of the property, so to speak, while also adding an XSD type. Maybe we could ask on the JSON-LD repo which approach is correct?

It's certainly beyond me to judge if your proposed solution is identical to the output produced without add_value_types or if having it as a string, without <...> is meaningfully different (from a JSON-LD processor/RDF standpoint).
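
To make the distinction concrete, here is a minimal sketch using only the standard library. It renders an expanded JSON-LD object as an N-Triples-style term. The helper name to_nt_term is hypothetical and handles only the two shapes discussed here (node references and literals); it is not a JSON-LD processor and says nothing about how calamus works internally.

```python
XSD_ANYURI = "http://www.w3.org/2001/XMLSchema#anyURI"


def to_nt_term(obj: dict) -> str:
    """Render an expanded JSON-LD object as an N-Triples-style term.

    Hypothetical helper: covers only node references and plain/typed literals.
    """
    if "@id" in obj:
        # Node reference: the IRI goes in angle brackets.
        return f"<{obj['@id']}>"
    if "@value" in obj:
        # Literal: quoted string, optionally followed by ^^<datatype>.
        escaped = str(obj["@value"]).replace("\\", "\\\\").replace('"', '\\"')
        literal = '"' + escaped + '"'
        datatype = obj.get("@type")
        return f"{literal}^^<{datatype}>" if datatype else literal
    raise ValueError("expected a node reference or a value object")


logo = "https://datascience.ch/wp-content/uploads/2019/04/logo_SDSC-300x82.png"
node_ref = to_nt_term({"@id": logo})
typed_literal = to_nt_term({"@value": logo, "@type": XSD_ANYURI})

print(node_ref)       # IRI term, with <...>
print(typed_literal)  # string literal tagged as xsd:anyURI, without <...>
```

The two terms are different kinds of RDF object: the first points at a node, the second is a string that merely looks like a URI.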

cmdoret commented 1 year ago

Thanks @Panaetius, I just had a chat with @rmfranken about this and it looks like there's no obvious way to represent xsd:anyURI in JSON-LD. However, json-ld.org uses "@type": "@id" in the context of type coercion:

{
  "@context":
  {
    ...
    "homepage":
    {
      "@id": "http://schema.org/homepage",
      "@type": "@id"
    }
    ...
  }
...
  "homepage": "http://manu.sporny.org/",
...
}

This approach seems to work in our case as well:

{
  "@id": "https://datascience.ch",
  "http://schema.org/logo": {
    "@value": "https://datascience.ch/wp-content/uploads/2019/04/logo_SDSC-300x82.png",
    "@type": "@id"
  },
  "@type": [
    "http://schema.org/Organization"
  ]
}

This converts to the following ttl:

@prefix schema: <http://schema.org/>.

<https://datascience.ch>
    a schema:Organization;
    schema:logo
    <https://datascience.ch/wp-content/uploads/2019/04/logo_SDSC-300x82.png>.

Which is effectively the same as just omitting "@type". I see two potential solutions:

I would lean towards the latter because with the conversion sequence json-ld > ttl > json-ld, it retains the same representation. By contrast, the first option loses the @type. What do you think?

cmdoret commented 1 year ago

Below are all the Turtle serializations we considered for an IRI:

  1. "myURI.com"^^xsd:anyURI :x: Not what we want (string that looks like a URI)
  2. <myURI.com> :heavy_check_mark:
  3. <myURI.com>^^xsd:anyURI :x: Invalid turtle syntax
  4. <myURI.com> rdf:type xsd:anyURI :x: Incorrect: rdf:type is not appropriate for xsd datatypes

The only valid form, <myURI.com>, can simply be serialized as { "@id": "myURI.com" }. This is what the last commit does:

{
  "@id": "https://datascience.ch",
  "http://schema.org/logo": {
    "@id": "https://datascience.ch/wp-content/uploads/2019/04/logo_SDSC-300x82.png"
  },
  "@type": [
    "http://schema.org/Organization"
  ]
}

If you think add_value_types=True should explicitly add "@type": "@id", I can change it.

Panaetius commented 1 year ago

I think add_value_types not doing anything on IRI fields makes sense :+1:
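
As a rough illustration of that behaviour, here is a simplified sketch. The function serialize_value and its is_iri flag are hypothetical names made up for this example; this is not calamus's API or its actual implementation, only the rule agreed on above.

```python
XSD_STRING = "http://www.w3.org/2001/XMLSchema#string"


def serialize_value(value: str, is_iri: bool, add_value_types: bool) -> dict:
    """Hypothetical serializer: IRI fields ignore add_value_types."""
    if is_iri:
        # IRI fields are node references, so no XSD datatype is attached.
        return {"@id": value}
    if add_value_types:
        # Plain string fields get an explicit xsd:string datatype.
        return {"@value": value, "@type": XSD_STRING}
    return {"@value": value}


logo = "https://datascience.ch/wp-content/uploads/2019/04/logo_SDSC-300x82.png"
with_types = serialize_value(logo, is_iri=True, add_value_types=True)
without_types = serialize_value(logo, is_iri=True, add_value_types=False)
print(with_types == without_types)
```

Under this rule the IRI field serializes identically whether or not add_value_types is set.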

Panaetius commented 1 year ago

I had to change some settings to run tests on PRs created from forks. You can merge now :slightly_smiling_face: