RDFLib / rdflib

RDFLib is a Python library for working with RDF, a simple yet powerful language for representing information.
https://rdflib.readthedocs.org
BSD 3-Clause "New" or "Revised" License

Using Function STRDT for Dates Encoded as Strings #1723

Open tobiasschweizer opened 2 years ago

tobiasschweizer commented 2 years ago

Hi there

I am using rdflib to convert RDF data from one ontology to another using SPARQL CONSTRUCT queries. In general, it works well :-)

I encountered a problem with dates that are encoded as strings in the source data.

source data data.json (RiCO: https://www.ica.org/standards/RiC/RiC-O_v0-2.html):

[
  {
    "@id": "http://www.example.com/1",
    "@type": "https://www.ica.org/standards/RiC/ontology#Record",
    "http://purl.org/dc/terms/created": {
      "@id": "_:b107"
    }
  },
  {
    "@id": "_:b107",
    "@type": "https://www.ica.org/standards/RiC/ontology#DateRange",
    "https://www.ica.org/standards/RiC/ontology#normalizedDateValue": "1964/1966"
  }
]

conversion:

from rdflib import Graph
from rdflib.query import Result

sparql_construct_query = """
PREFIX schema: <http://schema.org/>
PREFIX rico: <https://www.ica.org/standards/RiC/ontology#>
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

CONSTRUCT {
    ?record a schema:Dataset ;
        schema:dateCreated ?recCreationDate .
} WHERE {
    ?record a rico:Record ;
        dcterms:created ?recCreationDateObj .

    ?recCreationDateObj rico:normalizedDateValue ?recCreationDateStr .
    # https://www.w3.org/TR/sparql11-query/#func-strdt
    BIND(STRDT(?recCreationDateStr, xsd:date) as ?recCreationDate) 
}

"""

g: Graph = Graph()
g.parse('data.json')

qres: Result = g.query(sparql_construct_query)

qres.serialize(destination='data-converted.json', format='json-ld')

result (schema.org)

[
  {
    "@id": "http://www.example.com/1",
    "@type": [
      "http://schema.org/Dataset"
    ],
    "http://schema.org/dateCreated": [
      {
        "@type": "http://www.w3.org/2001/XMLSchema#date",
        "@value": "1964-01-01"
      }
    ]
  }
]

The date range string "1964/1966" (https://www.ica.org/standards/RiC/RiC-O_v0-2.html#normalizedDateValue) is converted to an xsd:date "1964-01-01".

I had a look at STRDT and, if I got it right, isoformat is used and performs some conversion.

Looking at isoformat (datetime.py) two sources are mentioned:

https://www.w3.org/TR/NOTE-datetime says:

This document defines a profile of ISO 8601, the International Standard for the representation of dates and times. ISO 8601 describes a large number of date/time formats. To reduce the scope for error and the complexity of software, it is useful to restrict the supported formats to a small number. This profile defines a few date/time formats, likely to satisfy most requirements.

Whereas ISO 8601 supports time intervals, the note says that only a part of ISO 8601 is actually supported.

Would it be correct to have an xsd:date "1964/1966" in RDF? And if so, shouldn't rdflib's STRDT leave "1964/1966" as it is rather than convert it to "1964-01-01"?

tobiasschweizer commented 2 years ago

Looking at https://www.w3.org/TR/xmlschema-2/#date, I think an xsd:date should be a single date with day precision. Otherwise, gYear or gYearMonth could be used.
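
To make the distinction concrete, here is a minimal sketch that checks which of the three lexical forms (xsd:date, xsd:gYearMonth, xsd:gYear) a string matches. It uses only the Python standard library, not rdflib. The function name guess_xsd_datatype is hypothetical, and the patterns are a simplification: they assume four-digit years from 0001 to 9999 and an optional timezone suffix.

```python
import re
from datetime import date
from typing import Optional

# Simplified building blocks (assumption: four-digit years, 0001-9999 only).
_YEAR = r"(?!0000)\d{4}"
_TZ = r"(?:Z|[+-]\d{2}:\d{2})?"

_DATE = re.compile(rf"({_YEAR})-(\d{{2}})-(\d{{2}}){_TZ}")
_G_YEAR_MONTH = re.compile(rf"{_YEAR}-(?:0[1-9]|1[0-2]){_TZ}")
_G_YEAR = re.compile(rf"{_YEAR}{_TZ}")


def guess_xsd_datatype(value: str) -> Optional[str]:
    """Return the XSD datatype whose lexical form `value` matches, or None."""
    match = _DATE.fullmatch(value)
    if match:
        year, month, day = (int(part) for part in match.groups())
        try:
            # Reject impossible calendar dates such as 1964-02-30.
            date(year, month, day)
        except ValueError:
            return None
        return "xsd:date"
    if _G_YEAR_MONTH.fullmatch(value):
        return "xsd:gYearMonth"
    if _G_YEAR.fullmatch(value):
        return "xsd:gYear"
    # Anything else, including ISO 8601 intervals, fits none of the three.
    return None


for candidate in ("1964-01-01", "1964-05", "1964", "1964/1966"):
    print(candidate, "->", guess_xsd_datatype(candidate))
```

Under these assumptions, the interval "1964/1966" matches none of the three forms, which is consistent with reading xsd:date as a single day.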

edmondchuc commented 2 years ago

Since you're using schema.org, you can use schema:temporalCoverage to represent date ranges.

From the schema:temporalCoverage docs:

The temporalCoverage of a CreativeWork indicates the period that the content applies to, i.e. that it describes, either as a DateTime or as a textual string indicating a time period in ISO 8601 time interval format. In the case of a Dataset it will typically indicate the relevant time period in a precise notation (e.g. for a 2011 census dataset, the year 2011 would be written "2011/2012").

You can save it as a string instead of xsd:date to avoid RDFLib performing any conversions. Saving schema:temporalCoverage values as strings is valid, according to the docs.

tobiasschweizer commented 2 years ago

@edmondchuc Thanks a lot for your reply.

We're actually also using schema:temporalCoverage on schema:CreativeWork but I think it has a different meaning than schema:dateCreated. If you take the example of a radio broadcast, schema:dateCreated refers to the date when the broadcast itself was created etc. but there could still be a schema:temporalCoverage indicating the time of the broadcast's contents, e.g., a radio show about ancient history.

However, in terms of RDF I just want to make sure that my understanding is correct that an xsd:date is a single date with day precision. That would also explain the implementation of STRDT in rdflib.

edmondchuc commented 2 years ago

@tobiasschweizer I totally agree with you that schema:temporalCoverage and schema:dateCreated are different.

I just never imagined the value for schema:dateCreated would be a range. Why is it a range? Was it initially created in 1964 and then later updated in 1966? If so, the updated date should be captured with schema:dateModified, but I'm just making assumptions.

In any case, I think we've gone off topic from the original issue, sorry about that.


> Looking at https://www.w3.org/TR/xmlschema-2/#date, I think an xsd:date should be a single date with day precision.

I think this is the case, yes.