JSONLD creating resource with an array of '@id' values

seth-shaw-unlv commented 2 years ago

I noticed a warning pop up in my Milliner logs: [2022-04-07 09:08:39] app.WARNING: E_WARNING: parse_url() expects parameter 1 to be string, array given {"code":2,"message":"parse_url() expects parameter 1 to be string, array given","file":"/data/crayfish/Milliner/src/Service/MillinerService.php","line":318}

I checked the JSON-LD of the object causing the warning and found that one of the taxonomy terms referenced by the resource ends up with an '@id' that is an array instead of a single URI:

    {
      "@id": [
        "https://special.library.unlv.edu/taxonomy/term/1010",
        "https://special.library.unlv.edu/taxonomy/term/1010"
      ],
      "@type": [
        "http://purl.org/dc/terms/Location",
        "http://schema.org/Place",
        "http://purl.org/dc/terms/Location",
        "http://schema.org/Place"
      ]
    }

This may stem from what appears to be the same term referenced twice in the object's metadata (trimmed for brevity):

    {
      "@id": "https://special.library.unlv.edu/node/558518",
      "@type": [
        "http://pcdm.org/models#Object",
        "bibo:Issue"
      ],
      "http://purl.org/dc/terms/spatial": [
        {
          "@id": "https://special.library.unlv.edu/taxonomy/term/1010"
        },
        {
          "@id": "https://special.library.unlv.edu/taxonomy/term/1010"
        }
      ],

but I have not confirmed this yet, simply speculated based on observation. In any case, the JSON-LD parser shouldn't return an array for the '@id' property.

I don't think I'll have time to dig into this more right now, as it will still index in Fedora, and I can live with warnings in the logs for the time being, but I wanted to make sure it was documented.

whikloj commented 2 years ago

The ids are definitely a problem, but can you explain why the resource is using duplicate terms for the same field? I'm just looking for some example data.

seth-shaw-unlv commented 2 years ago

Ah, for this content type we have two fields, field_geographic_coverage and field_place_of_publication, both of which are RDF mapped to dcterms:spatial:

    field_geographic_coverage:
      properties:
        - 'dcterms:spatial'
    field_place_of_publication:
      properties:
        - 'dcterms:spatial'

Both of these fields, in this case, reference Las Vegas (it was published here and the news covers the area); ergo, two spatial references to the same term. We probably need better predicates for each of these, but I didn't write the map. 🤷‍♂️

whikloj commented 2 years ago

So the problem comes from here where we merge the property array back into the original array.

My example base array of

$normalized = {array} [1]
    @graph = {array} [3]
        http://localhost:8000/node/2 = {array} [8]
            @id = "http://localhost:8000/node/2"
            @type = {array} [1]
            http://schema.org/author = {array} [1]
            http://purl.org/dc/terms/title = {array} [1]
            http://schema.org/dateCreated = {array} [1]
            http://schema.org/dateModified = {array} [1]
            http://purl.org/dc/terms/extent = {array} [1]
            http://purl.org/dc/terms/spatial = {array} [1]
                0 = {array} [1]
                    @id = "http://localhost:8000/taxonomy/term/33"
        http://localhost:8000/taxonomy/term/33 = {array} [2]
            @id = {array} [1]
                0 = "http://localhost:8000/taxonomy/term/33"
            @type = {array} [2]
                0 = "http://purl.org/dc/terms/Location"
                1 = "http://schema.org/Place"

and the incoming property array

$normalized_property = {array} [1]
    @graph = {array} [2]
        http://localhost:8000/taxonomy/term/33 = {array} [2]
            @id = "http://localhost:8000/taxonomy/term/33"
            @type = {array} [2]
                0 = "http://purl.org/dc/terms/Location"
                1 = "http://schema.org/Place"
        http://localhost:8000/node/2 = {array} [1]
            http://purl.org/dc/terms/spatial = {array} [1]
                0 = {array} [1]
                    @id = "http://localhost:8000/taxonomy/term/33"

merge to become

$normalized = {array} [1]
    @graph = {array} [3]
        http://localhost:8000/node/2 = {array} [8]
            @id = "http://localhost:8000/node/2"
            @type = {array} [1]
            http://schema.org/author = {array} [1]
            http://purl.org/dc/terms/title = {array} [1]
            http://schema.org/dateCreated = {array} [1]
            http://schema.org/dateModified = {array} [1]
            http://purl.org/dc/terms/extent = {array} [1]
            http://purl.org/dc/terms/spatial = {array} [2]
                0 = {array} [1]
                    @id = "http://localhost:8000/taxonomy/term/33"
                1 = {array} [1]
                    @id = "http://localhost:8000/taxonomy/term/33"
        http://localhost:8000/user/1 = {array} [2]
        http://localhost:8000/taxonomy/term/33 = {array} [2]
            @id = {array} [2]
                0 = "http://localhost:8000/taxonomy/term/33"
                1 = "http://localhost:8000/taxonomy/term/33"
            @type = {array} [4]
                0 = "http://purl.org/dc/terms/Location"
                1 = "http://schema.org/Place"
                2 = "http://purl.org/dc/terms/Location"
                3 = "http://schema.org/Place"

It seems like some deduplication of the array might be valid. Edit: I originally had the arrays reversed relative to their descriptions.
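
For anyone wanting to reproduce the merge outside of Drupal, here is a minimal sketch using plain PHP arrays. The data is a hypothetical, trimmed-down version of the dumps above (both sides hold '@id' as a string), and it assumes the merge is done with array_merge_recursive(), as suggested later in this thread.

```php
<?php
// Hypothetical, trimmed-down graph modelled on the dumps above.
$base = [
    '@graph' => [
        'http://localhost:8000/taxonomy/term/33' => [
            '@id' => 'http://localhost:8000/taxonomy/term/33',
            '@type' => [
                'http://purl.org/dc/terms/Location',
                'http://schema.org/Place',
            ],
        ],
    ],
];

// The second field contributes the same referenced term again.
$incoming = $base;

$merged = array_merge_recursive($base, $incoming);
$term = $merged['@graph']['http://localhost:8000/taxonomy/term/33'];

// Values under the same string key are combined into an array, so the
// scalar '@id' becomes a two-element array and '@type' doubles in length.
var_dump(is_array($term['@id'])); // bool(true)
var_dump(count($term['@type']));  // int(4)
```

This matches the shape of the merged dump: a doubled '@id' and four '@type' entries.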

jordandukart commented 2 years ago

Maybe look at https://git.drupalcode.org/project/drupal/-/blob/9.4.x/core/lib/Drupal/Component/Utility/NestedArray.php#L267-294 as opposed to using array_merge_recursive?

whikloj commented 2 years ago

Oh yeah, I saw we are using that here. I'll take a look. I'm a little worried you might want some duplicates but not all 🤷
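
To illustrate the difference, here is a simplified stand-in for the kind of deep merge being suggested. The function name merge_deep_sketch is hypothetical and this is not the Drupal implementation; it only assumes the documented behaviour that integer keys are appended while string keys are either merged recursively (when both sides are arrays) or overwritten.

```php
<?php
// Hypothetical, simplified deep merge; not the actual NestedArray code.
function merge_deep_sketch(array $a, array $b): array
{
    foreach ($b as $key => $value) {
        if (is_int($key)) {
            // Integer keys are appended, so list items are never deduplicated.
            $a[] = $value;
        } elseif (isset($a[$key]) && is_array($a[$key]) && is_array($value)) {
            // String keys holding arrays on both sides are merged recursively.
            $a[$key] = merge_deep_sketch($a[$key], $value);
        } else {
            // Otherwise the later value overwrites the earlier one.
            $a[$key] = $value;
        }
    }
    return $a;
}

$term = [
    '@id' => 'http://localhost:8000/taxonomy/term/33',
    '@type' => [
        'http://purl.org/dc/terms/Location',
        'http://schema.org/Place',
    ],
];

$merged = merge_deep_sketch($term, $term);

var_dump(is_string($merged['@id'])); // bool(true): overwritten, not wrapped in an array
var_dump(count($merged['@type']));   // int(4): list values are still appended
```

Under these assumptions the '@id' stays a single string, but lists such as '@type' still pick up duplicates.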

whikloj commented 2 years ago

OK, I have worked on this. There is one last piece that could be cleaned up, but it could also be considered valuable as is. If you have two fields pointing to the same entity and sharing the same RDF-mapped predicate, the referenced entity looks good:

    {
      "@id": "http://localhost:8000/taxonomy/term/33",
      "@type": [
        "http://purl.org/dc/terms/Location",
        "http://schema.org/Place"
      ]
    }

but it still appears as two entries in the main node.

    "http://purl.org/dc/terms/spatial": [
      {
        "@id": "http://localhost:8000/taxonomy/term/33"
      },
      {
        "@id": "http://localhost:8000/taxonomy/term/33"
      }
    ],

This is because NestedArray::mergeDeep doesn't do any deduplication. I wrote a simple function to deduplicate the @type values; if desired, I could try to expand it to deduplicate the predicate arrays based on their @id values. I'm not sure how people feel about that.

Here's my current work
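
Separately from the linked work, the two kinds of deduplication described above could be sketched roughly as follows. The helper names dedupe_types and dedupe_by_id are hypothetical and are not taken from the jsonld module; this is only one way it might be done.

```php
<?php
// Hypothetical helper: collapse repeated '@type' values on one resource.
function dedupe_types(array $resource): array
{
    if (isset($resource['@type']) && is_array($resource['@type'])) {
        // array_unique() keeps first occurrences; array_values() reindexes.
        $resource['@type'] = array_values(array_unique($resource['@type']));
    }
    return $resource;
}

// Hypothetical helper: drop repeated references to the same '@id'
// from a predicate's value list.
function dedupe_by_id(array $values): array
{
    $seen = [];
    $result = [];
    foreach ($values as $value) {
        $id = (is_array($value) && isset($value['@id']) && is_string($value['@id']))
            ? $value['@id']
            : null;
        if ($id !== null) {
            if (isset($seen[$id])) {
                continue; // Already holding a reference to this '@id'.
            }
            $seen[$id] = true;
        }
        // Values without a string '@id' (for example literals) are kept as-is.
        $result[] = $value;
    }
    return $result;
}

$term = dedupe_types([
    '@id' => 'http://localhost:8000/taxonomy/term/33',
    '@type' => [
        'http://purl.org/dc/terms/Location',
        'http://schema.org/Place',
        'http://purl.org/dc/terms/Location',
        'http://schema.org/Place',
    ],
]);

$spatial = dedupe_by_id([
    ['@id' => 'http://localhost:8000/taxonomy/term/33'],
    ['@id' => 'http://localhost:8000/taxonomy/term/33'],
]);

var_dump(count($term['@type'])); // int(2)
var_dump(count($spatial));       // int(1)
```

Whether the second helper is wanted depends on the open question here, namely whether duplicate predicate values should be preserved.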

seth-shaw-unlv commented 2 years ago

I'm fine with allowing duplicate predicate values as long as Fedora/Blazegraph don't choke on it. I don't think they would, but it would be best to check to make sure before spending more effort with further de-duplication.

whikloj commented 2 years ago

Looks like both Fedora and Blazegraph only allow a single unique triple, so they are de-duplicated. Not sure if we should be relying on external software for that.

Also, the RDF 1.1 specification states:

The core structure of the abstract syntax is a set of triples, each consisting of a subject, a predicate and an object. A set of such triples is called an RDF graph.

Being a "set" would mean that you could not hold duplicate triples. So we probably should remove the duplicates, but it doesn't seem to be vital at this moment.
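
As a rough illustration of the set semantics, here is a sketch that treats each triple as a [subject, predicate, object] list and keys it by its JSON encoding, so identical triples collapse into one. The representation is an assumption made for the example and is not how Fedora or Blazegraph store triples.

```php
<?php
// Two identical triples, as produced when two fields map the same term
// to dcterms:spatial.
$triples = [
    [
        'https://special.library.unlv.edu/node/558518',
        'http://purl.org/dc/terms/spatial',
        'https://special.library.unlv.edu/taxonomy/term/1010',
    ],
    [
        'https://special.library.unlv.edu/node/558518',
        'http://purl.org/dc/terms/spatial',
        'https://special.library.unlv.edu/taxonomy/term/1010',
    ],
];

// Key each triple by its JSON encoding; a repeated triple overwrites
// the same key instead of adding a second entry.
$set = [];
foreach ($triples as $triple) {
    $set[json_encode($triple)] = $triple;
}
$unique = array_values($set);

var_dump(count($unique)); // int(1)
```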

seth-shaw-unlv commented 2 years ago

Resolved with https://github.com/Islandora/jsonld/commit/630e10947dc0a5517f0fe0b29adab0c2b6247ce9

Islandora / documentation

JSONLD creating resource with an array of '@id' values #2082