Closed seth-shaw-unlv closed 2 years ago
The ids are definitely a problem, but can you explain why the resource is using duplicate terms for the same field? I'm just looking for some example data.
Ah, for this content type we have two fields, field_geographic_coverage and field_place_of_publication, both of which are RDF mapped to dcterms:spatial:
field_geographic_coverage:
  properties:
    - 'dcterms:spatial'
field_place_of_publication:
  properties:
    - 'dcterms:spatial'
Both of these fields, in this case, reference Las Vegas (it was published here and the news covers the area); ergo, two spatial references to the same term. We probably need better predicates for each of these, but I didn't write the map. 🤷♂️
So the problem comes from here where we merge the property array back into the original array.
My example base array of
$normalized = {array} [1]
  @graph = {array} [3]
    http://localhost:8000/node/2 = {array} [8]
      @id = "http://localhost:8000/node/2"
      @type = {array} [1]
      http://schema.org/author = {array} [1]
      http://purl.org/dc/terms/title = {array} [1]
      http://schema.org/dateCreated = {array} [1]
      http://schema.org/dateModified = {array} [1]
      http://purl.org/dc/terms/extent = {array} [1]
      http://purl.org/dc/terms/spatial = {array} [1]
        0 = {array} [1]
          @id = "http://localhost:8000/taxonomy/term/33"
    http://localhost:8000/taxonomy/term/33 = {array} [2]
      @id = {array} [1]
        0 = "http://localhost:8000/taxonomy/term/33"
      @type = {array} [2]
        0 = "http://purl.org/dc/terms/Location"
        1 = "http://schema.org/Place"
and the incoming property array
$normalized_property = {array} [1]
  @graph = {array} [2]
    http://localhost:8000/taxonomy/term/33 = {array} [2]
      @id = "http://localhost:8000/taxonomy/term/33"
      @type = {array} [2]
        0 = "http://purl.org/dc/terms/Location"
        1 = "http://schema.org/Place"
    http://localhost:8000/node/2 = {array} [1]
      http://purl.org/dc/terms/spatial = {array} [1]
        0 = {array} [1]
          @id = "http://localhost:8000/taxonomy/term/33"
merge to become
$normalized = {array} [1]
  @graph = {array} [3]
    http://localhost:8000/node/2 = {array} [8]
      @id = "http://localhost:8000/node/2"
      @type = {array} [1]
      http://schema.org/author = {array} [1]
      http://purl.org/dc/terms/title = {array} [1]
      http://schema.org/dateCreated = {array} [1]
      http://schema.org/dateModified = {array} [1]
      http://purl.org/dc/terms/extent = {array} [1]
      http://purl.org/dc/terms/spatial = {array} [2]
        0 = {array} [1]
          @id = "http://localhost:8000/taxonomy/term/33"
        1 = {array} [1]
          @id = "http://localhost:8000/taxonomy/term/33"
    http://localhost:8000/user/1 = {array} [2]
    http://localhost:8000/taxonomy/term/33 = {array} [2]
      @id = {array} [2]
        0 = "http://localhost:8000/taxonomy/term/33"
        1 = "http://localhost:8000/taxonomy/term/33"
      @type = {array} [4]
        0 = "http://purl.org/dc/terms/Location"
        1 = "http://schema.org/Place"
        2 = "http://purl.org/dc/terms/Location"
        3 = "http://schema.org/Place"
It seems like some deduplication of the array might be valid. Edit: I originally had the arrays reversed relative to their descriptions.
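To make the failure mode above easier to reproduce outside Drupal, here is a rough Python model of the merge. The real code is PHP; `merge_recursive_like` is a hypothetical stand-in that only imitates how `array_merge_recursive` collects values sharing a string key, and it is not the module's actual code. Merging the same term description twice turns `@id` into a list and doubles `@type`, matching the dump above.

```python
TERM = "http://localhost:8000/taxonomy/term/33"


def merge_recursive_like(base, incoming):
    """Rough model of PHP's array_merge_recursive.

    dicts stand in for string-keyed arrays, lists for numerically keyed ones.
    Colliding values are collected into a list instead of being overwritten.
    """
    if isinstance(base, dict) and isinstance(incoming, dict):
        merged = dict(base)
        for key, value in incoming.items():
            if key in merged:
                merged[key] = merge_recursive_like(merged[key], value)
            else:
                merged[key] = value
        return merged
    # Lists are appended; scalars are wrapped so both values survive.
    base_list = base if isinstance(base, list) else [base]
    incoming_list = incoming if isinstance(incoming, list) else [incoming]
    return base_list + incoming_list


term = {
    "@id": TERM,
    "@type": ["http://purl.org/dc/terms/Location", "http://schema.org/Place"],
}

# The same term arrives once per field that references it.
merged = merge_recursive_like({TERM: term}, {TERM: term})
print(merged[TERM]["@id"])    # a list of two identical URIs, not a string
print(len(merged[TERM]["@type"]))
```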
Maybe look at https://git.drupalcode.org/project/drupal/-/blob/9.4.x/core/lib/Drupal/Component/Utility/NestedArray.php#L267-294 as opposed to using array_merge_recursive?
Oh yeah, I saw we are using that here. I'll take a look. I'm a little worried that you might want to keep some duplicates but not all 🤷
OK, I have worked on this. There is one last piece that could be cleaned up but could also be considered valuable as is. If you have two fields pointing to the same entity and sharing the same RDF-mapped predicate, the referenced entity now looks good:
{
  "@id": "http://localhost:8000/taxonomy/term/33",
  "@type": [
    "http://purl.org/dc/terms/Location",
    "http://schema.org/Place"
  ]
}
but it still appears as two entries in the main node.
"http://purl.org/dc/terms/spatial": [
  {
    "@id": "http://localhost:8000/taxonomy/term/33"
  },
  {
    "@id": "http://localhost:8000/taxonomy/term/33"
  }
],
because NestedArray::mergeDeep doesn't do any deduplication. I wrote a simple function to deduplicate the @type values; if desired, I could try to expand it to deduplicate the predicate arrays based on their @id values. Not sure how people feel about that.
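For discussion, a minimal sketch of what both steps could look like, written in Python rather than the module's PHP. The names `dedupe_preserving_order` and `dedupe_by_id` are hypothetical and are not the functions in the linked work; the sketch also assumes every `@id` is already a plain string.

```python
def dedupe_preserving_order(values):
    """Drop repeated scalar values (e.g. @type URIs), keeping first occurrences."""
    seen = set()
    unique = []
    for value in values:
        if value not in seen:
            seen.add(value)
            unique.append(value)
    return unique


def dedupe_by_id(objects):
    """Drop predicate values whose "@id" was already seen.

    Objects without an "@id" (for example literal values) are always kept.
    """
    seen = set()
    unique = []
    for obj in objects:
        node_id = obj.get("@id")
        if node_id is not None:
            if node_id in seen:
                continue
            seen.add(node_id)
        unique.append(obj)
    return unique


spatial = [
    {"@id": "http://localhost:8000/taxonomy/term/33"},
    {"@id": "http://localhost:8000/taxonomy/term/33"},
]
print(dedupe_by_id(spatial))
```

Keying only on `@id` is a deliberate simplification: two values that share an `@id` but differ in other keys would collapse to the first one, which may or may not be what people want here.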
Here's my current work
I'm fine with allowing duplicate predicate values as long as Fedora/Blazegraph don't choke on it. I don't think they would, but it would be best to check to make sure before spending more effort with further de-duplication.
Looks like both Fedora and Blazegraph only allow a single unique triple, so they are de-duplicated. Not sure if we should be relying on external software for that.
Also, looking at the RDF 1.1 specification it states here
The core structure of the abstract syntax is a set of triples, each consisting of a subject, a predicate and an object. A set of such triples is called an RDF graph.
Being a "set" would mean that you could not hold duplicate triples. So we probably should remove the duplicates, but it doesn't seem to be vital at this moment.
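The set semantics can be illustrated with a tiny Python sketch (an illustration only, not how Fedora or Blazegraph store data): treating the graph as a set of (subject, predicate, object) tuples collapses the two identical `dcterms:spatial` statements into one.

```python
NODE = "http://localhost:8000/node/2"
SPATIAL = "http://purl.org/dc/terms/spatial"
TERM = "http://localhost:8000/taxonomy/term/33"

# One statement per field mapped to dcterms:spatial.
triples = [
    (NODE, SPATIAL, TERM),  # from field_geographic_coverage
    (NODE, SPATIAL, TERM),  # from field_place_of_publication
]

graph = set(triples)  # an RDF graph is a set, so identical triples collapse
print(len(triples), len(graph))
```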
I noticed a warning pop up in my Milliner logs:
[2022-04-07 09:08:39] app.WARNING: E_WARNING: parse_url() expects parameter 1 to be string, array given {"code":2,"message":"parse_url() expects parameter 1 to be string, array given","file":"/data/crayfish/Milliner/src/Service/MillinerService.php","line":318}
I checked the JSON-LD of the object causing the warning and found that one of the taxonomy terms referenced by the resource ends up with an '@id' that is an array instead of a single URI:
This may stem from what appears to be the same term referenced twice in the object's metadata (trimmed for brevity):
but I have not confirmed this yet, simply speculated based on observation. In any case, the JSON-LD parser shouldn't return an array for the '@id' property.
I don't think I'll have time to dig into this more right now, as it will still index in Fedora, and I can live with warnings in the logs for the time being, but I wanted to make sure it was documented.
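For anyone who wants to check their own resources for the same symptom, here is a small Python sketch that scans a JSON-LD document for nodes whose '@id' is not a string. It assumes the document has a top-level `@graph` list of node objects; `find_non_string_ids` is a hypothetical helper, not part of Milliner or the JSON-LD module.

```python
import json


def find_non_string_ids(document):
    """Return the '@id' values in document["@graph"] that are not plain strings."""
    offending = []
    for node in document.get("@graph", []):
        if "@id" in node and not isinstance(node["@id"], str):
            offending.append(node["@id"])
    return offending


# Hand-written sample that mimics the symptom described above.
sample = json.loads("""
{
  "@graph": [
    {"@id": "http://localhost:8000/node/2"},
    {"@id": ["http://localhost:8000/taxonomy/term/33",
             "http://localhost:8000/taxonomy/term/33"]}
  ]
}
""")

print(find_non_string_ids(sample))
```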