Closed iannesbitt closed 1 year ago
Edit: I got my JSON-LD variables mixed up initially. I have posted the correct JSON-LD outputs and the correct function attribution (`sonormal.normalize.frameSODataset`) in the post above.
The error happens in `pyld.jsonld.frame()`, but I don't know whether it comes from the contents of `sonormal.SO_DATASET_FRAME` or from an issue within the function itself.
Another possibility is setting `require_identifier = False` in `sonormalizepipeline`, but I'm not sure what would serve as the SID in that case...
It's a silly problem to have because there are several places where the `http://schema.org/identifier` value is set in the graph... just not in the place where `sonormal.normalize.getDatasetsIdentifiers` is looking. For example:
>>> _framed[0]["http://schema.org/includedInDataCatalog"][0]["http://schema.org/identifier"][0]["@list"][0]["http://schema.org/value"][0]["@value"]
'10.5069/G9TB14TM'
or even taking it from the graph a step before framing occurs
>>> sosodoc[0]["http://schema.org/identifier"][0]["@list"][0]["http://schema.org/value"][0]["@value"]
'OTLAS.042012.26911.5'
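For illustration, here is a minimal sketch of reading the identifier from the expanded graph before framing. The helper name `first_identifier_value` and the sample document are hypothetical and not part of sonormal; only the key layout is taken from the lookups above.

```python
SO_IDENTIFIER = "http://schema.org/identifier"
SO_VALUE = "http://schema.org/value"


def first_identifier_value(node):
    """Return the first PropertyValue @value under schema.org/identifier, or None.

    Hypothetical helper: expects an expanded JSON-LD node shaped like the
    lookups shown above (identifier -> @list -> value -> @value).
    """
    for ident in node.get(SO_IDENTIFIER, []):
        # The identifier may or may not be wrapped in an @list container.
        entries = ident.get("@list", [ident])
        for entry in entries:
            for value in entry.get(SO_VALUE, []):
                if "@value" in value:
                    return value["@value"]
    return None


# Made-up sample mirroring the structure of sosodoc[0] above.
sample = {
    SO_IDENTIFIER: [
        {"@list": [{SO_VALUE: [{"@value": "OTLAS.042012.26911.5"}]}]}
    ]
}
print(first_identifier_value(sample))
```

This only sidesteps the problem by reading the value before framing runs; it does not explain why framing drops it.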
I have isolated this issue to framing and not software. This seems to be a problem with the identifier tag itself.
This probably shouldn't be used, but here's the code change in `sonormal.normalize` that would allow scraping `@id` in addition to `@value`:
 def _getIdentifiers(doc):
     ids = []
     v = doc.get("@value", None)
     if not v is None:
         ids.append(v)
         return ids
     vs = doc.get(sonormal.SO_VALUE, [])
     for av in vs:
         v = av.get("@value", None)
         if v is not None:
             ids.append(v)
+    v = doc.get("@id", None)
+    if not v is None:
+        ids.append(v)
+        return ids
     return ids
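As a standalone sketch of the same idea, the variant below only falls back to `@id` when no `@value` is found, so a URL is never mixed in with real identifier values. This is a variation on the diff above, not the code in sonormal: the function name `get_identifiers_with_id_fallback` is hypothetical, and the `SO_VALUE` constant is assumed to match `sonormal.SO_VALUE`.

```python
SO_VALUE = "http://schema.org/value"  # assumed to match sonormal.SO_VALUE


def get_identifiers_with_id_fallback(doc):
    """Collect identifier values, using @id only as a last resort.

    Hypothetical variant of _getIdentifiers for illustration only.
    """
    # A bare literal identifier: return it directly.
    v = doc.get("@value")
    if v is not None:
        return [v]

    # A PropertyValue-style identifier: collect every schema.org/value literal.
    ids = []
    for av in doc.get(SO_VALUE, []):
        v = av.get("@value")
        if v is not None:
            ids.append(v)

    # Nothing found: fall back to the node's @id (typically a full URL).
    if not ids:
        v = doc.get("@id")
        if v is not None:
            ids.append(v)
    return ids
```

With this ordering, a node that has both `@id` and a `schema.org/value` yields the value, and the `@id` URL is used as SID only when framing has left nothing else.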
Got the ok from @rushirajnenuji to make this change as `@id` can replace other values in the `identifier` field during framing.
Closing as fixed.
> Got the ok from @rushirajnenuji to make this change as `@id` can replace other values in the `identifier` field during framing.
@iannesbitt we should discuss the implications of this change before you finalize it please. Let's discuss during our dev meeting on Thursday.
Ok. I still have some questions about it as well. Thank you for catching this.
This issue has been determined to stem from the use of dual `@id` tags in the JSON-LD document. It seems that framing gets rid of other fields in the `identifier` when `@id` exists within it and when an `@id` tag exists at the level above.
Framing example with and without `identifier` > `@id`.
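A rough sketch of how records with this pattern could be flagged before normalization. The function name `has_nested_identifier_id`, the compact key `identifier`, and the example URLs are all assumptions made for illustration; this is not part of sonormal or mnlite.

```python
def has_nested_identifier_id(dataset):
    """Return True if a compact JSON-LD dataset node has an @id at the top
    level and also an @id inside one of its identifier entries.

    Hypothetical check for the "dual @id" pattern described above.
    """
    if "@id" not in dataset:
        return False
    identifiers = dataset.get("identifier", [])
    # identifier may be a single object, a plain string, or a list of either.
    if not isinstance(identifiers, list):
        identifiers = [identifiers]
    return any(isinstance(i, dict) and "@id" in i for i in identifiers)


# Made-up record with an @id on the dataset and on its identifier.
record = {
    "@id": "https://example.org/dataset/1",
    "@type": "Dataset",
    "identifier": {
        "@id": "https://example.org/id/1",
        "@type": "PropertyValue",
        "value": "OTLAS.042012.26911.5",
    },
}
print(has_nested_identifier_id(record))
```

A check like this only identifies affected records; it does not change how framing treats them.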
Instead of a code change, I will contact OpenTopography directly. See DataONEorg/member-repos#15
Every OpenTopography record raises `JSON-LD no ids, not a Dataset:` because there is no `identifier` field returned when `mnlite.sonormalizepipeline.SoscanNormalizePipeline.process_item()` attempts to find `ids`. The schema.org records look valid even to the validator tools. It looks like the error gets introduced ~when `sonormal.sosoNormalize` calls `sonormal.addSchemaOrgListContainer` which uses `pyld.jsonld.expand`~ during `sonormal.normalize.frameSODataset`, leaving only the following to be returned in the `ids` variable (empty `identifier`):

Here is the dataset `identifier` field before and after normalization (i.e. what `sonormal.normalize._getIdentifiers` looks at to get `ids`).

Before expansion:
After expansion but before compaction:
After initial compaction:
After `sonormal.addSchemaOrgListContainer`:
After `sonormal.normalize.frameSODataset`:

Potential solutions:
- Change `sonormal.normalize._getIdentifiers` to include the `@id` tag in its search (currently it only gets `@value`) <- quick fix. I tested this and it's working, but it uses the whole URL as SID, which isn't ideal.
- Change framing (`pyld.jsonld.frame`) or expansion (`pyld.jsonld.expand`) (or the frame provided by `sonormal.normalize.frameSODataset`) such that the output keeps `'value': 'OTLAS.042012.26911.5'` in the identifier list.

Perhaps @datadavev has experience with this issue or something similar?