DataONEorg / mnlite

Light weight read-only DataONE member node in Python Flask
Apache License 2.0

OpenTopography's records do not resolve an `identifier` #31

Closed iannesbitt closed 1 year ago

iannesbitt commented 1 year ago

Every OpenTopography record raises `JSON-LD no ids, not a Dataset:` because no identifier field is returned when mnlite.sonormalizepipeline.SoscanNormalizePipeline.process_item() attempts to find ids. The schema.org records look valid, even to the validator tools. The error appears to be introduced ~when sonormal.sosoNormalize calls sonormal.addSchemaOrgListContainer which uses pyld.jsonld.expand~ during sonormal.normalize.frameSODataset, which leaves only the following in the ids variable (note the empty identifier):

[{'@id': ['https://portal.opentopography.org/lidarDataset?opentopoID=OTLAS.042012.26911.5'], 'url': ['https://doi.org/10.5069/G9TB14TM'], 'identifier': []}]

Here is the dataset identifier field before and after normalization (i.e. what sonormal.normalize._getIdentifiers looks at to get ids).

Before expansion:

            'identifier': {'@id': 'https://portal.opentopography.org/lidarDataset?opentopoID=OTLAS.042012.26911.5',
                           '@type': 'PropertyValue',
                           'propertyID': 'opentopoID',
                           'value': 'OTLAS.042012.26911.5'},

After expansion but before compaction:

    "https://schema.org/identifier": [
      {"@id": "https://portal.opentopography.org/lidarDataset?opentopoID=OTLAS.042012.26911.5",
        "@type": ["https://schema.org/PropertyValue"],
        "https://schema.org/propertyID": [{"@value": "opentopoID"}],
        "https://schema.org/value": [{"@value": "OTLAS.042012.26911.5"}]
      }],

After initial compaction:

  "identifier": {
    "@id": "https://portal.opentopography.org/lidarDataset?opentopoID=OTLAS.042012.26911.5",
    "@type": "PropertyValue",
    "propertyID": "opentopoID",
    "value": "OTLAS.042012.26911.5"
  },

After sonormal.addSchemaOrgListContainer:

    "http://schema.org/identifier": [
      {"@list": [{
            "@id": "https://portal.opentopography.org/lidarDataset?opentopoID=OTLAS.042012.26911.5",
            "@type": ["http://schema.org/PropertyValue"],
            "http://schema.org/propertyID": [{"@value": "opentopoID"}],
            "http://schema.org/value": [{"@value": "OTLAS.042012.26911.5"}]
      }]}],

After sonormal.normalize.frameSODataset:

    "http://schema.org/identifier": [{"@list":
          [{"@id": "https://portal.opentopography.org/lidarDataset?opentopoID=OTLAS.042012.26911.5"}]
    }],

Potential solutions:

  1. Change sonormal.normalize._getIdentifiers to include the @id tag in its search (it currently only gets @value). This is the quick fix: I tested it and it works, but it uses the whole URL as the SID, which isn't ideal
  2. Change either framing (pyld.jsonld.frame) or expansion (pyld.jsonld.expand) (or the frame provided by sonormal.normalize.frameSODataset) such that the output keeps 'value': 'OTLAS.042012.26911.5' in the identifier list
  3. Contact OT to have them change their SO template to something normalization can handle
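
On the drawback noted in option 1 (the whole URL ends up as the SID): a minimal sketch of how a shorter SID could be derived from the @id URL. It assumes the landing-page URL always carries the identifier in an opentopoID query parameter, as it does in the record above. The name sid_from_landing_url is hypothetical and is not part of sonormal or mnlite.

```python
from urllib.parse import parse_qs, urlparse


def sid_from_landing_url(url, param="opentopoID"):
    """Return the value of `param` from the URL query, or the URL itself if absent.

    Hypothetical helper: assumes the identifier is carried in a query
    parameter of the landing-page URL.
    """
    query = parse_qs(urlparse(url).query)
    values = query.get(param)
    return values[0] if values else url


landing = "https://portal.opentopography.org/lidarDataset?opentopoID=OTLAS.042012.26911.5"
print(sid_from_landing_url(landing))  # OTLAS.042012.26911.5
```

URLs without the parameter are returned unchanged, so the helper would not silently discard an identifier it cannot shorten.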

Perhaps @datadavev has experience with this issue or something similar?

iannesbitt commented 1 year ago

Edit: I initially mixed up my JSON-LD variables. The post above now has the correct JSON-LD outputs and the correct function attribution (sonormal.normalize.frameSODataset).

iannesbitt commented 1 year ago

The error happens in pyld.jsonld.frame(), but I don't know whether it is caused by the contents of sonormal.SO_DATASET_FRAME or by an issue within the function itself.

Another possibility is setting require_identifier = False in sonormalizepipeline, but I'm not sure what would serve as SID in that case...

It's a silly problem to have because there are several places in the graph where an http://schema.org/identifier value is set, just not in the place where sonormal.normalize.getDatasetsIdentifiers is looking. For example:

>>> _framed[0]["http://schema.org/includedInDataCatalog"][0]["http://schema.org/identifier"][0]["@list"][0]["http://schema.org/value"][0]["@value"]
'10.5069/G9TB14TM'

or even taking it from the graph one step before framing occurs:

>>> sosodoc[0]["http://schema.org/identifier"][0]["@list"][0]["http://schema.org/value"][0]["@value"]
'OTLAS.042012.26911.5'
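
The chained lookups above raise KeyError or IndexError as soon as one level is missing. A minimal sketch of a more tolerant walk over the same expanded structure follows. The function name identifier_values is hypothetical, and the two sample nodes are copied from the outputs shown earlier for sonormal.addSchemaOrgListContainer and sonormal.normalize.frameSODataset.

```python
SO_IDENTIFIER = "http://schema.org/identifier"
SO_VALUE = "http://schema.org/value"


def identifier_values(node):
    """Collect literal identifier values from one expanded JSON-LD node."""
    found = []
    for entry in node.get(SO_IDENTIFIER, []):
        if not isinstance(entry, dict):
            continue
        # Entries may be wrapped in "@list"; otherwise treat the entry itself as the item.
        for item in entry.get("@list", [entry]):
            if not isinstance(item, dict):
                continue
            if "@value" in item:
                found.append(item["@value"])
            for value in item.get(SO_VALUE, []):
                if isinstance(value, dict) and "@value" in value:
                    found.append(value["@value"])
    return found


URL = "https://portal.opentopography.org/lidarDataset?opentopoID=OTLAS.042012.26911.5"

before_framing = {
    SO_IDENTIFIER: [{"@list": [{
        "@id": URL,
        "@type": ["http://schema.org/PropertyValue"],
        "http://schema.org/propertyID": [{"@value": "opentopoID"}],
        SO_VALUE: [{"@value": "OTLAS.042012.26911.5"}],
    }]}]
}
after_framing = {SO_IDENTIFIER: [{"@list": [{"@id": URL}]}]}

print(identifier_values(before_framing))  # ['OTLAS.042012.26911.5']
print(identifier_values(after_framing))   # []
```

The empty second result matches the empty identifier list reported at the top of this issue.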

iannesbitt commented 1 year ago

I have isolated this issue to framing rather than a software bug. It seems to be a problem with the identifier tag itself.

iannesbitt commented 1 year ago

This probably shouldn't be used, but here's the code change in sonormal.normalize that would allow scraping @id in addition to @value:

def _getIdentifiers(doc):
    ids = []
    v = doc.get("@value", None)
    if not v is None:
        ids.append(v)
        return ids
    vs = doc.get(sonormal.SO_VALUE, [])
    for av in vs:
        v = av.get("@value", None)
        if v is not None:
            ids.append(v)
+   v = doc.get("@id", None)
+   if not v is None:
+       ids.append(v)
+       return ids
    return ids
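
For reference, a self-contained sketch of a variant of the change above that can be run outside sonormal. It differs from the diff in one respect, which is an assumption of this sketch and not the behaviour of the patch: @id is used only as a fallback when no literal value was found, so the full URL never displaces a shorter identifier. SO_VALUE here is an assumed stand-in for sonormal.SO_VALUE, and the function name and the allow_id_fallback parameter are hypothetical.

```python
SO_VALUE = "http://schema.org/value"  # assumed stand-in for sonormal.SO_VALUE


def get_identifiers_sketch(doc, allow_id_fallback=True):
    """Return identifier strings found in one expanded identifier node."""
    # A bare literal such as {"@value": "abc"} is the identifier itself.
    v = doc.get("@value")
    if v is not None:
        return [v]
    ids = []
    # PropertyValue-style node: read the literals under schema.org/value.
    for av in doc.get(SO_VALUE, []):
        v = av.get("@value")
        if v is not None:
            ids.append(v)
    # Fall back to the node's @id only when nothing else was found.
    if not ids and allow_id_fallback:
        v = doc.get("@id")
        if v is not None:
            ids.append(v)
    return ids


url = "https://portal.opentopography.org/lidarDataset?opentopoID=OTLAS.042012.26911.5"
print(get_identifiers_sketch({"@id": url}))
print(get_identifiers_sketch({"@id": url, SO_VALUE: [{"@value": "OTLAS.042012.26911.5"}]}))
```

The first call returns the URL, which is the case the framed OpenTopography records fall into. The second returns only the literal value.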

iannesbitt commented 1 year ago

Got the ok from @rushirajnenuji to make this change as @id can replace other values in the identifier field during framing.

iannesbitt commented 1 year ago

Closing as fixed.

mbjones commented 1 year ago

> Got the ok from @rushirajnenuji to make this change as @id can replace other values in the identifier field during framing.

@iannesbitt, we should discuss the implications of this change before you finalize it, please. Let's discuss during our dev meeting on Thursday.

iannesbitt commented 1 year ago

Ok. I still have some questions about it as well. Thank you for catching this.

iannesbitt commented 1 year ago

This issue has been determined to stem from the use of dual @id tags in the JSON-LD document. It seems that framing drops the other fields in the identifier when the identifier contains an @id and an @id tag also exists at the level above.

Framing example with and without identifier > @id.
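
A minimal sketch of how records with this pattern could be flagged before normalization, so affected repositories can be identified without running the framing step. It checks only the compacted source document and does not reproduce pyld's behaviour. The function name identifiers_with_own_id is hypothetical. The sample record combines the dataset @id and the identifier block quoted at the top of this issue.

```python
def identifiers_with_own_id(record):
    """Return identifier nodes that carry an "@id" while the record has one too."""
    if "@id" not in record:
        return []
    identifiers = record.get("identifier", [])
    # schema.org allows a single identifier or a list of them.
    if isinstance(identifiers, dict):
        identifiers = [identifiers]
    elif not isinstance(identifiers, list):
        identifiers = []
    return [i for i in identifiers if isinstance(i, dict) and "@id" in i]


record = {
    "@id": "https://portal.opentopography.org/lidarDataset?opentopoID=OTLAS.042012.26911.5",
    "@type": "Dataset",
    "identifier": {
        "@id": "https://portal.opentopography.org/lidarDataset?opentopoID=OTLAS.042012.26911.5",
        "@type": "PropertyValue",
        "propertyID": "opentopoID",
        "value": "OTLAS.042012.26911.5",
    },
}

flagged = identifiers_with_own_id(record)
print(len(flagged))  # 1
```

Plain string identifiers, or PropertyValue nodes without an @id, are not flagged.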

Instead of a code change, I will contact OpenTopography directly. See DataONEorg/member-repos#15