NASA-PDS / doi-service

Service and tools for generating DOIs for PDS bundles, collections, and data sets
https://nasa-pds.github.io/doi-service

Improve metadata based upon parameters ADS keys off of #282

Closed jordanpadams closed 1 year ago

jordanpadams commented 2 years ago

💪 Motivation

...so that I can integrate more seamlessly with ADS

📖 Additional Details

Per info from Anne R., here are the fields that are highest priority for ADS searches:

⚖️ Acceptance Criteria

Given a DOI, when I perform a query of that DOI from the DOI service or DataCite search, then I expect the metadata returned for that DOI to contain the improvements described above.

⚙️ Engineering Details

alexdunnjpl commented 1 year ago

@jordanpadams what's the definition of the "available" date? Should be trivial to add under the dates attribute as

{
  "date": $someValue,
  "dateType": "Available",
  "dateInformation": <let me know if something should go here - perhaps the definition of that date?>
}

Will need to add to search criteria for DOICoreActionList.

alexdunnjpl commented 1 year ago

Global (per-deployment) keywords are populated from a line in the configuration.

It appears that setting global keywords in the config is the only method currently implemented - every mention of mutation of .keywords uses get_global_keywords()

Keywords are not currently implemented as search criteria in DOICoreActionList.

@jordanpadams what is the desired query functionality here?

alexdunnjpl commented 1 year ago

types attribute has properties resourceType (freeform string, mapped to Doi.product_type_specific) and resourceTypeGeneral (schema-enumerated string, mapped to Doi.product_type and enum values ProductType)

Both properties are mapped from the product_class pds4 field. @jordanpadams please advise whether any updates to these mappings are necessary.

relatedIdentifiers attribute also has an optional resourceTypeGeneral property, but we don't appear to be setting that directly, anywhere.

alexdunnjpl commented 1 year ago

The References requirement is too vague to do much with. @jordanpadams please advise.

I couldn't find any existing references to citations of other products. Is this related?

"relatedIdentifiers": {
    "type": "array",
    "items": {
        "type": "object",
        "properties": {
            "relatedIdentifier": {"type": "string"},
            "relatedIdentifierType": {"$ref": "#/definitions/relatedIdentifierType"},
            "relationType": {"$ref": "#/definitions/relationType"},
            "relatedMetadataScheme": {"type": "string"},
            "schemeURI": {"type": "string", "format": "uri"},
            "schemeType": {"type": "string"},
            "resourceTypeGeneral": {"$ref": "#/definitions/resourceTypeGeneral"}
        },
        "required": ["relatedIdentifier", "relatedIdentifierType", "relationType"],
        "if": {
            "properties": {
                "relationType": {"enum": ["HasMetadata", "IsMetadataFor"]}
            }
        },
        "else": {
            "$comment": "these properties may only be used with relation types HasMetadata/IsMetadataFor",
            "properties": {
                "relatedMetadataScheme": false,
                "schemeURI": false,
                "schemeType": false
            }
        }
    },
    "uniqueItems": true
}
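
The `if`/`else` clause in the schema above can be read as: the three scheme properties are rejected unless `relationType` is `HasMetadata` or `IsMetadataFor`. A minimal Python sketch of that rule follows. The function name `related_identifier_errors` is hypothetical and is not part of doi-service, and the sketch does not check the `$ref` enumerations for `relatedIdentifierType` or `relationType`.

```python
from typing import List

METADATA_RELATIONS = {"HasMetadata", "IsMetadataFor"}
SCHEME_ONLY_PROPERTIES = ("relatedMetadataScheme", "schemeURI", "schemeType")
REQUIRED_PROPERTIES = ("relatedIdentifier", "relatedIdentifierType", "relationType")


def related_identifier_errors(item: dict) -> List[str]:
    """Return human-readable problems for one relatedIdentifiers item (hypothetical helper)."""
    errors = [
        f"missing required property: {name}"
        for name in REQUIRED_PROPERTIES
        if name not in item
    ]
    # Mirrors the schema's "else" branch: it applies only when relationType is
    # present and is not one of the two metadata relation types.
    if "relationType" in item and item["relationType"] not in METADATA_RELATIONS:
        relation = item["relationType"]
        errors += [
            f"{name} not allowed with relationType {relation!r}"
            for name in SCHEME_ONLY_PROPERTIES
            if name in item
        ]
    return errors
```

An item with relation type `Cites` that also carries a `schemeURI` would be reported, while the same item with `HasMetadata` would pass.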

alexdunnjpl commented 1 year ago

Regarding the ORCIDs, where are the existing metadata guidelines? I can't see them in the docs.

Is the idea that ORCID will be a new, optional field for submitted DOIs?

jordanpadams commented 1 year ago

@jordanpadams what's the definition of the "available" date? Should be trivial to add under the dates attribute as

{
  "date": $someValue,
  "dateType": "Available",
  "dateInformation": <let me know if something should go here - perhaps the definition of that date?>
}

Will need to add to search criteria for DOICoreActionList.

@alexdunnjpl sounds great. for dateInformation maybe let's put something like "Date of first publication"
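
For illustration, the entry discussed above could be assembled roughly as follows. This is a sketch only: `build_available_date` is a hypothetical helper, not an existing doi-service function, and it assumes the date is already known as a Python `date`.

```python
from datetime import date
from typing import Optional


def build_available_date(
    value: date, information: Optional[str] = "Date of first publication"
) -> dict:
    """Build one entry for the DataCite 'dates' attribute (hypothetical helper)."""
    entry = {"date": value.isoformat(), "dateType": "Available"}
    # dateInformation is optional free text, so omit it when nothing is supplied.
    if information:
        entry["dateInformation"] = information
    return entry
```

The resulting dict would be appended to the record's `dates` list alongside any existing entries.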

jordanpadams commented 1 year ago

Global (per-deployment) keywords are populated from a line in the configuration.

It appears that setting global keywords in the config is the only method currently implemented - every mention of mutation of .keywords uses get_global_keywords()

Keywords are not currently implemented as search criteria in DOICoreActionList.

@jordanpadams what is the desired query functionality here?

  • is there a need to query for multiple keywords simultaneously?
  • if so, is there a need to support union and intersection?
  • if so, is there a need to support nested boolean logic?

@alexdunnjpl sorry about the confusion here. I think there may be somewhere else in the code where these values are appended to something where additional keywords are auto-generated. The confusion is that our code mentions keywords (left over from when we used OSTI as our DOI provider), whereas the current DataCite metadata calls these subjects. For instance, for DOI 10.17189/rbz8-2327, the keywords/subjects generated were:

                "subjects": [
                    { "subject": "PDS" },
                    { "subject": "PDS4" },
                    { "subject": "code" },
                    { "subject": "collection" },
                    { "subject": "consists" },
                    { "subject": "fortran" },
                    { "subject": "kmag" },
                    { "subject": "python" },
                    { "subject": "saturn" },
                    { "subject": "wrapper" }
                ],

These subjects/keywords are not really intended to be searchable from the PDS or even from the DOI Service search; they are intended to be searched from ADS.

So if I remember correctly, I think subjects are populated with 3 sets of values:

  1. config values (PDS, PDS4)
  2. keywords parsed from the label (e.g., see this label)
  3. some auto-generated keywords using some python library that parses the metadata and picks some keywords from the strings

I would say we go with 1 and 2 above, but let's remove 3. Let me know if you cannot track this down. In which case, we can wait until Thomas gets back and ask him about it.
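
A rough sketch of combining sources 1 and 2 into the DataCite `subjects` structure shown above. The helper name `build_subjects` is hypothetical, and how doi-service actually obtains the two keyword collections is not shown here.

```python
def build_subjects(config_keywords, label_keywords):
    """Merge config and label keywords into DataCite 'subjects' entries (hypothetical helper)."""
    # A set removes duplicates that appear in both sources; blank values are dropped.
    merged = {
        keyword.strip()
        for keyword in (*config_keywords, *label_keywords)
        if keyword and keyword.strip()
    }
    # Plain sorted() orders uppercase before lowercase, matching the example above.
    return [{"subject": keyword} for keyword in sorted(merged)]
```

For example, `build_subjects(["PDS", "PDS4"], ["saturn", "kmag"])` yields entries for `PDS`, `PDS4`, `kmag`, and `saturn`, in that order.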

jordanpadams commented 1 year ago

The References requirement is too vague to do much with. @jordanpadams please advise.

I couldn't find any existing references to citations of other products. Is this related?

"relatedIdentifiers": {
    "type": "array",
    "items": {
        "type": "object",
        "properties": {
            "relatedIdentifier": {"type": "string"},
            "relatedIdentifierType": {"$ref": "#/definitions/relatedIdentifierType"},
            "relationType": {"$ref": "#/definitions/relationType"},
            "relatedMetadataScheme": {"type": "string"},
            "schemeURI": {"type": "string", "format": "uri"},
            "schemeType": {"type": "string"},
            "resourceTypeGeneral": {"$ref": "#/definitions/resourceTypeGeneral"}
        },
        "required": ["relatedIdentifier", "relatedIdentifierType", "relationType"],
        "if": {
            "properties": {
                "relationType": {"enum": ["HasMetadata", "IsMetadataFor"]}
            }
        },
        "else": {
            "$comment": "these properties may only be used with relation types HasMetadata/IsMetadataFor",
            "properties": {
                "relatedMetadataScheme": false,
                "schemeURI": false,
                "schemeType": false
            }
        }
    },
    "uniqueItems": true
}

@alexdunnjpl sorry for the runaround here. let's scratch this. we can bring this back up at a later date if needed.

jordanpadams commented 1 year ago

types attribute has properties resourceType (freeform string, mapped to Doi.product_type_specific) and resourceTypeGeneral (schema-enumerated string, mapped to Doi.product_type and enum values ProductType)

Both properties are mapped from the product_class pds4 field. @jordanpadams please advise whether any updates to these mappings are necessary.

relatedIdentifiers attribute also has an optional resourceTypeGeneral property, but we don't appear to be setting that directly, anywhere.

@alexdunnjpl I think we are good here. I think I was just asking if we could take a look at the existing DOI metadata we have in DataCite and verify they match one of those expected values. Just wanted to make sure we didn't have any old DOIs out there that do not match this appropriately.
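
The check described above could be sketched as below. It assumes each record is a dict shaped like the `attributes` object of a DataCite record (with `doi` and `types` keys). `find_unexpected_resource_types` is a hypothetical name, and `ALLOWED_RESOURCE_TYPES` is only an illustrative subset; the authoritative list is the `resourceTypeGeneral` definition in the schema.

```python
# Illustrative subset only; replace with the values the schema actually enumerates.
ALLOWED_RESOURCE_TYPES = frozenset(
    {"Collection", "Dataset", "Image", "Software", "Text", "Other"}
)


def find_unexpected_resource_types(records, allowed=ALLOWED_RESOURCE_TYPES):
    """Return (doi, resourceTypeGeneral) pairs whose value is not in 'allowed'."""
    unexpected = []
    for record in records:
        general = record.get("types", {}).get("resourceTypeGeneral")
        # A missing value is reported too, as (doi, None).
        if general not in allowed:
            unexpected.append((record.get("doi"), general))
    return unexpected
```

An empty result would indicate that every existing DOI record carries one of the expected values.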

jordanpadams commented 1 year ago

Regarding the ORCIDs, where are the existing metadata guidelines? I can't see them in the docs.

Is the idea that ORCID will be a new, optional field for submitted DOIs?

@alexdunnjpl for starters, I cannot remember if we communicate with the DOI Editor with XML or JSON. If it is JSON, you can ignore below. We can just link to the metadata guidelines elsewhere on the DOI Editor and call it good.

Otherwise, if we are using XML, I am thinking we just include some commented out example of providing an ORCID and then we leave it to the user to input? I prefer self-documenting XML where possible, especially since these values cannot be pulled from the labels, so they must be manually input. if adding something commented out is not reasonable, that is OK too.

alexdunnjpl commented 1 year ago

@alexdunnjpl sounds great. for dateInformation maybe let's put something like "Date of first publication"

@jordanpadams still need an explicit definition for that date. Date of document? Date of DOI reservation? Date of DOI release? Something else?

Regarding keywords/subjects, thanks for the extra info/context - will take a look and sort that out.

Regarding ORCIDs, what's the "DOI Editor"? We're sending DOI records as JSON payloads to DataCite, but it sounds like maybe you're talking about something else.

jordanpadams commented 1 year ago

still need an explicit definition for that date. Date of document? Date of DOI reservation? Date of DOI release? Something else?

Date of first publication.

Regarding ORCIDs, what's the "DOI Editor"? We're sending DOI records as JSON payloads to DataCite, but it sounds like maybe you're talking about something else.

That answers my question. We will just handle this on the editor side of the house. https://pds-gamma.jpl.nasa.gov/tools/doi-editor/ (ping @viviant100 and SA team to gain access), https://github.com/NASA-PDS/doi-ui/

alexdunnjpl commented 1 year ago

Per jordanpadams, available date should be parsed from latest modification date in pds4 xml label, else returned as None
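
A minimal sketch of that parsing rule, using only the standard library. It assumes the label uses the PDS4 common namespace and stores dates under `Modification_Detail/modification_date`. The function name `latest_modification_date` is hypothetical and does not reflect how doi-service implements it.

```python
import xml.etree.ElementTree as ET
from datetime import date
from typing import Optional

# Assumed namespace of the PDS4 common dictionary.
PDS4_NS = {"pds": "http://pds.nasa.gov/pds4/pds/v1"}


def latest_modification_date(label_xml: str) -> Optional[date]:
    """Return the most recent modification_date in a PDS4 label, else None."""
    root = ET.fromstring(label_xml)
    dates = []
    path = ".//pds:Modification_Detail/pds:modification_date"
    for element in root.iterfind(path, PDS4_NS):
        text = (element.text or "").strip()
        try:
            # Keep only the YYYY-MM-DD part in case a suffix such as "Z" is present.
            dates.append(date.fromisoformat(text[:10]))
        except ValueError:
            continue  # skip empty or malformed values
    return max(dates) if dates else None
```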

alexdunnjpl commented 1 year ago

Per jordanpadams

@alexdunnjpl so sorry to do this, but after talking to some other stakeholders, we came to the realization the modification date will not be accurate for "Available" date, since they may have modified it in September, but it was not released until December. Can we comment out that code for now until we have a better idea of how we will get this date from the metadata?

alexdunnjpl commented 1 year ago

Since the currently-implemented value is given as Doi.publication_date and not available_date and it's used throughout doi-service, recommend not changing anything until @tloubrieu-jpl is back.

alexdunnjpl commented 1 year ago

Taking a second look at the subjects/keywords item:

I can't find any mechanism for an additional (item 3) source of keywords.

The XML label linked above yields the following keywords, which seems correct:

{'bundle', 'image', 'secondary', 'data_imaging', 'primary', 'mars2020_pixl', 'camera', 'mcc', 'micro-context', 'micro', 'mars2020_imgops', 'rover', 'product', 'context', '2020', 'data', 'mars', 'member', 'perseverance', 'pixl', 'collection', 'data_mcc_imgops'}

Will consider this on ice until @tloubrieu-jpl is back.

alexdunnjpl commented 1 year ago

@tloubrieu-jpl just re-pinging you so you're aware this is still current/active

tloubrieu-jpl commented 1 year ago

@alexdunnjpl @jordanpadams I will ask for an introduction of this ticket during the breakout today

alexdunnjpl commented 1 year ago

Per @jordanpadams in breakout meeting, no changes required to available/published date, currently

alexdunnjpl commented 1 year ago

Per @jordanpadams @tloubrieu-jpl , remove description from source targets for keyword generation.
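
As a sketch of what that change amounts to: the keyword generator only tokenizes an explicit list of source fields, and `description` is left out of that list. The field names, stop-word list, and tokenization rule below are assumptions for illustration and are not taken from doi-service.

```python
import re

# Hypothetical configuration: 'description' is deliberately absent.
KEYWORD_SOURCE_FIELDS = ("title",)
STOP_WORDS = frozenset({"a", "an", "and", "for", "in", "of", "the", "to"})
TOKEN_PATTERN = re.compile(r"[a-z0-9]+(?:[_-][a-z0-9]+)*")


def generate_keywords(metadata, source_fields=KEYWORD_SOURCE_FIELDS):
    """Collect lowercase keyword tokens from the selected metadata fields only."""
    keywords = set()
    for field in source_fields:
        text = str(metadata.get(field, "")).lower()
        for token in TOKEN_PATTERN.findall(text):
            if token not in STOP_WORDS and len(token) > 1:
                keywords.add(token)
    return keywords
```

With this arrangement, words that occur only in the description never reach the generated keyword set.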

tloubrieu-jpl commented 1 year ago

Conclusion for now: