NASA-PDS / registry-api

Web API service for the PDS Registry, providing the implementation of the PDS Search API (https://github.com/nasa-pds/pds-api) for the PDS Registry.
https://nasa-pds.github.io/pds-api

As a user, I want to query-filter products by collection- and/or bundle- membership #298

Open jordanpadams opened 1 year ago

jordanpadams commented 1 year ago

Checked for duplicates

Yes - I've already checked

🧑‍🔬 User Persona(s)

Data User

💪 Motivation

...so that I can search for the products of a bundle/collection, and then provide additional filters to the search within the same query.

📖 Additional Details

Follow-on to https://github.com/NASA-PDS/registry-api/issues/197: we do not want to support q= from a /members or /member-of endpoint, so we need some other way to provide this query functionality.

Acceptance Criteria

Given When I perform Then I expect

⚙️ Engineering Details

Initial design idea is to update the provenance script so that it adds collection_lidvid and bundle_lidvid to each product.

In order to support the example from #197

curl --get 'http://pds.nasa.gov/api/search-en-gamma/1.1/classes/collections/urn:nasa:pds:gbo.pluto-charon.mutual-events:data::1.0/members' --data-urlencode 'q=pds:External_Reference.pds:reference_text eq "YOUNG1992"' -H Accept:application/json -L | json_pp

the API query would instead be something like:

/products?q=collection_lidvid eq "urn:nasa:pds:gbo.pluto-charon.mutual-events:data::1.0" AND pds:External_Reference.pds:reference_text eq "YOUNG1992"
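
As a rough illustration of how a client might build that request, here is a minimal Python sketch. It assumes the proposed collection_lidvid field ends up queryable under exactly that name, which is not confirmed anywhere in this thread. The helper name build_products_query is hypothetical, and the base URL is simply reused from the curl example above.

```python
from urllib.parse import urlencode


def build_products_query(base_url, collection_lidvid, extra_filter=None):
    """Build a /products URL filtered by collection membership.

    Hypothetical helper: `collection_lidvid` is the field name proposed
    in this issue, not a confirmed part of the API.
    """
    clauses = [f'collection_lidvid eq "{collection_lidvid}"']
    if extra_filter:
        # Any additional filter is combined in the same q= expression.
        clauses.append(extra_filter)
    query = " AND ".join(clauses)
    # urlencode percent-encodes quotes, colons and spaces in the q value.
    return f"{base_url}/products?{urlencode({'q': query})}"


url = build_products_query(
    "http://pds.nasa.gov/api/search-en-gamma/1.1",
    "urn:nasa:pds:gbo.pluto-charon.mutual-events:data::1.0",
    'pds:External_Reference.pds:reference_text eq "YOUNG1992"',
)
print(url)
```

The point of the sketch is only that membership and the extra filter travel in a single q= expression, rather than membership being encoded in the endpoint path.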

alexdunnjpl commented 1 year ago

@jordanpadams @tloubrieu-jpl for all that I'd much prefer to write a python script and be done with it, isn't this strictly a job for harvest?

Pros: (Python script - should be separate from provenance imho but that's whatever)

Cons:

I suppose the correct solution is to bandaid it with a python script, implement it in harvest as well, then rip off the bandaid once the updated version of harvest is deployed everywhere it needs to be.

Note to self - LIDs are strictly-defined in the PDS Standards Reference as urn:<national_agency>:<archiving_agency>:<bundle>:<?collection>:<?product>, so it's trivial to split and extract bundle/collection by chunk index.
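
A minimal sketch of that chunk-index split, assuming every identifier really follows the layout above. The function name lid_prefixes is hypothetical. Note that this only recovers the parent LIDs: the version of the parent bundle or collection is not encoded in a product's LIDVID, so a full bundle_lidvid or collection_lidvid would still need a lookup.

```python
def lid_prefixes(lidvid):
    """Return (bundle_lid, collection_lid) embedded in a LID or LIDVID.

    Assumes the layout urn:<agency>:<archive>:<bundle>[:<collection>[:<product>]].
    collection_lid is None when the identifier has no collection chunk.
    """
    # Drop the version suffix ("::1.0") if one is present.
    lid = lidvid.split("::", 1)[0]
    chunks = lid.split(":")
    if len(chunks) < 4 or len(chunks) > 6 or chunks[0] != "urn":
        raise ValueError(f"unexpected LID layout: {lidvid!r}")
    bundle_lid = ":".join(chunks[:4])
    collection_lid = ":".join(chunks[:5]) if len(chunks) >= 5 else None
    return bundle_lid, collection_lid


print(lid_prefixes("urn:nasa:pds:gbo.pluto-charon.mutual-events:data::1.0"))
```

For a collection's own identifier the second element is the collection itself, so a caller would treat that case separately when writing membership metadata.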

alexdunnjpl commented 1 year ago

85ca61e7c7d3464f3b8ae12968ca5a7ff23fac02 implements addition of membership metadata to products whose documents lack such (to prevent having to update every product on every script run)

Metadata is currently written to the document in this format. I'm assuming the nesting isn't a problem but I can tweak it to a flat structure if need be.

All products will have that full membership metadata structure, with null indicating lack of membership (collections have no collection membership, bundles have neither membership).

Ensuring that this structure is included in the index is an outstanding question. @jordanpadams @jimmie @al-niessner @tloubrieu-jpl would it be appropriate for the script to ensure presence of these fields in the index? I wouldn't think reindexing is necessary in that case as on first run, the index would be added and then the relevant metadata would be written for all products (triggering indexing on each product).

jordanpadams commented 1 year ago

@alexdunnjpl just as an FYI, even though the standard says this:

LIDs are strictly-defined in the PDS Standards Reference

That is not actually always the case. There is an alternate_ids field that was added to the registry a while back to support backwards compatibility there because there are cases where a new version of a product contained a different LID.

jordanpadams commented 1 year ago

@alexdunnjpl per:

Metadata is currently written to the document in this format. I'm assuming the nesting isn't a problem but I can tweak it to a flat structure if need be.

how would a user then query for that information given its nesting? for other metadata we have added to the registry, we have been flattening it for the time being, e.g. ops:Harvest_Info/ops:archive_status. We may want to stick to that paradigm for the time being?
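
For illustration only, flattening a nested structure into slash-separated keys in the style of ops:Harvest_Info/ops:archive_status could look like the sketch below. The nested field names (ops:Membership and its children) are hypothetical placeholders, since the actual format written by the script is not shown in this thread.

```python
def flatten(doc, sep="/", prefix=""):
    """Flatten nested dicts into a single level with sep-joined keys."""
    flat = {}
    for key, value in doc.items():
        path = f"{prefix}{sep}{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse into nested objects, carrying the path so far.
            flat.update(flatten(value, sep, path))
        else:
            # Leaf values (including None for "no membership") are kept as-is.
            flat[path] = value
    return flat


# Hypothetical nested membership metadata for a collection document.
nested = {
    "ops:Membership": {
        "ops:bundle_lidvid": "urn:nasa:pds:gbo.pluto-charon.mutual-events::1.0",
        "ops:collection_lidvid": None,
    }
}
print(flatten(nested))
```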

jordanpadams commented 1 year ago

@alexdunnjpl per:

for all that I'd much prefer to write a python script and be done with it, isn't this strictly a job for harvest?

similar to the provenance script, we could do this in harvest, but there are a few reasons why we want this in a separate script (e.g. within the provenance script):

  1. there is no requirement for a user to execute harvest on a bundle. it could be a collection or directory or an individual file, so we can't assume harvest has any more information than some separate standalone script.
  2. performance - having to keep all this information, query the registry, etc., slows down the execution, which is constantly a complaint from our users. we need to do as little work as we can alongside the data, and do the rest of the processing on our end.