NASA-PDS / deep-archive

PDS Open Archival Information System (OAIS) utilities, including Submission Information Package (SIP) and Archive Information Package (AIP) generators
https://nasa-pds.github.io/deep-archive/

As a data custodian, I want the Deep Archive to work around invalid URLs in the Registry #162

Closed nutjob4life closed 4 months ago

nutjob4life commented 5 months ago

Checked for duplicates

No - I haven't checked

🧑‍🔬 User Persona(s)

Users like the ones in https://github.com/NASA-PDS/operations/issues/476

💪 Motivation

The Registry API seems to be loaded with some bad data, namely file paths like

'ops:Data_File_Info.ops:file_ref': ['https://pds-rings.seti.org/pds4/bundles/cassini_uvis_solarocc_beckerjarmak2023//data/collection_data.csv']

with a // between cassini_uvis_solarocc_beckerjarmak2023 and data. This causes the Deep Archive to output Submission Information Packages with double-slashes in them too, causing validation errors.

📖 Additional Details

See https://github.com/NASA-PDS/operations/issues/476 for a specific example.

Acceptance Criteria

Given a document in OpenSearch containing double-slashes in the URL path
When I perform pds-deep-registry-archive on the bundle containing that document
Then I expect the file paths and URLs output to be "cleaned" up to single slashes
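One way to picture the cleanup this criterion asks for is the minimal sketch below. It is not the Deep Archive's actual fix: the helper name `collapse_slashes` is hypothetical, and it assumes that only the path component of the URL should be normalized, leaving the `//` after the scheme untouched.

```python
import re
from urllib.parse import urlsplit, urlunsplit


def collapse_slashes(url: str) -> str:
    """Collapse runs of slashes in the path of ``url`` into a single slash.

    Hypothetical helper for illustration; the scheme, host, query, and
    fragment are passed through unchanged.
    """
    parts = urlsplit(url)
    # Only the path is rewritten, so the "//" in "https://" is preserved.
    cleaned_path = re.sub(r"/{2,}", "/", parts.path)
    return urlunsplit(parts._replace(path=cleaned_path))


# The file_ref value quoted in the Motivation section above.
bad_url = (
    "https://pds-rings.seti.org/pds4/bundles/"
    "cassini_uvis_solarocc_beckerjarmak2023//data/collection_data.csv"
)
print(collapse_slashes(bad_url))
```

Splitting the URL first, rather than running a regular expression over the whole string, avoids accidentally mangling the scheme separator.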

⚙️ Engineering Details

No response

jordanpadams commented 4 months ago

@nutjob4life I just realized I never triaged this and we probably need this cleaned up ASAP to unblock that operations ticket.

nutjob4life commented 4 months ago

@jordanpadams on it!

jordanpadams commented 4 months ago

Thanks @nutjob4life 🎉

gxtchen commented 1 month ago

@nutjob4life what url should I use to run the deep-registry-archive? I got "ValueError: 🤷‍♀️ The bundle urn:nasa:pds:cassini_uvis_solarocc_beckerjarmak2023::1.1 cannot be found in the registry at https://pds.nasa.gov/api/search/1.0/"

nutjob4life commented 1 month ago

Hi @gxtchen, I don't know the answer to this.

I believe the URL is correct but perhaps the registry is missing some data? @tloubrieu-jpl @jordanpadams could you take a peek? When I run it, I get the same thing:

mirasol 209 % .v/bin/pds-deep-registry-archive --site PDS_RNG urn:nasa:pds:cassini_uvis_solarocc_beckerjarmak2023::1.1
INFO 👟 PDS Deep Registry-based Archive, version 1.3.0
ERROR 💥 We got an unexpected error; sorry it didn't work out
Traceback (most recent call last):
  File "/Users/kelly/Documents/Clients/JPL/PDS/Development/nasa-pds/deep-archive/src/pds2/aipgen/registry.py", line 375, in main
    generatedeeparchive(args.url, args.bundle, args.site, not args.include_latest_collection_only)
  File "/Users/kelly/Documents/Clients/JPL/PDS/Development/nasa-pds/deep-archive/src/pds2/aipgen/registry.py", line 350, in generatedeeparchive
    prefixlen, bac, title = _comprehendregistry(url, bundlelidvid, allcollections)
  File "/Users/kelly/Documents/Clients/JPL/PDS/Development/nasa-pds/deep-archive/src/pds2/aipgen/registry.py", line 224, in _comprehendregistry
    raise ValueError(f"🤷‍♀️ The bundle {bundlelidvid} cannot be found in the registry at {url}")
ValueError: 🤷‍♀️ The bundle urn:nasa:pds:cassini_uvis_solarocc_beckerjarmak2023::1.1 cannot be found in the registry at https://pds.nasa.gov/api/search/1.0/
INFO 👋 Thanks for using this program! Bye!

jordanpadams commented 1 month ago

@gxtchen you cannot test this with the public registry until the multi-tenancy migration has completed: https://github.com/NASA-PDS/registry/issues/185

jordanpadams commented 1 month ago

You can try downloading that data, loading it into a local registry, and testing with that.

tloubrieu-jpl commented 1 month ago

@gxtchen you can wait to test that until the API is up again.