NASA-PDS / harvest

Standalone Harvest client application providing the functionality for capturing and indexing product metadata into the PDS Registry system (https://github.com/nasa-pds/registry).
https://nasa-pds.github.io/registry

As a data custodian, I want to load URLs / file paths without unnecessary / additional slashes #158

Closed: nutjob4life closed this issue 3 months ago

nutjob4life commented 6 months ago

Checked for duplicates

No - I haven't checked

πŸ§‘β€πŸ”¬ User Persona(s)

Data Engineer

πŸ’ͺ Motivation

See https://github.com/NASA-PDS/operations/issues/476 for context; the issue is that somehow some file paths with double-slashes in them got into the Registry. For example, see

    curl --silent 'https://pds.nasa.gov/api/search/1.0//products/urn:nasa:pds:cassini_uvis_solarocc_beckerjarmak2023::1.0/members/latest' \
        | json_pp | egrep '//data'

Those double-slashes cause the Deep Archive to also output double-slashes, which later fail validation.

These should not go into the Registry in the first place.
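For illustration, here is a minimal sketch of the kind of cleanup this story asks for: collapse repeated slashes in the path portion of a URL while leaving the scheme's "//" intact. This is not Harvest's actual code, and the example URL is hypothetical.

    import java.net.URI;
    import java.net.URISyntaxException;

    public class SlashNormalizer {

        // Collapse runs of slashes in a URL path or plain file path,
        // preserving the "//" that follows the scheme (e.g. "https://").
        public static String normalize(String url) throws URISyntaxException {
            URI uri = new URI(url);
            if (uri.getScheme() == null) {
                // Plain file path: just collapse repeated slashes.
                return url.replaceAll("/{2,}", "/");
            }
            String cleanPath = uri.getPath().replaceAll("/{2,}", "/");
            return new URI(uri.getScheme(), uri.getAuthority(), cleanPath,
                    uri.getQuery(), uri.getFragment()).toString();
        }

        public static void main(String[] args) throws URISyntaxException {
            // Hypothetical URL showing the "//data" flaw seen in the curl output above.
            System.out.println(normalize("https://pds.nasa.gov/archive//data/file.xml"));
            // prints: https://pds.nasa.gov/archive/data/file.xml
        }
    }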

πŸ“– Additional Details

No response

Acceptance Criteria

Given
When I perform
Then I expect

βš™οΈ Engineering Details

No response

al-niessner commented 5 months ago

@nutjob4life

The implication of this ticket would be that harvest must run validate on all products prior to ingestion. Given the hours that validate takes for some bundles, you may see some push back.

The other implication would be that harvest starts to implement a subset of validate. If they disagree, which one is right? It is the classic problem: a person with one clock knows what time it is, while a person with two clocks is never sure. Also, how much does harvest implement before it becomes the new validate?

nutjob4life commented 5 months ago

@jordanpadams consider @al-niessner's comment above ↑

"A good sailor always travels with one clock or threeβ€”never two." β€”A good sailor, possibly

jordanpadams commented 5 months ago

@nutjob4life @al-niessner Updated the story title to be more specific to this use case. We will not be running validate, but we want to load "cleaner" file paths / URLs in the future to avoid potential processing issues downstream, similar to what occurred with Deep Archive.

nutjob4life commented 5 months ago

Thanks @jordanpadams! πŸ™

al-niessner commented 5 months ago

@nutjob4life @jordanpadams

Do we scan the entire document or just the paths we butcher? Harvest tries to convert, if told to, paths from local file locations to HTTP server locations. It is simpler to correct the butchering than to scan the whole document, but a document could still contain Linux-valid and schema-valid paths with multiple slashes (if the schema allows them). The ones returned by curl look like the butchered variety, and the conversion is done using String.* methods, so nobody is checking along the way and the result could be schema invalid.

jordanpadams commented 5 months ago

@al-niessner Sorry for the lack of clarity here. @nutjob4life is referencing data in the Registry now, but the point of this enhancement is to prevent it from happening in the future upon ingest through Harvest.

Between this part of the config and this part of the config a double-slash is being injected in here when we are forming the URL. We just want to make sure that doesn't happen.
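To make the failure mode concrete, here is a hypothetical sketch of how concatenating two config-supplied strings injects the double slash, and how a defensive join avoids it. The variable names and values are illustrative, not Harvest's actual config options.

    public class UrlJoin {
        public static void main(String[] args) {
            // Stand-ins for the two config values mentioned above; names and
            // values are illustrative, not Harvest's real options.
            String baseUrl  = "https://pds.nasa.gov/archive/";
            String filePath = "/data/file.xml";

            // Naive concatenation: nobody checks the boundary between the two.
            System.out.println(baseUrl + filePath);
            // https://pds.nasa.gov/archive//data/file.xml

            // Defensive join: trim boundary slashes, then rejoin with exactly one.
            String joined = baseUrl.replaceAll("/+$", "")
                    + "/" + filePath.replaceAll("^/+", "");
            System.out.println(joined);
            // https://pds.nasa.gov/archive/data/file.xml
        }
    }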

al-niessner commented 5 months ago

> @al-niessner Sorry for the lack of clarity here. @nutjob4life is referencing data in the Registry now, but the point of this enhancement is to prevent it from happening in the future upon ingest through Harvest.
>
> Between this part of the config and this part of the config a double-slash is being injected in here when we are forming the URL. We just want to make sure that doesn't happen.

Yes, merging those two parts of the config file is the butchering I was referring to.
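For completeness, a quick check of the boundary cases such a join would need to handle when merging the two config parts (again purely illustrative, not Harvest code):

    public class UrlJoinCheck {
        static String join(String base, String path) {
            return base.replaceAll("/+$", "") + "/" + path.replaceAll("^/+", "");
        }

        public static void main(String[] args) {
            // All four slash combinations at the boundary collapse to one "/".
            for (String b : new String[] { "https://host/data", "https://host/data/" }) {
                for (String p : new String[] { "file.xml", "/file.xml" }) {
                    if (!join(b, p).equals("https://host/data/file.xml")) {
                        throw new AssertionError(join(b, p));
                    }
                }
            }
            System.out.println("all joins clean");
        }
    }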