Closed nutjob4life closed 3 months ago
@nutjob4life
The implication of this ticket would be that harvest must run validate on all products prior to ingestion. Given the hours that validate takes for some bundles, you may see some push back.
The other implication would be that harvest starts to implement a subset of validate. If they disagree, then which one is right? It is the classic problem: a person with one clock knows what time it is, while a person with two clocks is never sure. Also, how much of validate does harvest implement before it becomes the new validate?
@jordanpadams consider @al-niessner's comment above
"A good sailor always travels with one clock or three, never two." (A good sailor, possibly)
@nutjob4life @al-niessner Update the story title to be more specific to this use case. We will not be running validate, but we want to load "cleaner" file paths / URLs in the future to avoid potential processing issues downstream, similar to what occurred with Deep Archive.
Thanks @jordanpadams!
@nutjob4life @jordanpadams
Do we scan the entire document or just the paths we butcher? Harvest tries to convert, if told to, paths from local files to HTTP server locations. It is simpler to correct the butchering than to scan the whole document, but paths with multiple slashes could still be Linux-valid and schema-valid (if allowed by the schema). The ones returned by curl look like the butchered variety, and the conversion is done using String.* methods, so nothing checks the result along the way and it could be schema-invalid.
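To illustrate the point that nothing catches these along the way: a double-slashed URL is syntactically valid, so parsing it succeeds and only an explicit check would flag it. A minimal sketch using the standard `java.net.URI` class (the URL here is illustrative, not an actual Registry entry):

```java
import java.net.URI;
import java.net.URISyntaxException;

public class SlashCheck {
    public static void main(String[] args) throws URISyntaxException {
        // A double slash in the path is legal URI syntax, so this parses fine.
        URI uri = new URI("https://pds.nasa.gov/data//bundle/product.xml");
        System.out.println(uri.getPath());                // /data//bundle/product.xml

        // Catching it requires an explicit check, e.g.:
        System.out.println(uri.getPath().contains("//")); // true
    }
}
```

In other words, string-based conversion plus URI parsing will happily pass a doubled slash through; a deliberate guard is needed somewhere in the pipeline.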
@al-niessner Sorry for the lack of clarity here. @nutjob4life is referencing data in the Registry now, but the point of this enhancement is to prevent it from happening in the future upon ingest through Harvest.
Between this part of the config and this part of the config, a double slash is being injected when we form the URL. We just want to make sure that doesn't happen.
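The failure mode being described can be sketched as follows: if one config value ends with a slash and the other begins with one, naive concatenation injects a `//`. This is a hypothetical illustration, not the Harvest code; the class and method names are made up, and the fix shown (collapsing repeated slashes while preserving the scheme's `://`) is just one possible approach:

```java
public class UrlJoin {
    // Naive join of two config values: if baseUrl ends with '/' and
    // path starts with '/', the result contains a double slash.
    static String naiveJoin(String baseUrl, String path) {
        return baseUrl + path;
    }

    // Collapse runs of slashes, but leave the "://" after the scheme intact.
    static String normalize(String url) {
        return url.replaceAll("(?<!:)/{2,}", "/");
    }

    public static void main(String[] args) {
        String base = "https://pds.nasa.gov/data/";  // trailing slash in config
        String path = "/bundle/product.xml";         // leading slash in config
        String bad = naiveJoin(base, path);
        System.out.println(bad);            // https://pds.nasa.gov/data//bundle/product.xml
        System.out.println(normalize(bad)); // https://pds.nasa.gov/data/bundle/product.xml
    }
}
```

Normalizing at the point where the two config values are merged would keep the doubled slash out of the Registry regardless of how users write their config.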
Yes, merging those two parts of the config file is the butchering I was referring to.
Checked for duplicates
No - I haven't checked
🧑‍🔬 User Persona(s)
Data Engineer
💪 Motivation
See https://github.com/NASA-PDS/operations/issues/476 for context; the issue is that somehow some file paths with double-slashes in them got into the Registry. For example, see
Those double-slashes cause the Deep Archive to also output double-slashes, which later fail validation.
These should not go into the Registry in the first place.
📖 Additional Details
No response
Acceptance Criteria
Given
When I perform
Then I expect
⚙️ Engineering Details
No response