NASA-PDS / harvest

Standalone Harvest client application providing the functionality for capturing and indexing product metadata into the PDS Registry system (https://github.com/nasa-pds/registry).
https://nasa-pds.github.io/registry

Refactor harvest to operate with new multi-tenant, serverless OpenSearch architecture #146

Closed al-niessner closed 8 months ago

al-niessner commented 8 months ago

🗒️ Summary

Brief summary of changes if not sufficiently described by commit messages.

⚙️ Test Data and/or Report

One of the following should be included here:

♻️ Related Issues

Closes #118

al-niessner commented 8 months ago

@alexdunnjpl @jordanpadams @nutjob4life @tloubrieu-jpl

The schema has been implemented and the examples updated. I would like a review because older "legacy" harvest config files cannot be used without minor changes. That should not be a problem because the URL is not necessary for non-testing harvest configs. Can the review be done during today's breakout? It would be good to keep going with this, but changing tag names or anything similar later will only get harder.

al-niessner commented 8 months ago

The current state of the schema and tests is:

$ xmllint --noout --schema configuration.xsd examples/bundles.xml examples/directories.xml examples/files.xml examples/xpaths.xml 
examples/bundles.xml:20: element harvest: Schemas validity error : Element 'harvest', attribute 'nodeName': [facet 'enumeration'] The value 'CHANGE_ME' is not an element of the set {'PDS_ATM', 'PDS_ENG', 'PDS_GEO', 'PDS_IMG', 'PDS_NAIF', 'PDS_PPI', 'PDS_RMS', 'PDS_SBN', 'PSA', 'JAXA', 'ROSCOSMOS'}.
examples/bundles.xml fails to validate
examples/directories.xml:17: element harvest: Schemas validity error : Element 'harvest', attribute 'nodeName': [facet 'enumeration'] The value 'CHANGE_ME' is not an element of the set {'PDS_ATM', 'PDS_ENG', 'PDS_GEO', 'PDS_IMG', 'PDS_NAIF', 'PDS_PPI', 'PDS_RMS', 'PDS_SBN', 'PSA', 'JAXA', 'ROSCOSMOS'}.
examples/directories.xml fails to validate
examples/files.xml:17: element harvest: Schemas validity error : Element 'harvest', attribute 'nodeName': [facet 'enumeration'] The value 'CHANGE_ME' is not an element of the set {'PDS_ATM', 'PDS_ENG', 'PDS_GEO', 'PDS_IMG', 'PDS_NAIF', 'PDS_PPI', 'PDS_RMS', 'PDS_SBN', 'PSA', 'JAXA', 'ROSCOSMOS'}.
examples/files.xml fails to validate
examples/xpaths.xml:3: element harvest: Schemas validity error : Element 'harvest', attribute 'nodeName': [facet 'enumeration'] The value 'CHANGE_ME' is not an element of the set {'PDS_ATM', 'PDS_ENG', 'PDS_GEO', 'PDS_IMG', 'PDS_NAIF', 'PDS_PPI', 'PDS_RMS', 'PDS_SBN', 'PSA', 'JAXA', 'ROSCOSMOS'}.
examples/xpaths.xml fails to validate

I left the nodeName value broken (CHANGE_ME) in each example just to show that xmllint is working as expected.
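For anyone who wants a quick check without xmllint, the enumeration failure in the output above can be approximated with a minimal Python sketch. It is not a substitute for validating against configuration.xsd: it only looks at the nodeName attribute, it assumes harvest is the root element, and it treats a missing nodeName as an error (the real schema may or may not require it). The function name check_node_name is hypothetical; the allowed values are copied from the xmllint error message.

```python
import xml.etree.ElementTree as ET
from typing import Optional

# Allowed values copied verbatim from the xmllint enumeration error.
ALLOWED_NODE_NAMES = frozenset({
    "PDS_ATM", "PDS_ENG", "PDS_GEO", "PDS_IMG", "PDS_NAIF", "PDS_PPI",
    "PDS_RMS", "PDS_SBN", "PSA", "JAXA", "ROSCOSMOS",
})


def check_node_name(xml_text: str) -> Optional[str]:
    """Return an error message, or None when nodeName is an allowed value.

    Hypothetical helper: it checks one attribute only and does not
    replace schema validation with xmllint.
    """
    root = ET.fromstring(xml_text)
    # Drop any namespace so '{ns}harvest' and 'harvest' both match.
    tag = root.tag.rsplit("}", 1)[-1]
    if tag != "harvest":
        # Assumption: 'harvest' is the root element of the config file.
        return f"expected root element 'harvest', found '{tag}'"
    value = root.get("nodeName")
    if value is None:
        # Assumption: the attribute is mandatory.
        return "attribute 'nodeName' is missing"
    if value not in ALLOWED_NODE_NAMES:
        return f"nodeName '{value}' is not one of {sorted(ALLOWED_NODE_NAMES)}"
    return None


if __name__ == "__main__":
    # The placeholder left in the examples is rejected; a real node passes.
    print(check_node_name('<harvest nodeName="CHANGE_ME"/>'))
    print(check_node_name('<harvest nodeName="PDS_GEO"/>'))
```

Replacing CHANGE_ME with any of the listed node names should clear this particular error, both here and in xmllint.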

jordanpadams commented 8 months ago

@al-niessner @tloubrieu-jpl was this review completed at the breakout yesterday?

al-niessner commented 8 months ago

@jordanpadams No

al-niessner commented 8 months ago

@jordanpadams

Changed some of the items, and I suggest renaming do -> load, or maybe ingest. Changed direct_url to server_url since that is really what it is and you like self-documenting names. The name cognito_client_id is now used for the same reason. Take a gander again and I will update the code again. I need to run the local tests to make sure I have not broken anything, but I do not think I have.

al-niessner commented 8 months ago

At this point (52aec27), harvest can load the bundle from #143 with what seem to be the same results:

[SUMMARY] Summary:
[SUMMARY] Skipped files: 0
[SUMMARY] Loaded files: 11139
[SUMMARY]   Product_Bundle: 1
[SUMMARY]   Product_Collection: 6
[SUMMARY]   Product_Document: 5
[SUMMARY]   Product_Observational: 11127
[SUMMARY] Failed files: 0
[SUMMARY] Package ID: 45cd1442-256d-4c80-9dcf-c54f2191f749

Next is to start looking at modifying registry-common so that it has a better registry connection interface than a public bean with no setters/getters. The new interface should allow for polymorphism, which helps when there are different ways to contact different services, such as Cognito or direct. That difference was the cause of the original issue.
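To make the polymorphism idea concrete, here is a minimal sketch of what such an interface could look like. It is written in Python for brevity and is not the registry-common design: every class and function name (RegistryConnection, DirectConnection, CognitoConnection, connection_from_config) is hypothetical. Only the config keys server_url and cognito_client_id come from this thread, and the rule that exactly one of them is present is an assumption.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


class RegistryConnection(ABC):
    """Hypothetical interface: callers depend on this, not on a concrete bean."""

    @abstractmethod
    def describe(self) -> str:
        """Return a human-readable summary of how the registry is reached."""


@dataclass(frozen=True)
class DirectConnection(RegistryConnection):
    """Talks straight to a server identified by server_url."""
    server_url: str

    def describe(self) -> str:
        return f"direct connection to {self.server_url}"


@dataclass(frozen=True)
class CognitoConnection(RegistryConnection):
    """Goes through Cognito, identified by cognito_client_id."""
    cognito_client_id: str

    def describe(self) -> str:
        return f"Cognito connection for client {self.cognito_client_id}"


def connection_from_config(config: dict) -> RegistryConnection:
    """Pick the implementation from whichever key the config provides.

    Assumption: exactly one of 'server_url' and 'cognito_client_id' is set.
    """
    has_url = "server_url" in config
    has_cognito = "cognito_client_id" in config
    if has_url == has_cognito:
        raise ValueError("expected exactly one of server_url or cognito_client_id")
    if has_url:
        return DirectConnection(config["server_url"])
    return CognitoConnection(config["cognito_client_id"])


if __name__ == "__main__":
    # Placeholder values for illustration only.
    for cfg in ({"server_url": "https://example.invalid"},
                {"cognito_client_id": "example-client-id"}):
        print(connection_from_config(cfg).describe())
```

The point of the sketch is that calling code only ever sees RegistryConnection, so adding another way to reach a service means adding a class rather than editing every caller.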

jordanpadams commented 8 months ago

@al-niessner note that we needed to update .secrets.baseline for secrets detection to run successfully. You may want to add the pre-commit hook to your local git config so this can be caught and updated locally in the future.

Information about pre-commit and Detect Secrets is being added to the Harvest README here: https://github.com/NASA-PDS/harvest/pull/150

But that pretty much links here: https://github.com/NASA-PDS/nasa-pds.github.io/wiki/Git-and-Github-Guide#detect-secrets