NASA-PDS / harvest

Standalone Harvest client application providing the functionality for capturing and indexing product metadata into the PDS Registry system (https://github.com/nasa-pds/registry).
https://nasa-pds.github.io/registry

Update harvest to support `ByteOrderMark` (BOM) as its first bytes #142

Closed al-niessner closed 10 months ago

al-niessner commented 10 months ago

🗒️ Summary

Some of the XML files come with BOMs. Upgraded the file reading to process BOMs.
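The PR text does not show the actual change, so the following is only a minimal sketch of one common way to tolerate a leading BOM when reading a file. It assumes the BOM in question is the UTF-8 one (bytes `EF BB BF`); the class name `BomSkipper` and the method `skipUtf8Bom` are hypothetical and are not taken from the harvest code base.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of harvest: strips a leading UTF-8 BOM if present.
public class BomSkipper {
    // Assumption: only the UTF-8 BOM (EF BB BF) is handled here.
    private static final int[] UTF8_BOM = {0xEF, 0xBB, 0xBF};

    /** Returns a stream positioned just after a leading UTF-8 BOM, if there is one. */
    public static InputStream skipUtf8Bom(InputStream in) throws IOException {
        PushbackInputStream pushback = new PushbackInputStream(in, UTF8_BOM.length);
        byte[] head = new byte[UTF8_BOM.length];
        // readNBytes (Java 9+) keeps reading until the buffer is full or EOF is hit.
        int n = pushback.readNBytes(head, 0, head.length);

        boolean hasBom = (n == UTF8_BOM.length);
        for (int i = 0; hasBom && i < UTF8_BOM.length; i++) {
            hasBom = (head[i] & 0xFF) == UTF8_BOM[i];
        }
        if (!hasBom && n > 0) {
            // Not a BOM: push the bytes back so the caller sees the full content.
            pushback.unread(head, 0, n);
        }
        return pushback;
    }

    public static void main(String[] args) throws IOException {
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, '<', 'a', '/', '>'};
        InputStream in = skipUtf8Bom(new ByteArrayInputStream(withBom));
        System.out.println(new String(in.readAllBytes(), StandardCharsets.UTF_8));
    }
}
```

Wrapping the stream rather than editing the file leaves the source data untouched, and input without a BOM passes through unchanged.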

⚙️ Test Data and/or Report

Automated unit tests below should pass

With the changes, re-harvesting the bundle after several earlier harvests gives the output below. The files must have been ingested on the first run, or they would not be skipped this time:

[SUMMARY] Reading configuration from /home/niessner/Projects/PDS/harvest/src/test/resources/github141.xml
[SUMMARY] Output directory: /tmp/harvest/out
[SUMMARY] Elasticsearch URL: https://elasticsearch:9200, index: registry
[INFO] Connecting to Elasticsearch
[INFO] Loading PDS to ES data type mapping from /home/niessner/Projects/PDS/harvest/target/classes/elastic/data-dic-types.cfg
[INFO] Loading PDS to ES data type mapping from /home/niessner/Projects/PDS/harvest/target/classes/elastic/data-dic-types.cfg
[INFO] Loading PDS to ES data type mapping from /home/niessner/Projects/PDS/harvest/target/classes/elastic/data-dic-types.cfg
[INFO] Processing bundle directory /home/niessner/Projects/PDS/harvest/src/test/resources/github141
[INFO] Processing bundle /home/niessner/Projects/PDS/harvest/src/test/resources/github141/bundle_hausrath_m2020_pixl_naltsos.xml
[WARN] Bundle urn:nasa:pds:hausrath_m2020_pixl_naltsos::1.1 already registered. Skipping.
[INFO] Processing collection /home/niessner/Projects/PDS/harvest/src/test/resources/github141/data/collection_data_inventory.xml
[WARN] Collection urn:nasa:pds:hausrath_m2020_pixl_naltsos:data::2.0 already registered. Skipping.
[INFO] Processing collection /home/niessner/Projects/PDS/harvest/src/test/resources/github141/document/collection_document_inventory.xml
[WARN] Collection urn:nasa:pds:hausrath_m2020_pixl_naltsos:document::2.0 already registered. Skipping.
[INFO] Processing products...
[INFO] Skipping product /home/niessner/Projects/PDS/harvest/src/test/resources/github141/data/oxides_bulksum.xml (LIDVID/LID is not in collection inventory or already exists in registry database)
[INFO] Skipping product /home/niessner/Projects/PDS/harvest/src/test/resources/github141/data/oxides_pmc.xml (LIDVID/LID is not in collection inventory or already exists in registry database)
[INFO] Skipping product /home/niessner/Projects/PDS/harvest/src/test/resources/github141/document/supporting_information.xml (LIDVID/LID is not in collection inventory or already exists in registry database)
[SUMMARY] Summary:
[SUMMARY] Skipped files: 6
[SUMMARY] Loaded files: 0
[SUMMARY] Failed files: 0
[SUMMARY] Package ID: f6023f49-bba5-456f-b121-e132b2d8ceae

♻️ Related Issues

Closes #141

jordanpadams commented 10 months ago

@al-niessner per the Test Automation process here, can we:

al-niessner commented 10 months ago

> @al-niessner per the Test Automation process here, can we:

@jordanpadams

It may take some time (I am not on JPLNet and cannot see all the links). The problem is confined to the one file and is covered by the Java unit tests. Are they not already part of the automated testing? Would it be better to add the Java unit tests rather than this isolated test? The problem is not about shoving data into OpenSearch (Postman is good for that) but only about whether a file is readable, not necessarily whether it is worth processing (so a unit test rather than full processing).

jordanpadams commented 10 months ago

@al-niessner bah. nevermind. you are right. the unit test is a better place for this.

jordanpadams commented 10 months ago

@alexdunnjpl can you give a quick once-over here when you have a chance?

alexdunnjpl commented 10 months ago

@jordanpadams you mean code review, or something else?

jordanpadams commented 10 months ago

@alexdunnjpl yes