NASA-PDS / harvest

Standalone Harvest client application providing the functionality for capturing and indexing product metadata into the PDS Registry system (https://github.com/nasa-pds/registry).
https://nasa-pds.github.io/registry

harvest.log summary does not agree with OpenSearch counts #147

Open plawton-umd opened 9 months ago

plawton-umd commented 9 months ago

Checked for duplicates

Yes - I've already checked

🐛 Describe the bug

When I compared the information in harvest.log with the OpenSearch (OS) query results, I noticed differences.

🕵️ Expected behavior

I expected the "count" after the load to equal the "count" before the load plus the harvest.log's number of "Loaded Files".
Instead, the harvest.log summary reports 150 fewer loaded files than the increase in the OS "count" ( curl -u $REGUSER $OPENSEARCH_URL'/registry/_count?pretty=true' ) indicates.
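
The expected relationship can be written down as a small check. This is only a sketch: the function names are hypothetical, the before/after counts in the example are invented for illustration (only the 150 difference comes from this report), and it assumes each "Loaded File" adds exactly one new document to the index.

```python
import json


def parse_count(response_body: str) -> int:
    """Extract the document count from the JSON body returned by the _count endpoint."""
    return int(json.loads(response_body)["count"])


def unexplained_documents(count_before: int, count_after: int, loaded_files: int) -> int:
    """Return how many documents the index gained beyond what harvest.log reported.

    0 means the log and the index agree. A positive value means the index
    grew by more documents than the log's "Loaded Files" number accounts for.
    """
    return (count_after - count_before) - loaded_files


if __name__ == "__main__":
    # Illustrative numbers only; they are not taken from a real harvest run.
    before = parse_count('{"count": 1000}')
    after = parse_count('{"count": 1650}')
    print(unexplained_documents(before, after, loaded_files=500))
```

A non-zero result is the discrepancy described above.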

📜 To Reproduce

  1. Possibly: run harvest on a data set where some files get skipped?

🖥 Environment Info

🩺 Test Data / Additional context

See above

🦄 Related requirements

Tightly coupled with

⚙️ Engineering Details

N/A

alexdunnjpl commented 4 months ago

This is a shot in the dark, but I don't want to overlook the possibility that it's relevant: if the harvest is hitting errors due to timeouts, insertions can be listed as failures (because the client never received confirmation that they succeeded) and still be ingested (because the server did receive and process them, but was overloaded at the time and took too long to respond).
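
To make the scenario concrete, here is a toy model of it. It is not Harvest or OpenSearch code: `ToyServer`, `harvest`, and the timing values are all hypothetical, and the "timeout" is simulated by comparing numbers rather than by real network calls.

```python
class ToyServer:
    """Stand-in for the registry: always stores the document, but may answer slowly."""

    def __init__(self, response_seconds):
        # Maps doc id -> how long the (simulated) server takes to confirm the insert.
        self.response_seconds = response_seconds
        self.docs = set()

    def insert(self, doc_id):
        self.docs.add(doc_id)  # the insertion is processed regardless of speed
        return self.response_seconds[doc_id]


def harvest(server, doc_ids, timeout_seconds):
    """Return (loaded, failed) as the client would log them."""
    loaded, failed = [], []
    for doc_id in doc_ids:
        elapsed = server.insert(doc_id)
        if elapsed > timeout_seconds:
            # The client gave up before the confirmation arrived, so it logs a failure.
            failed.append(doc_id)
        else:
            loaded.append(doc_id)
    return loaded, failed


if __name__ == "__main__":
    server = ToyServer({"a": 0.1, "b": 5.0, "c": 0.2})
    loaded, failed = harvest(server, ["a", "b", "c"], timeout_seconds=1.0)
    print(len(loaded), len(failed), len(server.docs))
```

In this model the client logs fewer loaded documents than the server actually holds, which is the same direction as the mismatch reported in this issue.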

@plawton-umd if you have any firm sense of whether this is plausible, let me know

plawton-umd commented 4 months ago

@alexdunnjpl No idea. Sometimes in the logs it looks like it