New records harvested in the registry don't have the expected Node value

tloubrieu-jpl commented 2 days ago

Checked for duplicates

Yes - I've already checked

🐛 Describe the bug

In the registry, we can find documents in the *-registry indexes with property ops:Harvest_Info/ops:node_name = geo or en

🕵️ Expected behavior

I expected to have values PDS_GEO, or PDS_ENG.

📜 To Reproduce

Harvest data as described in the manual https://docs.google.com/document/d/12W1DyaRYh4yYnw4p_-qFLQ0kmn_w14Sr/edit

🖥 Environment Info

No response

📚 Version of Software Used

Using harvest v4.0.1

🩺 Test Data / Additional context

The node value expected are:

PDS_ATM - Planetary Data System: Atmospheres Node
- PDS_ENG - Planetary Data System: Engineering Node
- PDS_GEO - Planetary Data System: Geosciences Node
- PDS_IMG - Planetary Data System: Imaging Node
- PDS_NAIF - Planetary Data System: NAIF Node
- PDS_PPI - Planetary Data System: Planetary Plasma Interactions Node
- PDS_RMS - Planetary Data System: Rings Node
- PDS_SBN - Planetary Data System: Small Bodies Node
- PSA - Planetary Science Archive
- JAXA - Japan Aerospace Exploration Agency

🦄 Related requirements

🦄 #187

⚙️ Engineering Details

It sounds related to the change in the code which made the node name in the harvest configuration obsolete.

🎉 Integration & Test

No response

jordanpadams commented 10 hours ago

Created a clear requirement to tie to this as well:

https://github.com/NASA-PDS/harvest/issues/187

al-niessner commented 7 hours ago

@jordanpadams @tloubrieu-jpl

Yes, this is to be expected. As discussed previously, moved the node name to be identified through the registry name. This is done here: https://github.com/NASA-PDS/harvest/blob/ae58da7bc6af0dad6c3ed9d4551e3ec2387711ab/src/main/java/gov/nasa/pds/harvest/cfg/ConfigManager.java#L60-L64

It always returns a name. Do you prefer to throw an error when not found in this table? https://github.com/NASA-PDS/harvest/blob/ae58da7bc6af0dad6c3ed9d4551e3ec2387711ab/src/main/java/gov/nasa/pds/harvest/cfg/ConfigManager.java#L21-L34

As you can see from the function, it is mangling your registry name to get the node name since you are not following the convention you gave me. That convention was used to create the connections here: https://github.com/NASA-PDS/registry-common/tree/main/src/main/resources/connections/cognito

We can do a couple of thing if this is a set of integration test nodes (maybe for some users initial review) but not full deployment nodes (ones that follow the original naming rules):

throw error if not recognized rather than make best guess
add the integration nodes to table above
change the naming rules and update all of the code
use the app://connections/cognito/*.xml instead of your own so that naming conventions are consistent

I would suggest 2 and 4. We can and should add more connections, both cognito and direct, to support more registries. We can then add them to the table above as needed with whatever name is desired.

nutjob4life commented 1 hour ago

Need answering for outstanding questions

tloubrieu-jpl commented 57 minutes ago

Hi @al-niessner ,

Sorry, after taking a bit more time to think of it, I believe you are right, that does not make sense to keep a hardcoded mapping in the code, because we would need to update the harvest code if we add a new discipline node. Although that might never happen that is an unnecessary hassle.

Besides, I am realizing we don't want to redundantly manage that harvesintg node in harvest since we know who the document has been harvested by from the index in which the document is. In the documents it shows as _index = geo-registry (although that might not be searchable)

Unless we have cases where someone would like to ingest into an index, e.g. geo-registry but be authored as a different node, for example PDS_SBN or a more specific entity, for example PDS_GEO_MARS ... I don't believe we need that, but I am leaving that decision to @jordanpadams. If we actually need that, we would go for option 4.

If we confirm we don't need that flexibility of having different harvesting nodes in the same index, I would completely remove this updates from the client side, in harvest, and move it servers ide, in sweeper or by an opensearch configuration (I am not finding which though).

@al-niessner , @alexdunnjpl , @jordanpadams , let me know what are your thoughts.

Thanks,

"

NASA-PDS / harvest