Refactor registry-refs index with generic references field

tloubrieu-jpl commented 2 years ago

💡 Description

Add a references field to the registry index. This field will contain all the lid or lidvid references from a product to other products, so that the 'parents' of a product can be easily searched for. The only relationship that will not be stored is the link from collections to products, since that would make the documents too big.
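
As a rough illustration of the search this would enable, and assuming the new field is named `references` and mapped as `keyword` (both are assumptions, nothing here is implemented yet), finding the parents of a product could reduce to a single term query. The helper name and the example lidvid are hypothetical.

```python
import json


def parents_query(identifier: str, field: str = "references") -> dict:
    """Build a search body matching every document whose `field` holds `identifier`.

    `references` is the field name proposed in this ticket, not an existing one.
    A term query on a keyword field matches when any value stored in the field
    equals the given identifier exactly.
    """
    return {"query": {"term": {field: identifier}}}


if __name__ == "__main__":
    # Hypothetical lidvid, for illustration only.
    print(json.dumps(parents_query("urn:nasa:pds:example_bundle:data::1.0"), indent=2))
```

Because the match is exact, a lid and a lidvid for the same product are distinct values, which is why the field would need to hold whichever form the referencing label uses.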

This aims at solving ticket https://github.com/nasa-pds/registry-api/issues/150

This needs to work on both versions of harvest: https://github.com/NASA-PDS/harvest/ and https://github.com/NASA-PDS/registry-harvest-service

tloubrieu-jpl commented 2 years ago

@al-niessner the repositories are listed in the description of this ticket.

There are 2 implementations of harvest, hence 2 repos. They share a common library in the repository https://github.com/NASA-PDS/registry-common/

al-niessner commented 2 years ago

@tloubrieu-jpl @jordanpadams

How much do we know about the design of harvest? After spending two days chewing through code and trying to figure out how to add references, I am still a bit lost. Is there a document that describes the processing flow? It seems to read a bunch of XML files, and a mix of filenames and root tag is used to process them as either bundle, collection, or product. It is there, in harvest.crawler, where the ref_lid_collection and ref_lidvid_collection are filled, but they are not directly called that yet, so some uncertainty remains. When all the data is loaded, it is written back out to JSON files via registry-common.

What happens to the JSON files? What hunk of code actually pushes it into opensearch? There is only one PUT in all the code and it is for _mapping and there are no PUSHes. It seems like registry-common sets up the mappings then something else pushes the documents, but what?

I searched for ref_lid_collection and ref_lidvid_collection; they appear in registry-common, but only being read from a JSON file, not written and not placed in a mapping.

So now I am looking for a more conceptual description of what is desired and what is implemented. Why go from XML to JSON rather than directly to opensearch? Is it to support the clusters? Does the JSON have a predefined layout or is it all ad hoc and everything is built around it? Then there are all of the ref_lid items too, like ref_lid_target and ref_lid_document, which are processed by the registry-common extractors.

Rather than filling up the documents with references, maybe it would be better to update registry-refs instead with something like owner, product_class, and references? Maybe even yet another index?

tloubrieu-jpl commented 2 years ago

Hi @al-niessner ,

I don't think we have a conceptual design of harvest.

We are going from XML to JSON, because in the past, harvest was generating a JSON file which was later loaded into opensearch/elasticsearch by a different tool called registry-manager. I don't remember why but that should be the reason for the intermediate JSON. We don't do that anymore but Eugene did not rewrite the code following that change.

I can not answer your other questions, but to start with, since we don't understand the full architecture, I would go for minimalistic changes to stay on the safe side until we understand better.

Thanks

al-niessner commented 2 years ago

@tloubrieu-jpl

I am going with no changes for now... Funny enough, I think updating registry-refs would be smaller than updating registry. Still trying to work it all out. My next approach is to run harvest while in eclipse so that I can watch harvest at work. Any advice/suggestions?

al-niessner commented 2 years ago

@tloubrieu-jpl @jordanpadams

Running harvest in eclipse was the winner. We can fix NASA-PDS/registry-api#150 with a two line change to registry-mgr:src/main/resources/elastic/registry.json -- well, it fixes the testing system but not the pds gamma unless that file is used with pds gamma. Probably would mean a full reload of pds gamma too.

So, the question is: do we fix that one file and call it good until the next crisis, which we are setting up with this fix, or do we keep going with a much longer-term fix?

jordanpadams commented 2 years ago

@al-niessner adding @jimmie and @ramesh-maddegoda to this, since @tloubrieu-jpl is heading out on leave at the end of the week.

it sounds like if we need to reload the data, we should do so in place by creating another index and moving the data over. in operations, there is no way we can reload the actual data, so we might as well get used to reloading using other methodologies.
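
A minimal sketch of that in-place approach, assuming the standard `_reindex` API that both Elasticsearch and OpenSearch provide. The helper name and the index names in the usage example are illustrative only, and the sketch just builds the requests rather than sending them to a cluster.

```python
import json


def reindex_plan(old_index: str, new_index: str, new_mappings: dict) -> list:
    """Return the (method, path, body) requests for moving data to a remapped index."""
    return [
        # 1. Create the destination index with the corrected mappings.
        ("PUT", f"/{new_index}", {"mappings": new_mappings}),
        # 2. Copy every document from the old index into the new one.
        ("POST", "/_reindex",
         {"source": {"index": old_index}, "dest": {"index": new_index}}),
    ]


if __name__ == "__main__":
    # Hypothetical index names and mapping, for illustration only.
    mappings = {"properties": {"product_lidvid": {"type": "keyword"}}}
    for method, path, body in reindex_plan("registry-refs", "registry-refs-v2", mappings):
        print(method, path, json.dumps(body))
```

Pointing clients at the new index afterwards (for example through an alias) is left out here, since how that would be done in operations has not been decided in this thread.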

it looks like we have a start for that here: https://linuxhint.com/change-field-type-elasticsearch/

are you using gamma for your testing? also, do we have another registry/API deployed through the continuous deployment @nutjob4life has set up with Jenkins?

tloubrieu-jpl commented 2 years ago

@al-niessner sorry for this late notice but indeed @ramesh-maddegoda fixed a bug lately on harvest, so you will be able to help each other on this tool. Thanks.

al-niessner commented 2 years ago

@jordanpadams @tloubrieu-jpl @jimmie

While NASA-PDS/registry-mgr#50 addresses the immediate needs of NASA-PDS/registry-api#150, it is a short-term fix that sets up a bigger problem later but gives time to fix either registry or registry-refs for the long term.

The initial idea for a long-term fix was to add references to the main document. After reviewing harvest and other tools, it seems that adding references to the main document is not the best idea. It seems better to add values to the registry-refs index instead.

The current registry-refs looks like:

"properties" :
{
  "_package_id" : {"type" : "keyword"},
  "batch_id" : {"type" : "integer"},
  "batch_size" : {"type" : "integer"},
  "collection_lid" : {"type" : "keyword"},
  "collection_lidvid" : {"type" : "keyword"},
  "collection_vid" : {"type" : "float"},
  "product_lid" : {"type" : "keyword"},
  "product_lidvid" : {"type" : "keyword"},
  "reference_type" : {"type" : "keyword"}
}

From the perspective of the NASA-PDS/registry-api the most usable and desirable document would be:

"properties" :
{
  "owner_lid" : {"type" : "keyword"},
  "owner_lidvid" : {"type" : "keyword"},
  "owner_type" : {"type" : "keyword"},
  "references" : {"type" : "keyword"}
}

where owner_type is the product class and references is all lids and lidvids directly referenced by the owner. These can be added to registry-refs, which seems most sensible, or put in their own index. If the index were named referencing or link, then change owner to source and references to targets, to ape the Unix link command.
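
To make the proposed layout concrete, here is a small sketch of how one such document might be assembled. It assumes only the four proposed fields above plus the usual lidvid form of a lid followed by `::` and a version id; the function name and all identifiers in the example are hypothetical.

```python
import json


def build_reference_doc(owner_lidvid: str, owner_type: str, referenced: list) -> dict:
    """Assemble a document in the proposed owner/references layout.

    A lidvid is a lid followed by '::' and a version id, so the owner lid is
    everything before the '::' separator.
    """
    owner_lid = owner_lidvid.split("::", 1)[0]
    # Keep first-seen order while dropping duplicate references.
    references = list(dict.fromkeys(referenced))
    return {
        "owner_lid": owner_lid,
        "owner_lidvid": owner_lidvid,
        "owner_type": owner_type,
        "references": references,
    }


if __name__ == "__main__":
    # Hypothetical identifiers, for illustration only.
    doc = build_reference_doc(
        "urn:nasa:pds:example:data:prod1::1.0",
        "Product_Observational",
        [
            "urn:nasa:pds:context:target:planet.mars",
            "urn:nasa:pds:example:document:guide::2.0",
            "urn:nasa:pds:context:target:planet.mars",
        ],
    )
    print(json.dumps(doc, indent=2))
```

Mixing lids and lidvids in a single `references` array keeps the mapping to one keyword field, at the cost of the caller needing to know which form to search for.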

NASA-PDS / registry

Refactor registry-refs index with generic references field #81
