Add secondary references to metadata

tdddblog commented 3 years ago

Added secondary reference fields in 'registry-refs' index.
Automatically assign data type 'keyword' to reference fields 'ref_lid_xxx' and 'ref_lidvid_xxx' if field definition is missing.
Added pds:Observation_Area/pds:Discipline_Area definition
Better error message for 'Could not find datatype...' error
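
The fallback rule for reference fields can be sketched roughly as follows. This is an illustration only, not the registry-mgr implementation: the function name `resolve_datatype`, the `definitions` dictionary, the example field names, and the exact error wording are all assumptions.

```python
# Prefixes of reference fields that default to 'keyword' (taken from the PR description).
REF_PREFIXES = ("ref_lid_", "ref_lidvid_")


def resolve_datatype(field_name, definitions):
    """Return the data type for a field, defaulting reference fields to 'keyword'.

    `definitions` is a hypothetical mapping of field name -> data type.
    """
    if field_name in definitions:
        return definitions[field_name]
    if field_name.startswith(REF_PREFIXES):
        # No definition found, but the name marks it as a LID/LIDVID reference.
        return "keyword"
    # Hypothetical wording for the improved error message.
    raise LookupError(f"Could not find datatype for field '{field_name}'")
```

With this sketch, `resolve_datatype("ref_lid_instrument", {})` falls back to `keyword`, while an unknown non-reference field still raises an error that names the offending field.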

This PR relates directly to https://github.com/NASA-PDS/harvest/pull/47

resolves https://github.com/NASA-PDS/registry/issues/108 resolves https://github.com/NASA-PDS/registry/issues/109

al-niessner commented 3 years ago

@tloubrieu-jpl @jordanpadams

Jumping into the middle means I have a lot to learn. Why add secondary_product_lid? How do these differ (in terms of search) from the product_lid that already exists?

These complicate future searches in the sense that if I know I want product lidvid now I have to check two lists rather than one. Are we then going to add tertiary_product_lid* in the future?

I would think a design more robust to future redirection is: "products":[{"ref":lid or lidvid, "relationship":primary or secondary or tertiary or negative or inverted or plasterboard}] so that none of the code that does a search based on ref needs to change. Even better would be to have "ancestors":[{"ref":lid or lidvid, "relationship":primary/secondary, "type":bundle/collection/product}] and then do the same for "offspring":[]. You will want a tool that can then traverse the entire database and make sure ancestors <-> offspring agree. In other words, do that age-old, tiresome doubly linked list. When searching these, I can say relationship=secondary and type=bundle rather than having to know which list to search (product_secondary, product, bundle, or bundle_secondary). When things change in the future, we will still need a reference, relationship, and type, even though any one of those may be new.
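
A minimal sketch of the record shape proposed above, assuming plain dictionaries keyed by document id. The helper names `find_refs` and `one_sided_links` and the short ids used with them are hypothetical; nothing like this exists in the PR.

```python
def find_refs(refs, relationship=None, ref_type=None):
    """Filter reference records by relationship and/or type."""
    return [
        r for r in refs
        if (relationship is None or r["relationship"] == relationship)
        and (ref_type is None or r["type"] == ref_type)
    ]


def one_sided_links(docs):
    """Return (ancestor, offspring) links that are recorded on only one side.

    `docs` maps a document id to {"ancestors": [...], "offspring": [...]},
    where each entry is {"ref": ..., "relationship": ..., "type": ...}.
    """
    # Links as seen from the parent: parent lists the child as offspring.
    down = {(parent, r["ref"])
            for parent, d in docs.items() for r in d.get("offspring", [])}
    # Links as seen from the child: child lists the parent as ancestor.
    up = {(r["ref"], child)
          for child, d in docs.items() for r in d.get("ancestors", [])}
    # The symmetric difference is every link missing its counterpart.
    return sorted(down ^ up)
```

A search such as `find_refs(refs, relationship="secondary", ref_type="bundle")` then needs no knowledge of which list holds the reference, and `one_sided_links` is the agreement check between ancestors and offspring.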

That brings me back to lid and lidvid. Both of those are problematic for the same reason. I just got done doing the code changes for finding the latest version when the vid is missing from the lidvid. We could have made Elasticsearch do more of the work if lidvid was not all one name. We could then ask ES to return the sorted greatest or least value rather than doing it in Java code.

Again, I do not know what all of the constraints or pressures are, and maybe my suggestions are not feasible, but it looks like brittleness and fragility are being built into the system. Maybe the breakout on Tuesday?

jordanpadams commented 3 years ago

@al-niessner great questions. see below.

@tloubrieu-jpl @jordanpadams

Jumping into the middle means I have a lot to learn. Why add secondary_product_lid? How do these differ (in terms of search) from the product_lid that already exists?

Per the PDS4 Standards Reference (SR) section 2A.4:

2A.4 Primary and Secondary Members Basic products may be either primary or secondary members of their respective collections. A primary member is one that is being registered with PDS for the first time. A secondary member is one which is already registered with PDS, but which is now associated with an additional collection. A product’s member status (primary or secondary) is based on its first association with a collection. Although the product may be omitted from a later version of the collection, it retains its primary or secondary member status through all subsequent versions of the collection based on its initial association. In a similar way, collections are categorized as having either primary or secondary ‘member status’ in their bundles.

One way to think of this is a primary product exists in its "primary" location as a child of a collection, but other collections can "symlink" to its primary location. Before this update, we only captured the primary members of collections/bundles, and not secondary. From a system perspective and building services on top of the registry, we want to know those secondary members as well.

These complicate future searches in the sense that if I know I want product lidvid now I have to check two lists rather than one. Are we then going to add tertiary_product_lid* in the future?

No such thing as tertiary, but I get your point. Do we think it makes more sense to just have all of them as product_lid, but then capture the primary vs secondary information in some other fashion?

I would think a design more robust to future redirection is: "products":[{"ref":lid or lidvid, "relationship":primary or secondary or tertiary or negative or inverted or plasterboard}] so that none of the code that does a search based on ref needs to change. Even better would be to have "ancestors":[{"ref":lid or lidvid, "relationship":primary/secondary, "type":bundle/collection/product}] and then do the same for "offspring":[]. You will want a tool that can then traverse the entire database and make sure ancestors <-> offspring agree. In other words, do that age-old, tiresome doubly linked list. When searching these, I can say relationship=secondary and type=bundle rather than having to know which list to search (product_secondary, product, bundle, or bundle_secondary). When things change in the future, we will still need a reference, relationship, and type, even though any one of those may be new.

@al-niessner interesting point. @tdddblog thoughts on this approach?

That brings me back to lid and lidvid. Both of those are problematic for the same reason. I just got done doing the code changes for finding the latest version when the vid is missing from the lidvid. We could have made Elasticsearch do more of the work if lidvid was not all one name. We could then ask ES to return the sorted greatest or least value rather than doing it in Java code.

I imagine this is because we are just ingesting whatever is in the collection inventory files (2-column tables that contain the primary/secondary designation and the LID/LIDVID), but I imagine we could split this out at ingest time.
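
As a rough illustration of splitting this out at ingest time, the sketch below parses inventory rows of the shape described above (member status, then LID or LIDVID) and separates the LID from the VID on the `::` delimiter. The function name, the output keys, and the sample identifiers are assumptions made for the example; this is not what Harvest currently does.

```python
import csv
import io


def parse_inventory(text):
    """Parse 2-column inventory rows into records with separate lid and vid."""
    records = []
    for row in csv.reader(io.StringIO(text)):
        if len(row) < 2:
            continue  # skip blank or malformed rows
        status, ref = row[0].strip(), row[1].strip()
        # A LIDVID is "<lid>::<vid>"; a bare LID has no "::" part.
        lid, _, vid = ref.partition("::")
        records.append({"member_status": status, "lid": lid, "vid": vid or None})
    return records
```

Rows that carry only a LID end up with `vid` set to `None`, which makes the "version missing" case explicit rather than something to detect by string inspection later.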

Again, I do not know what all of the constraints or pressures are, and maybe my suggestions are not feasible, but it looks like brittleness and fragility are being built into the system. Maybe the breakout on Tuesday?

copy. we should talk about this some more with @tdddblog and consider some of these other options.

tdddblog commented 3 years ago

I would think a design more robust to future redirection is: "products":[{"ref":lid or lidvid, "relationship":primary or secondary or tertiary or negative or inverted or plasterboard}] so that none of the code that does a search based on ref needs to change. Even better would be to have "ancestors":[{"ref":lid or lidvid, "relationship":primary/secondary, "type":bundle/collection/product}] and then do the same for "offspring":[]. You will want a tool that can then traverse the entire database and make sure ancestors <-> offspring agree. In other words, do that age-old, tiresome doubly linked list. When searching these, I can say relationship=secondary and type=bundle rather than having to know which list to search (product_secondary, product, bundle, or bundle_secondary). When things change in the future, we will still need a reference, relationship, and type, even though any one of those may be new.

Nobody traverses entire databases these days, especially distributed databases.
Lists of references in Elasticsearch are used to optimize write and search performance. You don't want to create a new ES document for each reference.
API should be able to translate "relationship=secondary and type=bundle" query into Elasticsearch query.
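
One way such a translation layer might look, as a sketch only. The field-naming scheme (`product_lidvid`, `secondary_product_lidvid`, and so on) is an assumption modeled on the `secondary_product_lid` name mentioned earlier in the thread, and `to_es_query` is a hypothetical helper, not part of any existing API.

```python
RELATIONSHIPS = ("primary", "secondary")
REF_TYPES = ("bundle", "collection", "product")


def to_es_query(relationship, ref_type, lidvid):
    """Translate relationship/type parameters into an Elasticsearch term query."""
    if relationship not in RELATIONSHIPS:
        raise ValueError(f"unknown relationship: {relationship!r}")
    if ref_type not in REF_TYPES:
        raise ValueError(f"unknown type: {ref_type!r}")
    # Assumed naming scheme: secondary references carry a "secondary_" prefix.
    prefix = "secondary_" if relationship == "secondary" else ""
    field = f"{prefix}{ref_type}_lidvid"
    return {"query": {"term": {field: lidvid}}}
```

The caller only ever supplies `relationship` and `type`; which index field that maps to stays an internal detail that can change without touching client code.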

That brings me back to lid and lidvid. Both of those are problematic for the same reason. I just got done doing the code changes for finding the latest version when the vid is missing from the lidvid. We could have made Elasticsearch do more of the work if lidvid was not all one name. We could then ask ES to return the sorted greatest or least value rather than doing it in Java code.

Registry index has lidvid, lid and vid fields for all products. @al-niessner you can query ES to get the latest version of a product.
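
A sketch of what such a query body could look like, built as a Python dictionary. It uses the `lid` and `vid` field names mentioned above; the helper name is hypothetical. Whether the descending sort really returns the latest version depends on how `vid` is mapped in the index, which this thread does not state: a plain string sort, for instance, would rank 9.0 above 10.0.

```python
def latest_version_query(lid):
    """Build a search body asking for a single hit for `lid`, highest vid first."""
    return {
        "query": {"term": {"lid": lid}},       # match all versions of this LID
        "sort": [{"vid": {"order": "desc"}}],  # assumes vid sorts in version order
        "size": 1,                             # keep only the top hit
    }
```

The resulting dictionary can be passed as the body of a search request against the registry index.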

tdddblog commented 3 years ago

I did some refactoring of registry-refs ES index:

I added "reference_type" field which could take "P" or "S" values.
Primary and secondary references are stored in different ES documents.
ES doc ids have the following naming convention <collection_lidvid>::<S|P><batch_id>, for example, urn:nasa:pds:orex.spice:spice_kernels::8.0::P1, urn:nasa:pds:orex.spice:spice_kernels::8.0::S1, urn:nasa:pds:orex.spice:spice_kernels::8.0::S2, etc.
Primary and secondary references use the same field names, e.g., product_lidvid
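
The naming convention can be illustrated with a small build/parse pair. This is a sketch, not the registry-mgr code: the helper names are hypothetical, and it assumes the batch id is a decimal integer, as in the examples above.

```python
import re

# <collection_lidvid>::<S|P><batch_id>
DOC_ID = re.compile(
    r"^(?P<collection_lidvid>.+)::(?P<reference_type>[PS])(?P<batch_id>\d+)$"
)


def make_doc_id(collection_lidvid, reference_type, batch_id):
    """Compose a registry-refs document id from its three parts."""
    if reference_type not in ("P", "S"):
        raise ValueError("reference_type must be 'P' or 'S'")
    return f"{collection_lidvid}::{reference_type}{batch_id}"


def parse_doc_id(doc_id):
    """Split a document id back into (collection_lidvid, reference_type, batch_id)."""
    match = DOC_ID.match(doc_id)
    if match is None:
        raise ValueError(f"not a registry-refs document id: {doc_id!r}")
    return (
        match["collection_lidvid"],
        match["reference_type"],
        int(match["batch_id"]),
    )
```

For example, `make_doc_id("urn:nasa:pds:orex.spice:spice_kernels::8.0", "S", 2)` yields the third id listed above, and `parse_doc_id` recovers the collection LIDVID, the reference type, and the batch number from it.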

al-niessner commented 3 years ago

@tdddblog Sorry, but I am a bit lost again (still?).

Why the indexing (batch id) on the P and S? Are they just the index in the list if we start P at 0 instead of 1? Are the S's just one-ups, as in S1, S2, S3, ..., or can they be S1, S2, S2, S2, S3, ...?

Independent of the actual numbering details, my real question/point is why? Do we really think that @tloubrieu-jpl is going to make a secondary link today so that it gets assigned S2, then 5 years from now @jordanpadams is going to ask me to find the link added back at the end of March 2021? First I would have to find out that @tloubrieu-jpl added it, then ask what batch number it was, if he remembers. Now, I made it 5 years to make a point, but realistically, is the memory of the batch number 1 minute, 1 hour, 1 day, 1 week, 1 month, ...?

al-niessner commented 3 years ago

I also still do not understand why we care if it is P or S at all.

Let's start with fictional collections snafu and fu. Now, let's say @jordanpadams relays to @tloubrieu-jpl that the fictional product bar needs to be in both collections snafu and fu. They then assign the ticket to me. I also have to add 10 other products to snafu, so I just put bar there as well and link it to fu.

Are we really saying that we would have to change the primary to fu because it being in snafu somehow makes it wrong -- beyond the arbitrary or aesthetic? When the user searches for product bar and says, ah there it is under fu, they are going to then reject it because it is not the primary (P)? Can the end user even tell? Should the end user be able to tell?

jordanpadams commented 3 years ago

@al-niessner let's chat about this some more at the breakout today. I understand where you're coming from here, but knowing whether or not a product is primary vs secondary may come into play at some point in the services wrapping the registry. A more comprehensive schema for tracking this info may be beneficial down the road, but there may also be performance implications if we change the schema that extensively. We can also just abstract this away through the API for the time being and come back to it down the road if we determine we really need to track this info differently for some use case Y.

NASA-PDS / registry-mgr

Add secondary references to metadata #21