NASA-PDS / registry-api

Web API service for the PDS Registry, providing the implementation of the PDS Search API (https://github.com/nasa-pds/pds-api) for the PDS Registry.
https://nasa-pds.github.io/pds-api
Apache License 2.0

As a user, I want to query only the latest versions of products unless explicitly requested #441

Closed jordanpadams closed 1 year ago

jordanpadams commented 1 year ago

💪 Motivation

...so that I can only get back the products of most relevance. We do not want to return superseded products unless they are explicitly requested.

📖 Additional Details

Related requirement:

Motivation

...so that I can get the latest version of a product only, ignoring the superseded versions of the product. We do not want to return superseded data unless specifically requested, e.g.:

/products/{lidvid}           # request for a specific version of a product
/products/{identifier}/all   # request for all versions of a product

Additional Details

Related requirement: https://github.com/NASA-PDS/registry-api/issues/449

Acceptance Criteria

Given a registry populated with multiple versions of a product, when I perform a query against /products?q=something, then I expect only the latest version of the matching products to be returned

Given a registry populated with multiple versions of a collection, when I perform any query against the classes/collection endpoint, then I expect only the latest version of the expected matching collections to be returned

Given a registry populated with multiple versions of a bundle, when I perform any query against the classes/bundle endpoint, then I expect only the latest version of the expected matching bundles to be returned

⚙️ Engineering Details

al-niessner commented 1 year ago

@jimmie @jordanpadams @tloubrieu-jpl

Per the breakout discussion, I have been reviewing the idea of a "latest" registry/index to make this request possible.

  1. develop a script that uses opensearch commands to generate a subset index from the whole dataset
  2. modify NASA-PDS/registry-api to use the latest index for queries unless the endpoint contains /all
  3. the script will have to be run after any data insertions; use aliases too, so that the reindex can happen while the current index is still in use, then switch over when done for the shortest possible "down time"

The potential snag in all this is that we may not be able to convince opensearch to find the unique lid with the largest vid. It can easily do matches and return all of them, but not sets of unique values. The vid is a quasi-float saved as a string, because 1.2.3 is not really a float. In string ordering, 10.0 is less than 7.0. Annoying, but strings are not numbers. Developing a way to determine the greatest vid may be difficult.

If we cannot get opensearch to do these things (it is very likely that opensearch cannot be altered to that extent), then we will need a python script or similar to do the work instead. That will likely be slower and require a good bit of resources, but it is straightforward.
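For illustration only, a minimal sketch of how such a script could order VIDs correctly by treating them as integer tuples rather than strings or floats (the lidvid values are made up):

```python
# Sketch: order lidvids correctly by parsing the VID into an integer tuple.
# String sort puts "10.0" before "7.0"; float parsing makes 1.10 equal 1.1;
# integer tuples avoid both problems.

def vid_key(vid: str) -> tuple:
    """Convert a VID like '7.10' into (7, 10) so it sorts numerically."""
    return tuple(int(part) for part in vid.split("."))

def latest_lidvid(lidvids: list) -> str:
    """Return the lidvid with the greatest VID for a single LID."""
    return max(lidvids, key=lambda lidvid: vid_key(lidvid.split("::")[1]))

# The string-sorted "maximum" here would wrongly be ...::7.0
print(latest_lidvid([
    "urn:nasa:pds:example::1.0",
    "urn:nasa:pds:example::7.0",
    "urn:nasa:pds:example::10.0",
]))  # -> urn:nasa:pds:example::10.0
```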

jordanpadams commented 1 year ago

@al-niessner this plan sounds great!

the script will have to be run after any data insertions; use aliases too, so that the reindex can happen while the current index is still in use, then switch over when done for the shortest possible "down time"

doesn't the managed OpenSearch/Elastic handle this under the hood? we don't do any aliasing now, but when we load data it reindexes, and there is no known downtime (maybe we are not watching that closely). as for executing the script every time we perform data insertions, I wonder if this would be very costly? we may want to pick a nominal cadence to run this at.

The potential snag in all this is that we may not be able to convince opensearch to find the unique lid with the largest vid. It can easily do matches and return all of them, but not sets of unique values. The vid is a quasi-float saved as a string, because 1.2.3 is not really a float.

so we can actually take advantage of what the standard documentation requires for VID, regardless of the type defined:

Bundle, collection, and product version identifiers are issued sequentially, although not all version numbers may appear in an archive. M denotes a major version and n denotes a minor version. M is initialized to “1” for archives, but “0” may be used for samples and tests; n is initialized to “0”.

so we can create a new custom field in the registry specifically for this purpose (e.g. _version_number) and use that to key the subset query.

thoughts?

al-niessner commented 1 year ago

@jimmie @jordanpadams @tloubrieu-jpl

the script will have to be run after any data insertions; use aliases too, so that the reindex can happen while the current index is still in use, then switch over when done for the shortest possible "down time"

doesn't the managed OpenSearch/Elastic handle this under the hood? we don't do any aliasing now, but when we load data it reindexes, and there is no known downtime (maybe we are not watching that closely). as for executing the script every time we perform data insertions, I wonder if this would be very costly? we may want to pick a nominal cadence to run this at.

You can add documents (records in old db language) without a hiccup using open/elastic search. You cannot reindex in place without disruption. Elastic/open search added the alias feature to minimize this disruption and, probably, for other good reasons I have not discovered yet. It allows you to have an index r1 but call it registry, where registry is the alias. You could then add items to a new index with the same documents and have that reindex become r2. When done, change registry to point at r2 and you have an instantaneous change. That is very different from adding a document, which adds new indexing but does not reindex all of the documents. In the document insertion case, they can probably push the work under the hood, but in the case of reindexing they require manual intervention to hide the overhead.
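To illustrate that alias switch, a rough sketch using the standard _aliases endpoint (the index names registry-r1/registry-r2, endpoint, and credentials are hypothetical placeholders):

```python
import requests

BASE = "https://localhost:9200"   # placeholder cluster endpoint
AUTH = ("admin", "admin")         # placeholder credentials

# Atomically repoint the 'registry' alias from the old index to the freshly
# rebuilt one; readers keep querying 'registry' and never see the switch.
actions = {
    "actions": [
        {"remove": {"index": "registry-r1", "alias": "registry"}},
        {"add": {"index": "registry-r2", "alias": "registry"}},
    ]
}
resp = requests.post(f"{BASE}/_aliases", json=actions, auth=AUTH, verify=False)
resp.raise_for_status()
print(resp.json())  # {'acknowledged': True} on success
```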

The potential snag in all this is that we may not be able to convince opensearch to find the unique lid with the largest vid. It can easily do matches and return all of them, but not sets of unique values. The vid is a quasi-float saved as a string, because 1.2.3 is not really a float.

so we can actually take advantage of what the standard documentation requires for VID, regardless of the type defined:

Of course we can take advantage of the VID. Outside of opensearch we can split the parts and sort them independently as integers from left to right. Inside opensearch it is a bit more complicated and restricted. The question is not whether we can take advantage of it, but how, and what is most effective.

Bundle, collection, and product version identifiers are issued sequentially, although not all version numbers may appear in an archive. M denotes a major version and n denotes a minor version. M is initialized to “1” for archives, but “0” may be used for samples and tests; n is initialized to “0”.

so we can create a new custom field in the registry specifically for this purpose (e.g. _version_number) and use that to key the subset query.

thoughts?

If we want opensearch to do all the work, then we will have to add a custom field that is an array of integers and sort them left to right. For instance, 7.1 and 7.10 are the same float to a computer, so they need to be [7,1] and [7,10] and then sorted accordingly.

I am not sure how much faster it would be to have opensearch do the work than to have a script do it. The advantage of the script is that we can do what we have to at that moment and not fuss with making opensearch do something it is not designed to do directly. Once the script is working, it would be easier to decide whether opensearch can do it directly. It would also be testable, search vs script.

jordanpadams commented 1 year ago

In the document insertion case, they can probably push the work under the hood, but in the case of reindexing they require manual intervention to hide the overhead.

well let's test it out and see what happens. having to swap between OpenSearch instances seems awful and very painful and something I feel like the AWS-managed OpenSearch is supposed to handle.

If we want opensearch to do all the work, then we will have to add a custom field that is an array of integers and sort them left to right. For instance, 7.1 and 7.10 are the same float to a computer, so they need to be [7,1] and [7,10] and then sorted accordingly.

that is where I am thinking we should try first. we could also do 2 separate values (_version_major and _version_minor).
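A sketch of what a "latest first" query could look like if hypothetical _version_major/_version_minor integer fields existed in the index (the field names, index name, endpoint, and credentials here are all placeholders, not the actual registry schema):

```python
import requests

# Hypothetical: each registry document carries integer fields
# _version_major/_version_minor split out of its vid. Sorting on the two
# fields, left to right, gives correct version order (7.10 after 7.2),
# which a plain string sort on vid cannot do.
query = {
    "query": {"match": {"lid": "urn:nasa:pds:example"}},
    "sort": [
        {"_version_major": {"order": "desc"}},
        {"_version_minor": {"order": "desc"}},
    ],
    "size": 1,  # only the newest version of that lid
}
resp = requests.post(
    "https://localhost:9200/registry/_search",  # placeholder endpoint/index
    json=query, auth=("admin", "admin"), verify=False,
)
print(resp.json()["hits"]["hits"])
```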

I am not sure how much faster it would be to have OpenSearch do the work than to have a script do it.

Yea not sure either, but just seems unlikely for a Python script to be more performant than an inherent OpenSearch capability, but I may be wrong. My biggest concern is when we get to the 5 million document mark, then what does this look like. And where will this python script run? Could we do this in Lambda instead?

I am thinking using OpenSearch out of the box seems like the lowest hanging fruit, and the easiest implementation. If it becomes too slow, we develop a better option.

Thoughts?

al-niessner commented 1 year ago

@jimmie @jordanpadams @tloubrieu-jpl

In the document insertion case, they can probably push the work under the hood, but in the case of reindexing they require manual intervention to hide the overhead.

well let's test it out and see what happens. having to swap between OpenSearch instances seems awful and very painful and something I feel like the AWS-managed OpenSearch is supposed to handle.

It does not require multiple opensearch instances. To reindex, you just create yet another index (any opensearch instance allows unlimited arbitrarily named indices), and the alias allows a quick switch between indexes once one has been reindexed (meaning it has the new index fields).

If we want opensearch to do all the work, then we will have to add a custom field that is an array of integers and sort them left to right. For instance, 7.1 and 7.10 are the same float to a computer, so they need to be [7,1] and [7,10] and then sorted accordingly.

that is where I am thinking we should try first. we could also do 2 separate values (_version_major and _version_minor).

We can do this, but it will require the live system (registry) to be reindexed to include _version_major and _version_minor so they can be searched; you cannot search on values that are not indexed. Without the help of aliasing registry, this is going to cost you a lot: read all documents, break apart the VID, add the new fields, and insert into a new index. Hint: use the script to find the snags.
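A rough sketch of that "read all documents, split the VID, add the fields" pass, assuming the opensearch-py client and that documents carry a vid field (the index name, new field names, endpoint, credentials, and keying are placeholders/assumptions, not the actual registry layout):

```python
from opensearchpy import OpenSearch, helpers

client = OpenSearch(hosts=["https://localhost:9200"],        # placeholder
                    http_auth=("admin", "admin"), verify_certs=False)

def add_version_fields(index="registry"):
    """Scroll every document, split its vid, and bulk-update the new fields."""
    actions = []
    query = {"query": {"match_all": {}}, "_source": ["vid"]}
    for hit in helpers.scan(client, index=index, query=query):
        major, _, minor = hit["_source"]["vid"].partition(".")
        actions.append({
            "_op_type": "update",
            "_index": index,
            "_id": hit["_id"],
            "doc": {"_version_major": int(major),
                    "_version_minor": int(minor or 0)},
        })
        if len(actions) >= 5000:      # keep each bulk request a modest size
            helpers.bulk(client, actions)
            actions.clear()
    if actions:
        helpers.bulk(client, actions)

add_version_fields()
```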

I am not sure how much faster it would be to have OpenSearch do the work than to have a script do it.

Yea not sure either, but just seems unlikely for a Python script to be more performant than an inherent OpenSearch capability, but I may be wrong. My biggest concern is when we get to the 5 million document mark, then what does this look like. And where will this python script run? Could we do this in Lambda instead?

I am thinking using OpenSearch out of the box seems like the lowest hanging fruit, and the easiest implementation. If it becomes too slow, we develop a better option.

Thoughts?

I doubt it can be a Lambda because it will take too long to run. Until we understand the full process, opensearch is not the lowest fruit. For working with the live system and being the most efficient, opensearch is likely the better choice. We may have to use the opensearch painless scripting language, which is more of a learning curve too.

Outside the test data set, we will need to get opensearch to break apart the VID using painless and reindex. We will also have to update harvest to produce _version_major and _version_minor going forward. Keeping the VID as is and handling it in a script (painless or python) during the reindex may be simpler, really.

Lastly, python is the simplest and surely, though not certainly, the slowest. When I say use python, I really mean make opensearch, via its https interface, do the maximum work, then use python to do whatever is not straightforward in opensearch on the first pass.

jordanpadams commented 1 year ago

@al-niessner status: working on developing test data and how to efficiently re-index

tloubrieu-jpl commented 1 year ago

@al-niessner created a new index but that does not scale on the production database.

tloubrieu-jpl commented 1 year ago

@al-niessner will propose a new design for handling the latest products

al-niessner commented 1 year ago

@jordanpadams @tloubrieu-jpl

Plan A

Idea

Use opensearch reindex functionality to build a latest index that would be regenerated after every harvest run. This keeps the harvest tool separate (do not fix it if it is not broken) and the work would be done outside the view of PDS users. It could be the most efficient option because it would all happen inside opensearch. A possible snag is that reindex scripting may not be sufficient to determine the latest vid during reindex processing.

Why it failed

Simply put, the reindex task is not scalable. We could use an opensearch script to do the reindex if it were simple, but having to collect similar lids and compare vids is not possible in a reindex script (it cannot collect all vids for a lid; it has to operate only on the given document). When the attempt was made to do the aggregation over all lids on a partial operations system, it quickly ran out of resources. The lack of resources comes in two parts: one, there are not a lot of vids for any given lid, meaning lots and lots of buckets (the aggregation container) are needed; two, all lidvids have to be examined every time, which only takes more time as more data is added.
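For reference, the general shape of the aggregation that was attempted (a sketch only; the real query differed, and the field name, index, endpoint, and credentials are assumptions), showing why it blows up on a flat dataset:

```python
import requests

# One bucket per lid, then one top hit per bucket. With ~6 million nearly
# unique lids this needs millions of buckets, far beyond the default bucket
# limit (search.max_buckets), and it re-scans every lidvid on every run.
# Sorting the top hit by the string vid would also be wrong anyway, per the
# 1.10-before-1.2 problem discussed above.
query = {
    "size": 0,
    "aggs": {
        "by_lid": {
            "terms": {"field": "lid", "size": 10_000_000},
            "aggs": {"latest": {"top_hits": {"size": 1}}},
        }
    },
}
resp = requests.post("https://localhost:9200/registry/_search",  # placeholder
                     json=query, auth=("admin", "admin"), verify=False)
print(resp.status_code, resp.json().get("error", "ok"))
```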

Plan B

Idea

Have a process triggered every time harvest ingests data. The process would have to know the current time when it started and the previous time it ran (the window could be artificially broadened to avoid border problems). It could then look for all lidvids whose ops:Harvest_Info/ops:harvest_date_time is between and including the previous time and the current time. It would then process this subset of lidvids and update the latest index where necessary.
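The windowed lookup could look roughly like this (a sketch; the date field name is taken from the paragraph above, everything else, including the endpoint, index, and page size, is a placeholder):

```python
import requests

# Find only the lidvids harvested since the last run (window padded to avoid
# border effects), so each run touches a small subset instead of all 6M+ docs.
previous_run = "2022-12-01T00:00:00Z"
current_run = "2022-12-01T01:00:00Z"
query = {
    "query": {
        "range": {
            "ops:Harvest_Info/ops:harvest_date_time": {
                "gte": previous_run,
                "lte": current_run,
            }
        }
    },
    "_source": ["lidvid"],
    "size": 10000,
}
resp = requests.post("https://localhost:9200/registry/_search",  # placeholder
                     json=query, auth=("admin", "admin"), verify=False)
lidvids = [h["_source"]["lidvid"] for h in resp.json()["hits"]["hits"]]
```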

Snags

If more than one harvest is running, some method of synchronization is needed so that only one process updates the latest index at a time, ensuring atomic updates. A one-shot would work: harvest completion always sets it high, and the start of the latest process sets it low once its current time is defined. A new latest process is started whenever the one-shot is high and no latest process is running. This way, any number of harvests can set the one-shot, and it is checked again when the current latest process ends.
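One way the one-shot could be realized, sketched here with a hypothetical flag document in OpenSearch (nothing here is decided; the index, document id, endpoint, and credentials are made up, and a real implementation would need an atomic check-and-clear rather than this GET/PUT pair):

```python
import datetime
import requests

BASE = "https://localhost:9200"              # placeholder endpoint
AUTH = ("admin", "admin")
FLAG = f"{BASE}/latest-sync/_doc/one-shot"   # hypothetical flag document

def harvest_finished():
    """Every harvest run sets the one-shot high when it completes."""
    requests.put(FLAG, json={"pending": True}, auth=AUTH, verify=False)

def maybe_run_latest_update(run_update):
    """Cron entry point: run only if the one-shot is high, clearing it first."""
    doc = requests.get(FLAG, auth=AUTH, verify=False).json()
    if not doc.get("found") or not doc["_source"].get("pending"):
        return
    window_end = datetime.datetime.utcnow().isoformat() + "Z"
    requests.put(FLAG, json={"pending": False}, auth=AUTH, verify=False)  # set low
    run_update(window_end)  # any harvest finishing meanwhile sets it high again
```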

There will be a delay between updating registry and latest. With any solution where latest is updated after harvest, a lag will exist between the registry index's belief in the largest vid and the latest index's belief, and that lag is the processing time.

tloubrieu-jpl commented 1 year ago

We will run the script as a cronjob, every 45 minutes for example on each registry.

tloubrieu-jpl commented 1 year ago

This requirement is still open since the API needs to be updated to use the new index created by the attached script

al-niessner commented 1 year ago

@jordanpadams @tloubrieu-jpl

Update to Plan B:

Using the least-memory approach, it takes nearly a second per entry to process lidvids individually, meaning the latest index is checked with a single lidvid at a time to see if the lidvid from the registry index is the latest. For speed improvements, we may need to use more memory and process the data in batches. I will change the script to the heaviest-memory approach, batch the updates, and see how fast it is. 4 million entries at roughly 100 bytes per lidvid is only about 400 MB of raw data; even with in-memory overhead it stays in the low GB range. Could be worse, and modern computers are fine with this kind of memory usage.

jordanpadams commented 1 year ago

@al-niessner sounds great. whatever you think is best let's roll with it. I agree memory should not be a concern as long as we document the system requirements so we can properly manage the necessary resources.

tloubrieu-jpl commented 1 year ago

@al-niessner did you consider the previous implementation that Eugene made with OpenSearch aggregations (see https://github.com/NASA-PDS/registry-api-service/blob/1258c795b24e3f09a826eb89bc3931bbd2441832/src/main/java/gov/nasa/pds/api/engineering/elasticsearch/business/LidVidUtils.java#L142) to handle the latest/all requests?

If not, no worries, we can keep going down the route you proposed since it is well advanced. But in the future it would be interesting to be able to use these OpenSearch aggregations, which are comparable to SQL views. I like that approach because it prevents us from having additional components to maintain for the synchronization of the additional index.

al-niessner commented 1 year ago

@al-niessner did you consider the previous implementation that Eugene made with OpenSearch aggregations (see https://github.com/NASA-PDS/registry-api-service/blob/1258c795b24e3f09a826eb89bc3931bbd2441832/src/main/java/gov/nasa/pds/api/engineering/elasticsearch/business/LidVidUtils.java#L142) to handle the latest/all requests?

If not, no worries, we can keep going down the route you proposed since it is well advanced. But in the future it would be interesting to be able to use these OpenSearch aggregations, which are comparable to SQL views. I like that approach because it prevents us from having additional components to maintain for the synchronization of the additional index.

@tloubrieu-jpl

I did try aggregations, as the notes above detail; in short, that is Plan A. Also note that the VID as a string does not sort correctly (1.10 comes before 1.2 in ascending string sorting), so it would not work in either case.

tloubrieu-jpl commented 1 year ago

Thanks @al-niessner, that makes it clear; sorry I missed that before.

al-niessner commented 1 year ago

@jimmie @jordanpadams @tloubrieu-jpl

Another Plan B surprise. The scroll page is allowed to be arbitrarily big according to the opensearch documentation. Here is the snag:

scroll page size between 1 and 10000: total number of items is 6089757
scroll page size between 10001 and 100000 (gave up at that point): total number of items is 2171732

Apparently it is not so arbitrary, and I am not sure why the total number of items found changes without any errors. Being able to go bigger makes things run faster, so it might be good to work this through.

On the good side of the fence, all the processing comfortably fits on my laptop with 32 GB of memory.
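For anyone reproducing the numbers above, a bare-bones scroll loop with a configurable page size and a sanity check of the retrieved count against hits.total (a sketch; the endpoint, index, and credentials are placeholders):

```python
import requests

BASE = "https://localhost:9200"   # placeholder
AUTH = ("admin", "admin")
PAGE = 10000                      # the size that behaved; larger sizes lost items

body = {"query": {"match_all": {}}, "_source": ["lidvid"],
        "size": PAGE, "track_total_hits": True}
resp = requests.post(f"{BASE}/registry/_search?scroll=10m", json=body,
                     auth=AUTH, verify=False).json()
expected = resp["hits"]["total"]["value"]
scroll_id, seen = resp["_scroll_id"], len(resp["hits"]["hits"])

while resp["hits"]["hits"]:
    resp = requests.post(f"{BASE}/_search/scroll",
                         json={"scroll": "10m", "scroll_id": scroll_id},
                         auth=AUTH, verify=False).json()
    scroll_id = resp["_scroll_id"]
    seen += len(resp["hits"]["hits"])

print(f"retrieved {seen} of an expected {expected}")  # these should match
```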

tloubrieu-jpl commented 1 year ago

@al-niessner that is weird. @jimmie do you know which number is closer to reality? @al-niessner which server are you using for these tests?

al-niessner commented 1 year ago

@tloubrieu-jpl I am using the AWS server for this. I would think the 6 million is the right answer; maybe when the window gets too big some node(s) fall out of the game, which would imply they are not all configured the same. All guesses, mind you, but that is where I would start collecting evidence.

al-niessner commented 1 year ago

@jimmie @jordanpadams @tloubrieu-jpl

Demise of Plan B

Since the provenance of the data is incredibly flat (6+ million lidvids, with only about 2500 fewer unique lids), using another index would double the data storage requirements. @jimmie noted that this could be a problem and that searching for a less resource-hungry solution may be worth it. I spoke in person with @tloubrieu-jpl, who suggested a 'latest' flag (true/false) that could then be used. The problem with the 'latest' flag is the same as Plan A: 6+ million entries require an update, which takes months and dictates a change to harvest. Further discussion led us to Plan C.

Plan C

Idea

While updating 6+ million documents after every ingest seems daunting to say the least, it is far better to update the roughly 2500 superseded documents with 'provenance = superseding lidvid' (hint: this can help the lid change problem too). The time to read 6 million lidvids from the current AWS cluster arrangement is about 15 minutes; this will grow as more data is added but is necessary because of staged items (see snags below). It takes seconds to reduce that to newest/oldest, and it will be far quicker to update 2500 documents than 6 million.
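The core of that Plan C update might look like the sketch below (this is not the actual support/provenance.py; it assumes the lid-to-lidvid history has already been collected and sorted, that documents are keyed by lidvid, and it uses a placeholder field name, endpoint, and credentials):

```python
import json
import requests

BASE = "https://localhost:9200"   # placeholder endpoint
AUTH = ("admin", "admin")

def mark_superseded(history, index="registry"):
    """Tag every non-latest lidvid of a lid with the lidvid that supersedes it.

    history maps lid -> lidvids sorted oldest to newest (assumed precomputed);
    only the ~2500 superseded documents get touched, not the 6M+ latest ones.
    """
    lines = []
    for lid, lidvids in history.items():
        latest = lidvids[-1]
        for old in lidvids[:-1]:
            # assumes registry documents are keyed by lidvid
            lines.append(json.dumps({"update": {"_index": index, "_id": old}}))
            lines.append(json.dumps({"doc": {"provenance": latest}}))  # placeholder field
    body = "\n".join(lines) + "\n"
    resp = requests.post(f"{BASE}/_bulk", data=body, auth=AUTH, verify=False,
                         headers={"Content-Type": "application/x-ndjson"})
    resp.raise_for_status()
```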

Snags

We will be updating documents ingested via harvest. Harvest itself requires no change.

Registry manager will need to be updated to add 'provenance' to the index as an optional field (it may not exist in all documents). After it is inserted, reindex all registry indexes (@jimmie). Until reindexing is done, we can use a post-search filter, but this is a performance hit: at 100s of items it is probably not noticeable; at 10s of thousands and more, it will eat memory and CPU resources and slow down response times noticeably.

For @jordanpadams and @tloubrieu-jpl, assuming this script is set to run every hour, should it ignore staged items? For instance, if a latest is bumped to provenance by a staged product then searching via registry API would see all lids as provenance. What behavior do you expect when items are staged?

tloubrieu-jpl commented 1 year ago

Hi @al-niessner ,

We have been discussing an iteration of plan C with @jordanpadams.

The idea is to use the 'archive_status' property to track provenance:

In the future we will need a clearer ontology of archive status --> @jshughes

For now we would like to use that, since it sounds like it has the least impact on the current system.

I don't know if we still want a cronjob to do the updates, since between the insertion of a document and the script being run we will have multiple lidvids for the same lid with status=archived, but we can discuss that more.

The way to sort VIDs is also an open question: should we convert them to floats? That would not always work, e.g. 1.9 < 1.10 as versions but not as floats. We can find code handling that in java (see https://gist.github.com/adamcbuckley/8ccae8b1ede0a65edb1756ea39bb6f2a) or python, but I don't know if anything can help to do that internally in opensearch...

tloubrieu-jpl commented 1 year ago

We can sort by splitting the version into 2 integers, major and minor, and sorting on the 2 fields.

al-niessner commented 1 year ago

@jordanpadams @tloubrieu-jpl

About using archive_status:

  1. It is part of the index already, so it is searchable - a medium plus
  2. It is used and defined for another problem, making it out of our control for this usage (it may change in a way incompatible with these needs). For instance, it currently has 3 values: a. archived, b. certified, c. staged

What happens when superseded is not compatible with those values?

Given the current 3 values, should the mapping be latest == certified and archived == superseded? I do not know the typical or default value of archive_status in the AWS system now, nor what harvest does (always staged or something else), but if millions of entries need to be updated for latest then it has the problem of Plan A (not scalable). If some archived items need to be certified for reasons outside of latest/provenance, then that makes a bigger mess yet again. For instance, can a superseded item be certified, or must it be latest to be certified, implying that once superseded it loses its certification?

No, I recommend plan C as originally described, instead of setting us up to do this again in the future when we lose control of archive_status. I do appreciate trying to reuse other stuff, like the attempt to process lidvid information in registry-api, but it is not going to work out because it was designed for a different problem.

The best time to reindex would be the move from the current cluster to the new multi-tenancy setup. I would bet nobody will notice the post filtering before then because of cluster connection lags anyway.

tloubrieu-jpl commented 1 year ago

Ok, I got your point @al-niessner .

What if we ask @jshughes to clarify the archive_status ontology? That might take some time since he would have to go through the DDWG group in charge of defining the usage of the PDS4 standard. The reason I was proposing that is that some users have already proposed to have 'superseded' as an archive status.

But as you said, I agree we should clarify the details of what the values in archive_status can be, or otherwise ignore it and make a new 'technical' attribute, provenance. Then wouldn't the value be 'superseded by {lidvid}' instead of 'superseding {lidvid}'?

al-niessner commented 1 year ago

@tloubrieu-jpl

I have no problem with waiting, but the fact we are waiting strengthens my point - the definitions are not in our control, nor will we be considered a stakeholder.

As far as the field name goes, it does not matter to me. I presume it would have to be 'ops:Provenance/ops:superseded_by:{lidvid}', where, if it is not defined, the product is the latest. It would also solve the renaming problem, since that is also a simple case of provenance and being superseded.
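On the API side, "latest only" then becomes a filter for documents where that field is absent; a sketch of the query shape (field name per the comment above; the index, endpoint, credentials, and example lid are placeholders):

```python
import requests

# Latest products are simply those that nothing supersedes: the
# ops:Provenance/ops:superseded_by field is absent from their documents.
latest_only = {
    "query": {
        "bool": {
            "must": [{"match": {"lid": "urn:nasa:pds:example"}}],
            "must_not": [{"exists": {"field": "ops:Provenance/ops:superseded_by"}}],
        }
    }
}
resp = requests.post("https://localhost:9200/registry/_search",  # placeholder
                     json=latest_only, auth=("admin", "admin"), verify=False)
print(resp.json()["hits"]["hits"])
```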

tloubrieu-jpl commented 1 year ago

Hi @al-niessner , @jordanpadams

Sorry for the delay in the reply. I needed to put my thoughts together, since there are different things going on:

One more question for @jordanpadams: @al-niessner has noticed that the registry contains very few superseded products (2500 / 6M+). Is that something we can consider stable, or, since the registry is new, will that evolve over time to where the majority of products are superseded?

tloubrieu-jpl commented 1 year ago

Note: I started a document on the registry index design: https://docs.google.com/document/d/1LtqmdflkO3ZpXlSuQfOeABSbM5nVFTkxGlJSsxM1a80/edit# It is mostly empty, but I believe we need some kind of documentation so that we do not disrupt the previous design when a bug fix or new feature is requested.

al-niessner commented 1 year ago

@jordanpadams @tloubrieu-jpl

The provenance I am suggesting works for alternate_ids as well. After all, if foo::1.4 is renamed to bar::1.0, then provenance pointing at its lidvid successor would mean foo::1.4 {ops:Provenance/ops:Superseded_by} -> bar::1.0

al-niessner commented 1 year ago

@jordanpadams @tloubrieu-jpl

I have a working script for Plan C as proposed. Going to test it on AWS. It should be easy to clean up if we want to take another pass, and the script is pretty adaptable to any of the other options as well.

al-niessner commented 1 year ago

@jordanpadams @tloubrieu-jpl

Ran to completion with no problems. Here is the output:

$ support/provenance.py -b https://search-en-prod-di7dor7quy7qwv3husi2wt5tde.us-west-2.es.amazonaws.com -c naif-prod-ccs rms-prod sbnumd-prod-ccs geo-prod-ccs atm-prod-ccs sbnpsi-prod-ccs img-prod-ccs -L INFO -u **** -p ****
2022-12-07 14:29:27,995::INFO::starting CLI processing
2022-12-07 14:29:27,995::INFO::start trolling
2022-12-07 14:29:32,533::INFO::   progress: 10000 of 2302854 (0%)
2022-12-07 14:29:36,325::INFO::   progress: 20000 of 2302854 (1%)
2022-12-07 14:29:40,973::INFO::   progress: 30000 of 2302854 (1%)
2022-12-07 14:29:42,642::INFO::   progress: 40000 of 2302854 (2%)
2022-12-07 14:29:46,449::INFO::   progress: 50000 of 2302854 (2%)
>< snip ><
2022-12-07 14:40:14,328::INFO::   progress: 2270000 of 2302854 (99%)
2022-12-07 14:40:16,633::INFO::   progress: 2280000 of 2302854 (99%)
2022-12-07 14:40:19,372::INFO::   progress: 2290000 of 2302854 (99%)
2022-12-07 14:40:21,405::INFO::   progress: 2300000 of 2302854 (100%)
2022-12-07 14:40:22,335::INFO::   progress: 2302854 of 2302854 (100%)
2022-12-07 14:40:22,755::INFO::finished trolling
2022-12-07 14:40:22,757::INFO::starting search for history
2022-12-07 14:40:22,757::INFO::   reduce lidvids to unique lids
2022-12-07 14:40:25,047::INFO::   aggregate lidvids into lid buckets
2022-12-07 14:40:27,615::INFO::   process those with history
2022-12-07 14:40:27,788::INFO::found 1898 products needing update of a 3332 full history of 2302854 total products
2022-12-07 14:40:28,044::INFO::Bulk update 1898 documents

Total time from command line looks like:

real    11m1.135s
user    0m17.990s
sys 0m0.968s

All the time is in retrieving the current state of the database or whatever we are calling it. It took 7 seconds to do everything else. If we are lucky the time will increase linearly with size of data.

jshughes commented 1 year ago

About the attribute pds:Archive_Status, there has been a LOT of debate over its values and value_meaning. The scope of the attribute was primarily limited to the phases in the ingest process but was complicated by “accumulating” data sets. The basic phases are ARCHIVED, IN_LIEN_RESOLUTION, IN_PEER_REVIEW, IN_QUEUE, LOCALLY_ARCHIVED, PRE_PEER_REVIEW. The value SAFED indicates a data set that the PDS agreed to safe-hold, but that was never submitted to the ingest process. SUPERSEDED, as noted, indicates a data set that has been replaced by another data set.

Early on there was a little discussion about adding pds:Archive_Status to PDS4 labels. It was decided to not add the attribute to the labels primarily because of the problem of syncing the value with the phases in the actual ingestion process. For example, the value “ARCHIVED” is only logically valid after the Product has completed the ingestion process. After completion however, setting the value to “ARCHIVED” would force a versioning of the Product, resulting in a new label.

The DDWG fall-back position was to punt the management of this attribute to operations and to manage it in the registry.

Bottom line, I would not touch the name, definition, values, value_meanings, or requirements of the attribute pds:Archive_Status.

However, creating a new attribute for operational requirements has always been the prerogative of operations. The “ops” steward in the IM was created for this purpose, suggesting that ops:Archive_Status could be added. Using a different attribute name might be prudent. Some of the “pds:” values and value meanings could be repurposed if useful.

Also it would probably be better to submit this "ops:" attribute for DDWG review and CCB approval, only after it is working and well tested.

tloubrieu-jpl commented 1 year ago

Thanks @jshughes for the detailed report on that. I think we already use ops:archive_status, but we have not yet formalized how we want to use it. The authorized values (as in the registry-manager tool) are:

These are a mix of concepts on lifecycle and authorization; we could add superseded to the list, and that would not hurt. But it would also help if we simplified what we use ops:archive_status for.

jordanpadams commented 1 year ago

@al-niessner LID mapping we are aware of: https://github.com/NASA-PDS/pds-api/files/10118162/lidChange.csv

csv format:

previous_lid, new_lid

for the LIDs, assume previous_lid is latest version, and new_lid is oldest version

tloubrieu-jpl commented 1 year ago

ops:archive_status becomes ops:life_cycle_status: it is used as a tracking status. The statuses we handle are:

When archive_status is switched from staged to releasing, the latest-related properties (ops:provenance) need to be updated, and archive_status is then updated to release.

alternate_id is not used and should not be used anymore

tloubrieu-jpl commented 1 year ago

@tloubrieu-jpl needs to create a ticket on registry-manager to translate the impacts of the comment above into the tool.

al-niessner commented 1 year ago

@jimmie @jordanpadams @tloubrieu-jpl

Sigh, another snag. I have been trying to figure out what is wrong with the code such that post filtering is not working. It seems that opensearch killed post filtering, which means this fix is not going to work until the keyword is rolled into the index. I am updating my local index for testing and will update NASA-PDS/registry as well.

jordanpadams commented 1 year ago

We can close this requirement as implementation complete. The following tickets will complete the deployment of these updates:

https://github.com/NASA-PDS/registry/issues/140
https://github.com/NASA-PDS/registry/issues/141