NASA-PDS / registry-api

Web API service for the PDS Registry, providing the implementation of the PDS Search API (https://github.com/nasa-pds/pds-api) for the PDS Registry.
https://nasa-pds.github.io/pds-api
Apache License 2.0
2 stars 5 forks source link

As an API user, I want to specify whether I get the latest or all versions of a product #447

Closed nutjob4life closed 2 years ago

nutjob4life commented 3 years ago

🌬 Motivation

PDS labels form references with collections and other labels in a bundle through two mechanisms: one is by specifying a full logical identifier + version identifier, or "lidvid", such as:

<Bundle_Member_Entry>
    <lidvid_reference>urn:nasa:pds:insight_documents:document_mission::2.0</lidvid_reference>
    <member_status>Primary</member_status>
</Bundle_Member_Entry>

Another is with a reference just to the logical identifier, or "lid"; for example:

<Bundle_Member_Entry>
    <lid_reference>urn:nasa:pds:ladee_mission:xml_schema_collection</lid_reference>
    <member_status>Primary</member_status>
</Bundle_Member_Entry>

In the former, a single referenced collection is indicated; in the latter, there's a choice: do we want all versions of the collection or just the latest? When searching for collections within a bundle, it would be great to be able to add a search parameter that gives that choice to the client.

🕵️ Additional Details

The use case for this is in registry version of the PDS Deep Archive. When examining "lid-only" references, the PDS Deep Archive can generate two kinds of Archive Information Packages and Submission Information Packages:

A command-line parameter, --include-latest-collection-only, turns on the second behavior.

Because the API service (and the ElasticSearch behind it) is in an ideal position to do this distinction, it should provide it as a feature. This would also reduce the over-the-wire data transferred from the API—which can be problematic for huge bundles (think insight_cameras).

Of course, if it already does support this, please close this issue and tell me how! 😊

Acceptance Criteria

See sub-tickets to this Epic for specific acceptance criteria addressing these two use cases

jordanpadams commented 3 years ago

@nutjob4life would you say this is a blocker for NASA-PDS/pds-deep-archive#106, or would it be a performance improvement?

additionally, I imagine a workaround could be to sort by version_id and take the first record?

nutjob4life commented 3 years ago

It's only a blocker if people use the -include-latest-collection-only flag. Any ideas how often that's used on the filesystem version of PDS Deep Archive?

Regardless, I really feel this is a server-side responsibility. Consider this analog:

cursor = connection.cursor()
cursor.execute('SELECT * FROM invoices')
invoices = cursor.fetchall()
invoices.sort(lambda a, b: a.date < b.date)
latest = invoices[-1]

versus:

cursor = connection.cursor()
cursor.execute('SELECT * FROM invoices ORDER BY date DESC LIMIT 1')
latest = cursor.fetchone()

The former sends the entire set of invoices over the wire for the code to sort and pick just one; the latter lets the SQL server do what it does best and sends just the needed invoice over the wire.

Maybe it's not a big deal, but then again insight_cameras was pretty shocking with its 2 day run time on the first release of PDS Deep Archive 😅 And filesystems are fast compared to HTTP!

tloubrieu-jpl commented 3 years ago

@nutjob4life @jordanpadams for now the API returns only the latest labels when a lid is requested. We could extend this behavior to return all the matching lidvid if that is what you need (I am not sure).

Otherwise, there are certainly performance optimization possible in the code. We could look at the most critical request (duration*number of calls) you are doing ? Lastly an option can be to increase the number of servers (@jimmie). That should be easy with fargate.

Thanks

nutjob4life commented 3 years ago

Ah, interesting.

Okay, given that, we could modify pds-deep-registry-archive's usage message so it says the --include-latest-collection-only is always on when using it. (Or even just remove the --include-latest-collection-only parameter and mention in the --help that it only ever gets the latest product for LID-only references.)

jordanpadams commented 3 years ago

@nutjob4life the concern here is when a bundle references collections by LID, we have to have some ability for someone to include older versions of a collection.

I am assuming that since we have this flag in pds-deep-archive, the default functionality is to include all versions of the collection. in which case, we need some ability to ask the API, I want all versions.

@tloubrieu-jpl do we have a capability to ask for all versions of a product

tloubrieu-jpl commented 3 years ago

@jordanpadams @nutjob4life , we can not get multiple versions of a lid yet but we can develop that. I will create a ticket for that now. https://github.com/NASA-PDS/registry-api/issues/438

jordanpadams commented 3 years ago

@tloubrieu-jpl can we specify a LIDVID via the API to get a specific past version?

jordanpadams commented 3 years ago

resolved per https://github.com/NASA-PDS/pds-api/commit/7472686bb9b9562c5327f3273b57d6f22b3e603d and https://github.com/NASA-PDS/registry-api-service/pull/67

nutjob4life commented 2 years ago

The latest API does add some _all and _latest calls (as of 2021-09-22):

However for the generation of Deep Archives, what we need is:

These are not yet available listed in the swagger.json file, so re-opening this issue until we can get these two methods.

While we're at it, these two might be helpful as well:

if there is ever a case where a collection has a reference to a product that is LID-only. I haven't seen it yet, but I guess it's feasible 🤷‍♀️

jordanpadams commented 2 years ago

initial implementation has been completed. improvement will be worked here: https://github.com/NASA-PDS/pds-api/issues/126

jpl-jengelke commented 2 years ago

@jordanpadams @tloubrieu-jpl Do you want the API exercised directly, or is this something where we will test these API improvements by running the deep-archive tool? If that's the case would it be appropriate to categorize it under the Deep-archive tests? Perhaps this is a template if we will use a tool for validation: https://github.com/NASA-PDS/deep-archive/issues/7#issuecomment-808980182 .

tloubrieu-jpl commented 2 years ago

Hi @jpl-jengelke , this can be tested by excercise directly the API.

On this side, we are used to work with postman (or nextman for automated tests) to write the tests. Unfortunatly, I don't have a stable version of the test suite because we don't have a stable test dataset either. Recently we integrated this reference test dataset for our dev deployments (https://pds-gamma.jpl.nasa.gov/data/pds4/test-data/registry/urn-nasa-pds-insight_rad.tar.gz) but it does not cover all the test cases we are looking for.

Hofefully your work on test automation and the reference test datasets that you want to consolidate will help us forward in that direction.

jordanpadams commented 2 days ago

Requirement is no longer valid. Rewritten in a different form.