NASA-PDS / deep-archive

PDS Open Archival Information System (OAIS) utilities, including Submission Information Package (SIP) and Archive Information Package (AIP) generators
https://nasa-pds.github.io/deep-archive/

Deep archive working with registry produces unexpected results with primary vs secondary products #164

Open nutjob4life opened 3 months ago

nutjob4life commented 3 months ago

Checked for duplicates

Yes - I've already checked

πŸ› Describe the bug

For context, please see this ticket and the discussion that followed.

πŸ•΅οΈ Expected behavior

Honestly, I'm not sure what to expect, and I'm leaving on vacation today, but I hope @jshughes can provide some details in the meantime.

📜 To Reproduce

See this ticket.

🖥 Environment Info

📚 Version of Software Used

No response

🩺 Test Data / Additional context

No response

🦄 Related requirements

🦄 #50

⚙️ Engineering Details

In PDS4, collections can be either primary or secondary members of a bundle. A primary member essentially means that, as far as the archive is concerned, this is where the collection "resides" in the archive forever. A secondary member can essentially be thought of as a symlink to a collection that does not technically belong to that bundle; it is more like a reference, for informational purposes, for the data user.
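As a rough illustration (a sketch only, using LIDs that appear further down in this thread), the two cases differ only in the member status the bundle records for them:

# Hypothetical sketch of a bundle's membership, using LIDs from later in this
# thread; the authoritative record is the Bundle_Member_Entry elements in the
# bundle label (visible in the curl output further down), not a Python dict.
bundle_members = {
    "urn:nasa:pds:insight_rad:data_raw": "Primary",                 # resides in this bundle
    "urn:nasa:pds:insight_documents:document_hp3rad": "Secondary",  # informational reference only
}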

Since the API will not be updated in the near term to support this, let's hack this by looking at the LID of the products returned.

Here is some pseudocode:

# Intent: keep only members whose LID falls under the parent's LID
bundle_lid = bundle_lidvid.split('.')[0]
for collection in get(/products/bundle_lidvid/members):
    if bundle_lid in collection.get("lid"):
        add_to_aip(collection)
        add_to_sip(collection)
        for product in get(/products/collection.get("lidvid")/members):
            if collection.get("lid") in product.get("lid"):
                add_to_aip(product)
                add_to_sip(product)
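Here is a minimal, self-contained sketch of the same idea (not the project's actual code): treat a member as primary when its LID sits directly under its parent's LID. The get_members callable and the lid/lidvid record keys are stand-ins for the real Registry API call; deriving the bundle LID from the LIDVID is left to the caller.

def is_primary_member(parent_lid, member_lid):
    # e.g. urn:nasa:pds:insight_rad:data_raw sits under urn:nasa:pds:insight_rad,
    # while urn:nasa:pds:insight_documents:document_hp3rad does not
    return member_lid.startswith(parent_lid + ":")

def primary_members(bundle_lid, bundle_lidvid, get_members):
    # get_members(lidvid) stands in for the Registry API members call and is
    # assumed to return dicts carrying "lid" and "lidvid" keys
    for collection in get_members(bundle_lidvid):
        if is_primary_member(bundle_lid, collection["lid"]):
            yield collection
            for product in get_members(collection["lidvid"]):
                if is_primary_member(collection["lid"], product["lid"]):
                    yield product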

🎉 Integration & Test

No response

jordanpadams commented 3 months ago

@nutjob4life added some details to the ticket

nutjob4life commented 3 months ago

@jordanpadams thanks, much appreciated!

tloubrieu-jpl commented 2 months ago

Need a workaround since the API cannot be ready soon for that.

jordanpadams commented 1 month ago

@nutjob4life I updated the ticket with the proposed / hacked solution

jordanpadams commented 1 month ago

@nutjob4life note: this is still blocked by https://github.com/NASA-PDS/registry/issues/185

nutjob4life commented 1 month ago

@jordanpadams copy that; standing by

nutjob4life commented 1 month ago

The "develop" branch of the registry loads data from this repository

There is one secondary collection

Thanks to @jordanpadams for pointing this out … and for providing this file over Slack

nutjob4life commented 1 month ago

Note to self: to load the file in this comment:

Start the registry

$ git clone https://github.com/NASA-PDS/registry.git
$ cd registry/docker/certs
$ ./generate-certs.sh
$ cd ..
$ docker compose --profile=int-registry-service-loader up

Let that run for a while as it does its thing. Eventually you'll see

docker-elasticsearch-1 | [DATE][INFO ][o.o.j.s.JobSweeper] [ID] Running full sweep

meaning things have more or less gone idle. Leave this running in a terminal session.

Load the file

In a new terminal session, unzip the urn-nasa-pds-insight_rad.zip file from the above comment in /tmp:

$ cd /tmp
$ unzip urn-nasa-pds-insight_rad.zip

Then create /tmp/harvest-config.xml with the following contents:

<?xml version='1.0' encoding='UTF-8'?>
<harvest nodeName='PDS_ENG'>
    <directories>
        <path>/mnt/urn-nasa-pds-insight_rad</path>
    </directories>
    <registry url='https://elasticsearch:9200' index='registry' auth='/etc/es-auth.cfg'/>
    <fileInfo>
        <fileRef replacePrefix='/mnt/urn-nasa-pds-insight_rad' with='http://localhost:81/archive'/>
    </fileInfo>
    <autogenFields/>
</harvest>

Finally, back in /where/ever/registry/docker, run:

$ docker compose --profile int-registry-batch-loader run \
    --rm --entrypoint harvest \
    --volume /tmp/harvest-config.xml:/mnt/harvest-config.xml \
    --volume /tmp/urn-nasa-pds-insight_rad:/mnt/urn-nasa-pds-insight_rad \
    registry-loader-test-init \
    -c /mnt/harvest-config.xml --overwrite

And eventually you'll see:

[INFO] Wrote 2 collection inventory document(s)
[INFO] Wrote 25 product(s)
[SUMMARY] Summary:
[SUMMARY] Skipped files: 0
[SUMMARY] Loaded files: 25
[SUMMARY]   Product_Bundle: 1
[SUMMARY]   Product_Collection: 3
[SUMMARY]   Product_Observational: 21
[SUMMARY] Failed files: 0
[SUMMARY] Package ID: 8c59a344-a141-4d73-978c-4241089d7deb

Query the API

Running

curl --request 'GET' \
    --header 'Accept: *' \
    --fail-with-body \
    'http://localhost:8080/products/urn%3Anasa%3Apds%3Ainsight_rad'

should then give a valid response, but it returns a 404 for some reason.

So forget all of the above and just use the test data without loading urn-nasa-pds-insight_rad.zip because it happens to include this already! 🤨

nutjob4life commented 1 month ago

@jordanpadams I think I need a little help in reproducing this.

I've started up a local Registry API loaded with test data and queried it with curl just to make sure urn:nasa:pds:insight_rad::2.1 is in there:

curl --request 'GET' --header 'Accept: *' --fail-with-body \
    'http://localhost:8080/products/urn%3Anasa%3Apds%3Ainsight_rad%3A%3A2.1' \
    | json_pp

And sure enough I see

{
   "id" : "urn:nasa:pds:insight_rad::2.1",
…
      "pds:Bundle_Member_Entry.pds:lid_reference" : [
         "urn:nasa:pds:insight_rad:data_raw",
         "urn:nasa:pds:insight_rad:data_calibrated",
         "urn:nasa:pds:insight_rad:data_derived",
         "urn:nasa:pds:insight_documents:document_hp3rad"
      ],
      "pds:Bundle_Member_Entry.pds:member_status" : [
         "Primary",
         "Primary",
         "Primary",
         "Secondary"
      ],
…
}
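
Pairing those two lists is enough to pick out the secondary member (a quick illustration in plain Python over the values shown above, not part of deep-archive itself):

lid_refs = [
    "urn:nasa:pds:insight_rad:data_raw",
    "urn:nasa:pds:insight_rad:data_calibrated",
    "urn:nasa:pds:insight_rad:data_derived",
    "urn:nasa:pds:insight_documents:document_hp3rad",
]
statuses = ["Primary", "Primary", "Primary", "Secondary"]
for lid, status in zip(lid_refs, statuses):
    print(f"{status:9} {lid}")
# Only urn:nasa:pds:insight_documents:document_hp3rad comes back Secondary.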

I run pds-deep-registry-archive:

pds-deep-registry-archive --url http://localhost:8080 --site PDS_ENG \
    urn:nasa:pds:insight_rad::2.1

Checking the primary references:

$ egrep -c 'data_raw|data_calibrated|data_derived' *.tab
insight_rad_v2.1_20240719_checksum_manifest_v1.0.tab:48
insight_rad_v2.1_20240719_sip_v1.0.tab:48
insight_rad_v2.1_20240719_transfer_manifest_v1.0.tab:48

But checking the secondary reference:

$ egrep -c document_hp3rad *.tab
insight_rad_v2.1_20240719_checksum_manifest_v1.0.tab:0
insight_rad_v2.1_20240719_sip_v1.0.tab:0
insight_rad_v2.1_20240719_transfer_manifest_v1.0.tab:0

So … working as intended? Or maybe I'm just not "getting" it?

jordanpadams commented 1 month ago

@nutjob4life so this may be working for bundles then, but it does not for the underlying collections. If you grep for "test" in the .tab, they should show up. If not, this may be because deep-archive skipped it because the files don't actually exist on the file system. You may need to create those test files to make that work. Sorry. Not familiar enough with how the registry and API work.

nutjob4life commented 1 month ago

Okay, thanks @jordanpadams. Let me unpack what you said:

If you grep for "test" in the .tab, they should show up

They don't:

$ egrep -c test *.tab
insight_rad_v2.1_20240719_checksum_manifest_v1.0.tab:0
insight_rad_v2.1_20240719_sip_v1.0.tab:0
insight_rad_v2.1_20240719_transfer_manifest_v1.0.tab:0

If not, this may be because deep-archive skipped it because the files don't actually exist on the file system

We're talking pds-deep-registry-archive, not pds-deep-archive; and pds-deep-registry-archive doesn't look at the filesystem; it uses the Registry API.

Isn't that the crux of this ticket? That pds-deep-registry-archive produces different results from the filesystem version because the API does not convey "secondary"-ness?

nutjob4life commented 1 month ago

FYI @jordanpadams, thanks for the pseudocode. I've implemented it as follows:

    bundlelid = bundlelidvid.split(".")[0]  # @jordanpadams, here

    for collection in _getproducts(url, bundlelidvid, allcollections):
        if bundlelid in collection['properties']['lid']:  # @jordanpadams, and here
            _addfiles(collection, bac)
            for product in _getproducts(url, collection["id"]):
                if collection['properties']['lid'] in product['properties']['lid']:  # @jordanpadams, and finally here
                    _addfiles(product, bac)

It results in trimmed down .tab files:

$ wc -l *.tab
       2 insight_rad_v2.1_20240719_checksum_manifest_v1.0.tab
       2 insight_rad_v2.1_20240719_sip_v1.0.tab
       2 insight_rad_v2.1_20240719_transfer_manifest_v1.0.tab
       6 total

So I think I'm definitely missing the point! Should we split on :: instead of .?
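
(For reference, here is what each split gives on that bundle LIDVID; this is plain Python string behavior, nothing registry-specific:)

bundle_lidvid = "urn:nasa:pds:insight_rad::2.1"
print(bundle_lidvid.split(".")[0])    # urn:nasa:pds:insight_rad::2   ('.' only occurs in the version)
print(bundle_lidvid.split("::")[0])   # urn:nasa:pds:insight_rad      ('::' separates the LID from the VID)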

I've got to put out a CrowdStrike fire so I'll check back on this ticket over the weekend 😅

nutjob4life commented 1 month ago

Tried with :: instead of . and again got the svelte 2-line .tab files

Will try again when my thinking is clearer later on

Not familiar enough with how the registry and API work

Hah! Welcome to my world 😏

nutjob4life commented 1 month ago

Update: multiple attempts to load custom data into a local registry have met with frustrating failure

Will try again on Monday

nutjob4life commented 1 month ago

Monday update: okay, so I finally figured out why my specially-crafted test data isn't getting loaded: file: URLs are not supported 🙄

jordanpadams commented 1 month ago

@nutjob4life per one of your comments above, we should split on ::

bundlelid = bundlelidvid.split("::")[0]

nutjob4life commented 1 month ago

@nutjob4life per one of your comments above, we should split on ::

Yep, tried it

jordanpadams commented 1 month ago

@nutjob4life here is an updated data set with those test foo/bar products actually included, and the LIDs updated to be valid.

urn-nasa-pds-insight_rad.zip

nutjob4life commented 1 month ago

Oh okay, going from urn:nasa:pds:test:foo → urn:nasa:pds:test:foo:foo, urn:nasa:pds:test:bar → urn:nasa:pds:test:bar:bar::1.0 … will give that a shot shortly

nutjob4life commented 1 month ago

@jordanpadams okay, loaded the ZIP file from your comment into my local registry and generated a deep archive against it; here's what I get:

mirasol 240 % .v/bin/pds-deep-registry-archive --url http://localhost:8080 --site PDS_ENG urn:nasa:pds:insight_rad::2.1
INFO 👟 PDS Deep Registry-based Archive, version 1.3.0
INFO 📄 Wrote AIP checksum manifest insight_rad_v2.1_20240722_checksum_manifest_v1.0.tab with 50 entries
INFO 📄 Wrote AIP transfer manifest insight_rad_v2.1_20240722_transfer_manifest_v1.0.tab with 50 entries
INFO 📄 Wrote label for them both: insight_rad_v2.1_20240722_aip_v1.0.xml
INFO 📄 Wrote SIP insight_rad_v2.1_20240722_sip_v1.0.tab with 50 entries
INFO 📄 Wrote label for SIP: insight_rad_v2.1_20240722_sip_v1.0.xml
INFO 👋 Thanks for using this program! Bye!
mirasol 241 % egrep 'foo|bar' *.tab
mirasol 242 % egrep -c 'urn:nasa:pds:test' *.tab
insight_rad_v2.1_20240722_checksum_manifest_v1.0.tab:0
insight_rad_v2.1_20240722_sip_v1.0.tab:0
insight_rad_v2.1_20240722_transfer_manifest_v1.0.tab:0
mirasol 243 % 

The registry never emitted the secondaries. I ran it again, turning on --debug logging, so I could grab the actual URL the Deep Archive uses:

DEBUG Making request to http://localhost:8080/products/urn:nasa:pds:insight_rad:data_raw::8.0/members/all with params {'sort': 'ops:Harvest_Info.ops:harvest_date_time', 'limit': 50}

And tried it myself with curl:

mirasol 254 % curl --silent 'http://localhost:8080/products/urn:nasa:pds:insight_rad:data_raw::8.0/members/all' | json_pp | egrep -c 'urn:nasa:pds:test'
0
mirasol 255 %

All the primaries are there, sure:

mirasol 259 % curl --silent 'http://localhost:8080/products/urn:nasa:pds:insight_rad:data_raw::8.0/members/all' | json_pp | egrep -c 'urn:nasa:pds:insight_rad:data_raw:hp3_rad_raw_00' 
42
mirasol 260 % 

That makes sense; there are 7 primaries in the collection_data_rad_raw.csv file, and 42 ÷ 7 = 6 (they get mentioned 6 times each in the output).

Takeaway: my local registry is smart enough not to say a single peep about secondaries, or something else is going on in that ticket.

PS: Just to make sure I wasn't using some other registry or was using older data or was otherwise confused, I edited collection_data_rad_raw.csv and removed all the primaries (leaving the two "S," secondary lines), re-created the .tar.gz file, re-hosted it on a web server, destroyed my Docker composition/volumes/orphans, and re-started it with the profile that re-loads all the data. And this time those primaries are gone too, proving I was using the updated data:

mirasol 271 % curl --silent 'http://localhost:8080/products/urn:nasa:pds:insight_rad:data_raw::8.0/members/all' | json_pp | egrep -c 'urn:nasa:pds:insight_rad:data_raw:hp3_rad_raw_00'
0

And the secondaries still don't appear:

mirasol 272 % curl --silent 'http://localhost:8080/products/urn:nasa:pds:insight_rad:data_raw::8.0/members/all' | json_pp | egrep -c 'urn:nasa:pds:test'
0

Nor do they appear in the generated deep archive, as before:

mirasol 274 % .v/bin/pds-deep-registry-archive --quiet --url http://localhost:8080 --site PDS_ENG urn:nasa:pds:insight_rad::2.1
mirasol 275 % egrep 'foo|bar' *.tab
mirasol 276 % egrep -c 'urn:nasa:pds:test' *.tab
insight_rad_v2.1_20240722_checksum_manifest_v1.0.tab:0
insight_rad_v2.1_20240722_sip_v1.0.tab:0
insight_rad_v2.1_20240722_transfer_manifest_v1.0.tab:0

My Docker composition is running nasapds/registry-api-service:1.4.0. Should I be trying a different version? 1.4.0 might have a strong prejudice against secondaries or something? 😜

jordanpadams commented 1 month ago

@nutjob4life copy that. Ok, maybe let's pause on this ticket and jump back to the WordPress CD work for now, until we have the new registry and API up and running to test against.

nutjob4life commented 1 month ago

@jordanpadams sure thing.

Say, is there a chance we can find out how they're invoking pds-deep-registry-archive in the other ticket?

jordanpadams commented 1 month ago

@nutjob4life I could ask, but, honestly, I doubt they remember at this point. I also think this person has moved positions.

For now, I would just say let's pause and see how the software operates with the new registry and API up and running. If we can't reproduce it and the issue happens again, then we can go from there and sort out how they are running the software.

nutjob4life commented 1 month ago

@jordanpadams okie doke … letting go for now