NASA-PDS / operations

Tickets for the PDSEN Operations Team

Deployment of Registry Loader Tools and Initial Ingestion of Engineering Node Registry on Latest AWS Deployment #270

Closed tloubrieu-jpl closed 1 year ago

tloubrieu-jpl commented 2 years ago

💡 Description

List of products to ingest:

rchenatjpl commented 2 years ago

@tloubrieu-jpl or anyone: sorry for the late start. 1) Am I really installing on pdscloud-prod2? Shouldn’t I do this on pdscloud-gamma? 2) On ~pds4 on both machines, java -version says 1.8. Should I install my own JDK 11 in my local dir? Thanks

tloubrieu-jpl commented 2 years ago

Hi @rchenatjpl , you don't look late to me.

1) I think we said you should install on pdscloud-prod1, but 2 works as well. You don't need any local access to databases. Access to OpenSearch is through HTTP, and OpenSearch is a service managed by AWS and hosted on a different system, so harvest and registry-manager could be installed anywhere. You could install on pds-gamma, but then you would need to configure the deployment to use a staging registry. I am not sure if one is accessible (@jimmie can you answer that?). I am not sure we need to do that right now, so to me we can keep the initial plan to deploy on the production venue.

As a reminder, the different common EN venues on the diagram:

[image: diagram of the common EN venues]

2) @c-suh do you know if jdk11 is available on the pdscloud-* machines? Which version are you using for validate? In case it is not available yet, you could deploy it in the PDS4 home folder. I don't feel like we need to ask the SAs for that. For AWS deployments, I am leaning toward: anything that the SAs are letting us do, let's do it ourselves. Does that make sense?

Thanks Richard,

Thomas

c-suh commented 2 years ago

@tloubrieu-jpl jdk11 wasn't available on the pdscloud-* machines, so I've installed it on gamma to start and will have it on the others by the end of today. A note that installing it to where the existing java is was not possible because it required root access, so I did it to the pds4 home directory as you suggested above. Another note that I've also installed jenv to manage multiple java versions (leaving the existing jdk8 as global and will make jdk11 the default in the directories that Richard will create/install), since I imagine that a few of our older tools might not play well with this java version.

rchenatjpl commented 2 years ago

OK, I shoved harvest and registry into pds4cloud-prod1:/usr/local/build11/. In the Description at the top of this page, I don't understand what this means: "in /usr/local/build11/ --> /usr/local/applications". Should I build a softlink in the latter? Hopefully, it doesn't matter.

@jimmie or @tloubrieu-jpl What's the URL for Kibana on pds4cloud-prod1? I did get the password. How does Kibana know where the registry sits?

jimmie commented 2 years ago

The Opensearch dashboard (fka Kibana) is associated with the particular Opensearch domain. For the EN registry, this is at: https://search-en-prod-di7dor7quy7qwv3husi2wt5tde.us-west-2.es.amazonaws.com/_dashboards

rchenatjpl commented 2 years ago

Thanks. So I assume something's not configured correctly:

[pds4@pdscloud-prod1 test]$ ls -l directories.xml
-rw-r--r--. 1 pds4 pds 1866 Jun 16 15:07 directories.xml
[pds4@pdscloud-prod1 test]$ pwd
/home/pds4/test
[pds4@pdscloud-prod1 test]$ /usr/local/build11/harvest-3.6.0/bin/harvest -c directories.xml
[SUMMARY] Reading configuration from /data/home/pds4/test/directories.xml
[SUMMARY] Output directory: /tmp/harvest/out
[SUMMARY] Elasticsearch URL: http://localhost:9200, index: registry
[INFO] Connecting to Elasticsearch
[ERROR] Connection refused

rchenatjpl commented 2 years ago

@tloubrieu-jpl Wait, should I create the registry first?

[pds4@pdscloud-prod1 ~]$ /usr/local/build11/registry-manager-4.4.0/bin/registry-manager create-registry
Elasticsearch URL: http://localhost:9200
Creating index...
   Index: registry
  Schema: /usr/local/build11/registry-manager-4.4.0/elastic/registry.json
  Shards: 1
Replicas: 0
[ERROR] Connection refused

localhost:9200 is some elastic search thing? How do I start that, presumably on pdscloud-prod1? Thanks

jimmie commented 2 years ago

No, the registry is already created. Just need to load the data.

The Opensearch endpoint (i.e. the 'Elasticsearch URL') for the production EN registry is https://search-en-prod-di7dor7quy7qwv3husi2wt5tde.us-west-2.es.amazonaws.com:443

rchenatjpl commented 2 years ago

@jimmie @tloubrieu-jpl Ah. So in the harvest config file, I should change

  <registry url="http://localhost:9200" index="registry" />
to
  <registry url="https://search-en-prod-di7dor7quy7qwv3husi2wt5tde.us-west-2.es.amazonaws.com:443" index="registry" />

I hope that's right. Now I'm getting

[pds4@pdscloud-prod1 test]$ /usr/local/build11/harvest-3.6.0/bin/harvest -c directories.xml
[SUMMARY] Reading configuration from /data/home/pds4/test/directories.xml
[SUMMARY] Output directory: /tmp/harvest/out
[SUMMARY] Elasticsearch URL: https://search-en-prod-di7dor7quy7qwv3husi2wt5tde.us-west-2.es.amazonaws.com:443, index: registry
[INFO] Connecting to Elasticsearch
[ERROR] method [GET], host [https://search-en-prod-di7dor7quy7qwv3husi2wt5tde.us-west-2.es.amazonaws.com:443], URI [/registry/_mappings], status line [HTTP/1.1 403 Forbidden]
{"Message":"User: anonymous is not authorized to perform: es:ESHttpGet because no resource-based policy allows the es:ESHttpGet action"}

So I need to set the user? I don't see an option on the harvest command line or in the config file

jimmie commented 2 years ago

The Harvest documentation is here (at least for v1.0.3): https://nasa-pds.github.io/pds-registry-app/operate/harvest.html

In the config:

The last element is the auth file, which has the form of:

user=
password=

Put in the credentials for your Opensearch login that I LFT'd to you (and you changed password). Enter the full path to this file in the Harvest config.
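As a rough illustration of the auth file described above, here is a minimal shell sketch that writes one with owner-only permissions. The `write_auth_file` helper and the example path are hypothetical; the `user` and `password` keys follow the form given in this comment, and the current Harvest documentation remains the authority on the exact format.

```shell
# Sketch: write a harvest auth file that only its owner can read.
# Assumption: the file needs just the "user" and "password" keys.
write_auth_file() {
    auth_path=$1
    auth_user=$2
    auth_password=$3
    # Create (or truncate) the file under a restrictive umask, then
    # tighten permissions in case the file already existed.
    ( umask 077 && : > "$auth_path" ) || return 1
    chmod 600 "$auth_path" || return 1
    {
        printf 'user = %s\n' "$auth_user"
        printf 'password = %s\n' "$auth_password"
    } >> "$auth_path"
}

# Example (hypothetical path and credentials):
# write_auth_file "$HOME/test/auth.txt" my_user 'my_password'
```

The full path of the resulting file is what goes in the `auth` attribute of the `registry` element in the Harvest config.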

tloubrieu-jpl commented 2 years ago

@rchenatjpl the link @jimmie sent is on the old documentation, the new one is here https://nasa-pds.github.io/registry/

For the authentication management in harvest job configuration file you can refer to https://nasa-pds.github.io/registry/user/harvest_job_configuration.html#registry-integration

tloubrieu-jpl commented 2 years ago

@jimmie I redirected all the pages of the obsolete repository to the new documentation (see https://nasa-pds.github.io/pds-registry-app/operate/harvest.html), so that anyone who kept these pages in their bookmarks is redirected to the new page. Previously that was true from the landing page of the documentation, but not from every page.

rchenatjpl commented 2 years ago

@tloubrieu-jpl @jimmie

[pds4@pdscloud-prod1 test]$ /usr/local/build11/harvest-3.6.0/bin/harvest -c directories.xml
[SUMMARY] Reading configuration from /data/home/pds4/test/directories.xml
[SUMMARY] Output directory: /tmp/harvest/out
[SUMMARY] Elasticsearch URL: https://search-en-prod-di7dor7quy7qwv3husi2wt5tde.us-west-2.es.amazonaws.com:443, index: registry
[INFO] Connecting to Elasticsearch
[ERROR] method [GET], host [https://search-en-prod-di7dor7quy7qwv3husi2wt5tde.us-west-2.es.amazonaws.com:443], URI [/registry/_mappings], status line [HTTP/1.1 403 Forbidden]
{"Message":"User: anonymous is not authorized to perform: es:ESHttpGet because no resource-based policy allows the es:ESHttpGet action"}

[pds4@pdscloud-prod1 test]$ grep auth directories.xml

[pds4@pdscloud-prod1 test]$ cat /home/pds4/test/auth.txt

# true - trust self-signed certificates; false - don't trust.
trust.self-signed = true
user = xxx
password = xxx

jimmie commented 2 years ago

We need to have pdscloud-prod1 and pdscloud-prod2 added to the OpenSearch whitelist. I will file a ticket to have that done.
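To tell an access-policy refusal like the 403 above apart from a credentials problem or an unreachable host, one option is to look only at the HTTP status code that curl reports. The sketch below is an illustration, not a procedure from this ticket: `explain_status` is a hypothetical helper, and `OS_URL` and `AUTH_USER` in the commented example are placeholders.

```shell
# Sketch: turn an HTTP status code from the OpenSearch endpoint into a hint.
explain_status() {
    case $1 in
        200) echo "reachable and authorized" ;;
        401) echo "credentials missing or rejected" ;;
        403) echo "request refused by the access policy" ;;
        000) echo "no HTTP response (connection refused or timed out)" ;;
        *)   echo "unexpected status $1" ;;
    esac
}

# Example (needs network access; OS_URL and AUTH_USER are placeholders):
# status=$(curl -s -o /dev/null -w '%{http_code}' -u "$AUTH_USER" "$OS_URL/registry/_count")
# explain_status "$status"
```

curl prints `000` for `%{http_code}` when it never receives an HTTP response, which matches the "Connection refused" case seen earlier with `localhost:9200`.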

rchenatjpl commented 2 years ago

oh, thank goodness, i assumed it was user error

c-suh commented 2 years ago

@rchenatjpl, @tloubrieu-jpl, and @viviant100, it sounds like the EN-specific documentation wasn't looked at, which means that (1) the java version isn't set and (2) the installation isn't in the new deployment directory we had the SAs set up. Should I address these or work with @rchenatjpl on these? Also confirming that this should be on only one of the production machines and not both?

rchenatjpl commented 2 years ago

Did I miss steps? That's entirely possible. It's hard for me to chase down links and remember where I am.

tloubrieu-jpl commented 2 years ago

@rchenatjpl I think you need to:

@c-suh how popular is 'jenv'? Is it widely used in Joel's environments? I never used it and was happy with setting JAVA_HOME. If we use it, which I am not against, we need to document that and make it a standard in our deployments, because I would worry if we sometimes set JAVA_HOME manually and sometimes use jenv. That can become messy.

Thanks

c-suh commented 2 years ago

@tloubrieu-jpl pardon, "Joel's environments"? jenv seems to be used fairly commonly to manage multiple java versions, and I think it's much friendlier than alternatives or sdkman, because you can set local environments once (like with pyenv) and forget about it, as opposed to having to manually switch between versions. However, I had done this (installed a version manager rather than switching entirely to version 11) because I assumed that upgrading the entire system to jdk11 would have broken some of our older tools. Apologies for not posing this as a question or making it more obvious at the end of my comment above. So, should we upgrade entirely to jdk11 or use a java version manager?

Additionally, I did not see anything in the public documentation about configuring any endpoints, but is there anything we need to let the SAs know to use the new versions of these tools? There was the related question in the Slack channel regarding the new deployment directory (and possible standardized procedure), because currently, any upgrade to these tools requires letting the SAs know to re-point to the new versions.

As for documentation, regarding setting the java version locally, I included this as step 2 of "Installation" for both Standalone Harvest and Registry Manager. Regarding the initial installation and setup of jenv, I kept personal notes but did not post it in the internal wiki. Should I create a page for now under the Software page, and it can be moved later to a page or section that is not for PDS-specific software?

@rchenatjpl yup, I linked the documentation in Slack channel, but I see now that it's hardly visible amidst all the other blue of the mentions. Too many conveniences cluttered together. Here is the internal documentation for Standalone Harvest and Registry Manager. These can also be gotten to from the Registry section of Software Installation and Deployment Guides

c-suh commented 2 years ago

@tloubrieu-jpl and @viviant100 as suggested then agreed upon, I have removed jenv and instead inserted a line to call another script which has the jdk11 path. This has been tested on gamma and the documentation has been updated for when Richard installs this on production. I've noticed that the pds account on gamma has a pds-registry-app-1.0.2 which contains harvest-3.5.1 and registry-manager-4.3.0. Should this be left alone or deleted?
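For readers unfamiliar with the JAVA_HOME approach mentioned here, a minimal sketch of what such a helper script could contain is shown below. It is not the script that was actually deployed: `use_jdk` is a hypothetical name and the JDK location in the example is a made-up path.

```shell
# Sketch: put a given JDK first on PATH for the current shell session.
use_jdk() {
    JAVA_HOME=$1
    PATH="$JAVA_HOME/bin:$PATH"
    export JAVA_HOME PATH
}

# Example; the JDK location is a hypothetical path, not the real one:
# use_jdk "$HOME/jdk-11"
# java -version
```

Because the change only affects the shell that calls it, the system-wide jdk8 stays the default for older tools.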

viviant100 commented 2 years ago

I would suggest deleting it if it's not in use.

c-suh commented 2 years ago

@rchenatjpl I saw your comment on the internal Registry Manager wiki page, and it sounds like it installed successfully! Please confirm if

  1. standalone harvest is installed on cloud prod1
  2. registry manager is installed on cloud prod1
  3. anything else?

rchenatjpl commented 2 years ago

@c-suh @tloubrieu-jpl @jimmie I installed registry-manager and harvest, but the next question of course is how to hook them up. Is there a way to check if registry-manager is already running? If I should run registry-manager on prod1 myself, is there a way to check if OpenSearch is running? So I tried:

[pds4@pdscloud-prod1 test]$ registry-manager create-registry
Elasticsearch URL: http://localhost:9200
Creating index...
   Index: registry
  Schema: /usr/local/applications/registry-manager/elastic/registry.json
  Shards: 1
Replicas: 0
[ERROR] Connection refused

Hopefully that means registry-manager is running already on the machine I'm supposed to connect to. Is that localhost:9200 or the long amazonaws URL that Jimmie sent a while ago? I tried the latter. In ~pds4/test/directories.xml:

and that auth file does exist and does hold what Jimmie sent. I did not change the password - I just want stuff to work once before doing so. Anyway,

[pds4@pdscloud-prod1 test]$ harvest -c directories.xml 
[SUMMARY] Reading configuration from /data/home/pds4/test/directories.xml
[SUMMARY] Output directory: /tmp/harvest/out
[SUMMARY] Elasticsearch URL: https://search-en-prod-di7dor7quy7qwv3husi2wt5tde.us-west-2.es.amazonaws.com:443, index: registry
[INFO] Connecting to Elasticsearch
[ERROR] method [GET], host [https://search-en-prod-di7dor7quy7qwv3husi2wt5tde.us-west-2.es.amazonaws.com:443], URI [/registry/_mappings], status line [HTTP/1.1 403 Forbidden]
{"Message":"User: anonymous is not authorized to perform: es:ESHttpGet because no resource-based policy allows the es:ESHttpGet action"}

And for good measure I tried localhost:9200

[pds4@pdscloud-prod1 test]$ diff directories.xml dir2.xml
22,23c22,23
<   <!--registry url="http://localhost:9200" index="registry" /-->
<   <registry url="https://search-en-prod-di7dor7quy7qwv3husi2wt5tde.us-west-2.es.amazonaws.com:443" index="registry" auth="/home/pds4/test/auth.txt" />
---
>   <registry url="http://localhost:9200" index="registry" />
>   <!--registry url="https://search-en-prod-di7dor7quy7qwv3husi2wt5tde.us-west-2.es.amazonaws.com:443" index="registry" auth="/home/pds4/test/auth.txt" /-->
[pds4@pdscloud-prod1 test]$ harvest -c dir2.xml
[SUMMARY] Reading configuration from /data/home/pds4/test/dir2.xml
[SUMMARY] Output directory: /tmp/harvest/out
[SUMMARY] Elasticsearch URL: http://localhost:9200, index: registry
[INFO] Connecting to Elasticsearch
[ERROR] Connection refused

I tried yet another version with the auth file attached to localhost, but same thing. I'll keep plugging away, but hopefully someone can spot the problem easily. Thanks

jordanpadams commented 2 years ago

@rchenatjpl

Is there a way to check if registry-manager is already running?

As a general idea, registry-manager is just a command-line tool for interacting with OpenSearch (aka the Registry), similar to validate or the legacy search-core. So it just runs when you tell it to run. That being said, if you want to check if you or someone else is running it right now:

ps aux | grep '[r]egistry-manager'

Is that localhost:9200 or the long amazonaws URL that Jimmie sent a while ago?

I believe it is the long URL, but I will leave that to @jimmie

jimmie commented 2 years ago

@rchenatjpl - the admins have enabled access to en-prod. I need to test. If that works, they will enable it for all of the other OpenSearch domains.

To answer your question regarding accessing Opensearch, you will need to use the URI: https://search-en-prod-di7dor7quy7qwv3husi2wt5tde.us-west-2.es.amazonaws.com/

jimmie commented 2 years ago

I have verified that I can access en-prod from pdscloud-prod1 and pdscloud-prod2. @rchenatjpl - you should now be unblocked in this respect.

rchenatjpl commented 2 years ago

Thanks, @jimmie. OK, am I doing something wrong? I think I ingested, but I see nothing in OpenSearch. Here's the URL I used, and I did type in my username/password: https://search-en-prod-di7dor7quy7qwv3husi2wt5tde.us-west-2.es.amazonaws.com/registry/_search?q=*&amp;pretty The web page says "message: null". I thought populating the registry had worked, but maybe not. Here's that output. Thanks,

% harvest -c directories.xml 
[SUMMARY] Reading configuration from /data/home/pds4/test/directories.xml
[SUMMARY] Output directory: /tmp/harvest/out
[SUMMARY] Elasticsearch URL: https://search-en-prod-di7dor7quy7qwv3husi2wt5tde.us-west-2.es.amazonaws.com:443, index: registry
[INFO] Connecting to Elasticsearch
[INFO] Loading PDS to ES data type mapping from /usr/local/applications/harvest/elastic/data-dic-types.cfg
[INFO] Processing directory: /home/rchen/testdata
[INFO] Processing /home/rchen/testdata/mission.apollo_11_1.0.xml
[INFO] Updating LDDs.
[INFO] Updating 'pds' LDD. Schema location: https://pds.nasa.gov/pds4/pds/v1/PDS4_PDS_1G00.xsd
[INFO] Downloading https://pds.nasa.gov/pds4/pds/v1/PDS4_PDS_1G00.JSON to /tmp/LDD-1122365850083766232.JSON
Jul 07, 2022 11:13:47 AM org.apache.http.client.protocol.ResponseProcessCookies processCookies
WARNING: Invalid cookie header: "Set-Cookie: AWSALB=jxy0hDiJzAnA6jOK6Ms79duG82lRLHHM1yJp9atrIYZGatIWZj79kEae5PgMPoZdJSMZtD0CCNljnniiCOKIyw29nsGdPPv8P/jt9MwoQN+TaNSacoLEVuEIGeTi; Expires=Thu, 14 Jul 2022 18:13:47 GMT; Path=/". Invalid 'expires' attribute: Thu, 14 Jul 2022 18:13:47 GMT
Jul 07, 2022 11:13:47 AM org.apache.http.client.protocol.ResponseProcessCookies processCookies
WARNING: Invalid cookie header: "Set-Cookie: AWSALBCORS=jxy0hDiJzAnA6jOK6Ms79duG82lRLHHM1yJp9atrIYZGatIWZj79kEae5PgMPoZdJSMZtD0CCNljnniiCOKIyw29nsGdPPv8P/jt9MwoQN+TaNSacoLEVuEIGeTi; Expires=Thu, 14 Jul 2022 18:13:47 GMT; Path=/; SameSite=None; Secure". Invalid 'expires' attribute: Thu, 14 Jul 2022 18:13:47 GMT
[INFO] Creating temporary ES data file /tmp/es-10993609303684127597.json
[INFO] Loading ES data file: /tmp/es-10993609303684127597.json
[INFO] Loaded 500 document(s)
[INFO] Loaded 1000 document(s)
[INFO] Loaded 1326 document(s)
[INFO] Updating Elasticsearch schema.
[INFO] Updated 8 fields
[INFO] Processing /home/rchen/testdata/orexsmall/naf018_sff/collection_radioscience_naf018_sff.xml
[INFO] Updating LDDs.
[INFO] Updating 'pds' LDD. Schema location: https://pds.nasa.gov/pds4/pds/v1/PDS4_PDS_1G00.xsd
[INFO] This LDD already loaded.
[INFO] Updating Elasticsearch schema.
[INFO] Updated 4 fields
[INFO] Wrote 1 collection inventory document(s)
[INFO] Processing /home/rchen/testdata/orexsmall/naf018_sff/cruise/orx_r_160928_161002_v01.xml
[INFO] Updating LDDs.
[INFO] Updating 'pds' LDD. Schema location: https://pds.nasa.gov/pds4/pds/v1/PDS4_PDS_1G00.xsd
[INFO] This LDD already loaded.
[INFO] Updating 'orex' LDD. Schema location: https://pds.nasa.gov/pds4/mission/orex/v1/orex_ldd_OREX_1400.xsd
[INFO] Downloading https://pds.nasa.gov/pds4/mission/orex/v1/orex_ldd_OREX_1400.JSON to /tmp/LDD-658827965114614068.JSON
Jul 07, 2022 11:14:02 AM org.apache.http.client.protocol.ResponseProcessCookies processCookies
WARNING: Invalid cookie header: "Set-Cookie: AWSALB=pfsB5Ur8aknV2/3HqQM8HNTdJq3I51F49smNs0IhitpX/n8NHnYBguI1eZumc2zOMWQypX3FWUjUfxzrORfzvShJklE7UpWg51+mlgVSv95P8SxEex8UuGbd+UOQ; Expires=Thu, 14 Jul 2022 18:14:02 GMT; Path=/". Invalid 'expires' attribute: Thu, 14 Jul 2022 18:14:02 GMT
Jul 07, 2022 11:14:02 AM org.apache.http.client.protocol.ResponseProcessCookies processCookies
WARNING: Invalid cookie header: "Set-Cookie: AWSALBCORS=pfsB5Ur8aknV2/3HqQM8HNTdJq3I51F49smNs0IhitpX/n8NHnYBguI1eZumc2zOMWQypX3FWUjUfxzrORfzvShJklE7UpWg51+mlgVSv95P8SxEex8UuGbd+UOQ; Expires=Thu, 14 Jul 2022 18:14:02 GMT; Path=/; SameSite=None; Secure". Invalid 'expires' attribute: Thu, 14 Jul 2022 18:14:02 GMT
[INFO] Creating temporary ES data file /tmp/es-9623659808871018744.json
[INFO] Loading ES data file: /tmp/es-9623659808871018744.json
[INFO] Loaded 285 document(s)
[INFO] Updating Elasticsearch schema.
[INFO] Updated 15 fields
[INFO] Processing /home/rchen/testdata/orexsmall/naf018_sff/cruise/orx_r_160919_160922_v01.xml
[INFO] Processing /home/rchen/testdata/orexsmall/naf018_sff/cruise/orx_r_160909_160913_v01.xml
[INFO] Processing /home/rchen/testdata/orexsmall/naf018_sff/cruise/orx_r_160929_161006_v01.xml
[INFO] Processing /home/rchen/testdata/orexsmall/naf018_sff/cruise/orx_r_160922_160929_v01.xml
[INFO] Processing /home/rchen/testdata/orexsmall/naf018_sff/cruise/orx_r_160915_160919_v01.xml
[INFO] Processing /home/rchen/testdata/orexsmall/trk223_ion_vlbi/orex_beno_2021_114_2021_118_ion.xml
[INFO] Processing /home/rchen/testdata/orexsmall/trk223_ion_vlbi/orex_beno_2018_306_2018_334_ion.xml
[INFO] Processing /home/rchen/testdata/orexsmall/trk223_ion_vlbi/collection_trk223_ion_vlbi.xml
[INFO] Updating LDDs.
[INFO] Updating 'pds' LDD. Schema location: https://pds.nasa.gov/pds4/pds/v1/PDS4_PDS_1G00.xsd
[INFO] This LDD already loaded.
[INFO] Updating Elasticsearch schema.
[INFO] Updated 2 fields
[INFO] Wrote 1 collection inventory document(s)
[INFO] Processing /home/rchen/testdata/orexsmall/trk223_ion_vlbi/orex_beno_2019_060_2019_064_ion.xml
[INFO] Processing /home/rchen/testdata/orexsmall/trk223_ion_vlbi/orex_beno_2018_244_2018_273_ion.xml
[INFO] Processing /home/rchen/testdata/orexsmall/trk223_ion_vlbi/orex_beno_2019_259_2019_264_ion.xml
[INFO] Processing /home/rchen/testdata/orexsmall/trk223_ion_vlbi/orex_beno_2020_129_2020_131_ion.xml
[INFO] Processing /home/rchen/testdata/orexsmall/trk223_ion_vlbi/orex_beno_2018_336_2018_364_ion.xml
[INFO] Processing /home/rchen/testdata/orexsmall/trk223_ion_vlbi/orex_beno_2019_059_2019_059_ion.xml
[INFO] Processing /home/rchen/testdata/orexsmall/trk223_ion_vlbi/orex_beno_2020_030_2020_030_ion.xml
[INFO] Processing /home/rchen/testdata/orexsmall/trk223_ion_vlbi/orex_beno_2020_032_2020_055_ion.xml
[INFO] Processing /home/rchen/testdata/orexsmall/trk223_ion_vlbi/orex_beno_2018_274_2018_304_ion.xml
[INFO] Processing /home/rchen/testdata/orexsmall/trk223_ion_vlbi/orex_beno_2018_231_2018_242_ion.xml
[INFO] Processing /home/rchen/testdata/orexsmall/trk223_ion_vlbi/orex_beno_2020_259_2020_270_ion.xml
[INFO] Processing /home/rchen/testdata/orexsmall/trk223_ion_vlbi/orex_beno_2019_001_2019_002_ion.xml
[INFO] Processing /home/rchen/testdata/orexsmall/trk223_ion_vlbi/orex_beno_2019_159_2019_162_ion.xml
[INFO] Processing /home/rchen/testdata/orexsmall/trk223_ion_vlbi/orex_beno_2020_214_2020_214_ion.xml
[INFO] Processing /home/rchen/testdata/orexsmall/trk223_ion_vlbi/orex_beno_2021_121_2021_150_ion.xml
[INFO] Processing /home/rchen/testdata/orexsmall/trk223_ion_dopr/orex_beno_2020_214_2020_245_ion.xml
[INFO] Processing /home/rchen/testdata/orexsmall/trk223_ion_dopr/orex_beno_2019_060_2019_091_ion.xml
[INFO] Processing /home/rchen/testdata/orexsmall/trk223_ion_dopr/orex_beno_2019_152_2019_182_ion.xml
[INFO] Processing /home/rchen/testdata/orexsmall/trk223_ion_dopr/orex_beno_2021_032_2021_060_ion.xml
[INFO] Processing /home/rchen/testdata/orexsmall/trk223_ion_dopr/orex_beno_2020_336_2021_001_ion.xml
[INFO] Processing /home/rchen/testdata/orexsmall/trk223_ion_dopr/orex_beno_2019_244_2019_274_ion.xml
[INFO] Processing /home/rchen/testdata/orexsmall/trk223_ion_dopr/orex_beno_2021_001_2021_032_ion.xml
[INFO] Processing /home/rchen/testdata/orexsmall/trk223_ion_dopr/orex_beno_2019_305_2019_335_ion.xml
[INFO] Processing /home/rchen/testdata/orexsmall/trk223_ion_dopr/orex_beno_2019_335_2020_001_ion.xml
[INFO] Processing /home/rchen/testdata/orexsmall/trk223_ion_dopr/orex_beno_2020_306_2020_336_ion.xml
[INFO] Processing /home/rchen/testdata/orexsmall/trk223_ion_dopr/orex_beno_2019_213_2019_244_ion.xml
[INFO] Processing /home/rchen/testdata/orexsmall/trk223_ion_dopr/orex_beno_2019_274_2019_305_ion.xml
[INFO] Processing /home/rchen/testdata/orexsmall/trk223_ion_dopr/orex_beno_2021_060_2021_091_ion.xml
[INFO] Processing /home/rchen/testdata/orexsmall/trk223_ion_dopr/orex_beno_2019_032_2019_060_ion.xml
[INFO] Processing /home/rchen/testdata/orexsmall/trk223_ion_dopr/orex_beno_2019_001_2019_032_ion.xml
[INFO] Processing /home/rchen/testdata/orexsmall/trk223_ion_dopr/orex_beno_2018_274_2018_305_ion.xml
[INFO] Processing /home/rchen/testdata/orexsmall/trk223_ion_dopr/orex_beno_2019_091_2019_121_ion.xml
[INFO] Processing /home/rchen/testdata/orexsmall/trk223_ion_dopr/orex_beno_2019_182_2019_213_ion.xml
[INFO] Processing /home/rchen/testdata/orexsmall/trk223_ion_dopr/orex_beno_2020_122_2020_153_ion.xml
[INFO] Processing /home/rchen/testdata/orexsmall/trk223_ion_dopr/orex_beno_2020_092_2020_122_ion.xml
[INFO] Processing /home/rchen/testdata/orexsmall/trk223_ion_dopr/orex_beno_2021_121_2021_152_ion.xml
[INFO] Processing /home/rchen/testdata/orexsmall/trk223_ion_dopr/orex_beno_2020_061_2020_092_ion.xml
[INFO] Processing /home/rchen/testdata/orexsmall/trk223_ion_dopr/orex_beno_2018_335_2019_001_ion.xml
[INFO] Processing /home/rchen/testdata/orexsmall/trk223_ion_dopr/orex_beno_2018_213_2018_244_ion.xml
[INFO] Wrote 50 product(s)
[INFO] Processing /home/rchen/testdata/orexsmall/trk223_ion_dopr/collection_trk223_ion_dopr.xml
[INFO] Wrote 1 collection inventory document(s)
[INFO] Processing /home/rchen/testdata/orexsmall/trk223_ion_dopr/orex_beno_2018_305_2018_335_ion.xml
[INFO] Processing /home/rchen/testdata/orexsmall/trk223_ion_dopr/orex_beno_2020_183_2020_214_ion.xml
[INFO] Processing /home/rchen/testdata/orexsmall/trk223_ion_dopr/orex_beno_2020_032_2020_061_ion.xml
[INFO] Processing /home/rchen/testdata/orexsmall/trk223_ion_dopr/orex_beno_2020_245_2020_275_ion.xml
[INFO] Processing /home/rchen/testdata/orexsmall/trk223_ion_dopr/orex_beno_2021_091_2021_121_ion.xml
[INFO] Processing /home/rchen/testdata/orexsmall/trk223_ion_dopr/orex_beno_2020_275_2020_306_ion.xml
[INFO] Processing /home/rchen/testdata/orexsmall/trk223_ion_dopr/orex_beno_2019_121_2019_152_ion.xml
[INFO] Processing /home/rchen/testdata/orexsmall/trk223_ion_dopr/orex_beno_2020_001_2020_032_ion.xml
[INFO] Processing /home/rchen/testdata/orexsmall/trk223_ion_dopr/orex_beno_2018_244_2018_274_ion.xml
[INFO] Processing /home/rchen/testdata/orexsmall/trk223_ion_dopr/orex_beno_2020_153_2020_183_ion.xml
[INFO] Processing /home/rchen/testdata/orexsmall/document/SIS_NAF018_ORX-SFF_CCv0001.xml
[ERROR] Data file /home/rchen/testdata/orexsmall/document/SIS_NAF018_ORX-SFF_CCv0001.pdf doesn't exist
[INFO] Processing /home/rchen/testdata/orexsmall/document/collection_radioscience_document.xml
[INFO] Wrote 2 collection inventory document(s)
[INFO] Processing /home/rchen/testdata/orexsmall/document/spacecraft_mass_history.xml
[INFO] Updating LDDs.
[INFO] Updating 'pds' LDD. Schema location: https://pds.nasa.gov/pds4/pds/v1/PDS4_PDS_1G00.xsd
[INFO] This LDD already loaded.
[INFO] Updating Elasticsearch schema.
[INFO] Updated 8 fields
[INFO] Processing /home/rchen/testdata/orexsmall/document/antenna_swap_history.xml
[INFO] Processing /home/rchen/testdata/orexsmall/document/radioscience_bundle_information.xml
[ERROR] Data file /home/rchen/testdata/orexsmall/document/radioscience_bundle_information.pdf doesn't exist
[INFO] Processing /home/rchen/testdata/orexsmall/trk234_trknav/collection_radioscience_trk234_traknav.xml
[INFO] Wrote 1 collection inventory document(s)
[INFO] Processing /home/rchen/testdata/orexsmall/trk234_trknav/cruise/orex_beno_2016_259_134529_2016_259_221501_25.xml
[INFO] Updating LDDs.
[INFO] Updating 'pds' LDD. Schema location: https://pds.nasa.gov/pds4/pds/v1/PDS4_PDS_1G00.xsd
[INFO] This LDD already loaded.
[INFO] Updating 'orex' LDD. Schema location: https://pds.nasa.gov/pds4/mission/orex/v1/orex_ldd_OREX_1400.xsd
[INFO] This LDD already loaded.
[INFO] Updating Elasticsearch schema.
[INFO] Updated 17 fields
[INFO] Processing /home/rchen/testdata/orexsmall/trk234_trknav/cruise/orex_beno_2016_258_061550_2016_258_151500_65.xml
[INFO] Processing /home/rchen/testdata/orexsmall/trk234_trknav/cruise/orex_beno_2016_255_224030_2016_256_075000_35.xml
[INFO] Processing /home/rchen/testdata/orexsmall/trk234_trknav/cruise/orex_beno_2016_256_140107_2016_256_220001_25.xml
[INFO] Processing /home/rchen/testdata/orexsmall/trk234_trknav/cruise/orex_beno_2016_254_205513_2016_255_080000_35.xml
[INFO] Processing /home/rchen/testdata/orexsmall/trk234_trknav/cruise/orex_beno_2016_257_134558_2016_257_215501_25.xml
[INFO] Processing /home/rchen/testdata/orexsmall/trk234_trknav/cruise/orex_beno_2016_258_135552_2016_258_213000_26.xml
[INFO] Processing /home/rchen/testdata/orexsmall/trk234_trknav/cruise/orex_beno_2016_259_062111_2016_259_151000_55.xml
[INFO] Processing /home/rchen/testdata/orexsmall/trk234_trknav/cruise/orex_beno_2016_255_141049_2016_256_001000_26.xml
[INFO] Processing /home/rchen/testdata/orexsmall/trk234_trknav/cruise/orex_beno_2016_256_204012_2016_257_075500_35.xml
[INFO] Processing /home/rchen/testdata/orexsmall/trk234_trknav/cruise/orex_beno_2016_257_063552_2016_257_150500_65.xml
[INFO] Processing /home/rchen/testdata/orexsmall/trk234_trknav/cruise/orex_beno_2016_258_201047_2016_259_074000_45.xml
[INFO] Processing /home/rchen/testdata/orexsmall/trk234_trknav/cruise/orex_beno_2016_255_063030_2016_255_154000_54.xml
[INFO] Processing /home/rchen/testdata/orexsmall/trk234_trknav/cruise/orex_beno_2016_257_203533_2016_258_072500_35.xml
[INFO] Processing /home/rchen/testdata/orexsmall/trk234_trknav/cruise/orex_beno_2016_256_063028_2016_256_152452_54.xml
[INFO] Processing /home/rchen/testdata/orexsmall/bundle_orex_radioscience.xml
[INFO] Updating LDDs.
[INFO] Updating 'pds' LDD. Schema location: https://pds.nasa.gov/pds4/pds/v1/PDS4_PDS_1G00.xsd
[INFO] This LDD already loaded.
[INFO] Updating Elasticsearch schema.
[INFO] Updated 2 fields
[INFO] Wrote 81 product(s)
[SUMMARY] Summary:
[SUMMARY] Skipped files: 0
[SUMMARY] Loaded files: 81
[SUMMARY]   Product_Bundle: 1
[SUMMARY]   Product_Collection: 5
[SUMMARY]   Product_Context: 1
[SUMMARY]   Product_Document: 2
[SUMMARY]   Product_Observational: 72
[SUMMARY] Failed files: 2
[SUMMARY] Package ID: 5586893d-14aa-45c2-982e-ec257bff89ee
% 
% 
% 
% cat directories.xml 
<?xml version="1.0" encoding="UTF-8"?>

<!--
  * !!! 'nodeName' is a required attribute. !!!
  * Use one of the following values:
  *     PDS_ATM  - Planetary Data System: Atmospheres Node
  *     PDS_ENG  - Planetary Data System: Engineering Node
  *     PDS_GEO  - Planetary Data System: Geosciences Node
  *     PDS_IMG  - Planetary Data System: Imaging Node
  *     PDS_NAIF - Planetary Data System: NAIF Node
  *     PDS_PPI  - Planetary Data System: Planetary Plasma Interactions Node
  *     PDS_RMS  - Planetary Data System: Rings Node
  *     PDS_SBN  - Planetary Data System: Small Bodies Node at University of Maryland
  *     PSA      - Planetary Science Archive
  *     JAXA     - Japan Aerospace Exploration Agency
-->
<harvest nodeName="PDS_ENG">

  <!-- Registry configuration -->
  <!-- UPDATE with your registry information -->
  <!--registry url="http://localhost:9200" index="registry" auth="/home/pds4/test/auth.txt" /-->
  <!--registry url="http://localhost:9200" index="registry" /-->
  <registry url="https://search-en-prod-di7dor7quy7qwv3husi2wt5tde.us-west-2.es.amazonaws.com:443" index="registry" auth="/home/pds4/test/auth.txt" />

  <directories>
    <!-- Path to one or more directories with PDS4 labels -->
    <path>/home/rchen/testdata</path>
  </directories>

  <!-- 
      NOTE: By default only lid, vid, lidvid, title and product class are exported.
      autogenFields should also be enabled for operational ingestion.

      See documentation for more configuration options: https://nasa-pds.github.io/pds-registry-app/operate/harvest.html
  -->
  <fileInfo processDataFiles="true" storeLabels="true">
    <!-- UPDATE with your own local path and base url where pds4 archive are published -->
    <fileRef replacePrefix="/path/to/archive" with="https://url/to/archive/" />
  </fileInfo>

  <!-- 
     Extract all fields. Field names: <namespace>:<class_name>/<namespace>:<attribute_name>
     NOTE: This should only be disabled for testing purposes
  -->
  <autogenFields/>

</harvest>

rchenatjpl commented 2 years ago

@jimmie @tloubrieu-jpl @c-suh Regarding the last message, I'm hoping a 2-minute glance will show something obvious to one of you. If not, I'll muck around. Thanks
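For a quick scan of a long harvest run like the one above, the `[ERROR]` lines can be pulled out of a saved log. This is only a convenience sketch: the function names and the log file name are hypothetical, and it assumes the output was captured to a file first.

```shell
# Sketch: list and count the [ERROR] lines of a saved harvest log.
harvest_errors() {
    # -F treats the pattern as a fixed string, so the brackets are literal.
    grep -F '[ERROR]' "$1"
}

count_harvest_errors() {
    grep -c -F '[ERROR]' "$1"
}

# Example (harvest.log is a hypothetical file name):
# harvest -c directories.xml 2>&1 | tee harvest.log
# count_harvest_errors harvest.log
# harvest_errors harvest.log
```

In the run above, the two `[ERROR]` lines about missing PDF data files line up with the `Failed files: 2` entry in the summary.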

jimmie commented 2 years ago

@rchenatjpl - how are you querying Opensearch? I'm seeing documents in there.

curl -u pdsadmin -X GET "https://search-en-prod-di7dor7quy7qwv3husi2wt5tde.us-west-2.es.amazonaws.com/registry/_count"

{"count":106,"_shards":{"total":1,"successful":1,"skipped":0,"failed":0}}

rchenatjpl commented 2 years ago

I'm using a browser with URL https://search-en-prod-di7dor7quy7qwv3husi2wt5tde.us-west-2.es.amazonaws.com/registry/_search?q=*&amp;pretty

Hmm, I guess I have the tail end of that URL wrong. I'll search. Thanks

jimmie commented 2 years ago

No, that works for me. What login are you using?

rchenatjpl commented 2 years ago

No login. Oh, I replaced the escaped &amp; with a plain &, and now I see "hits". Thanks, @jimmie
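
For anyone hitting the same thing, the fix above can be reproduced in a few lines of Python. This is only an illustrative sketch, not part of the PDS tooling; the variable names are made up, and the URL is the one quoted earlier in this thread.

```python
from html import unescape
from urllib.parse import urlsplit, parse_qs

# URL exactly as it was pasted, with the ampersand still HTML-escaped.
escaped_url = (
    "https://search-en-prod-di7dor7quy7qwv3husi2wt5tde.us-west-2.es.amazonaws.com"
    "/registry/_search?q=*&amp;pretty"
)

# html.unescape turns "&amp;" back into "&", so the query string
# splits into the two intended parameters (q and pretty).
fixed_url = unescape(escaped_url)

# keep_blank_values=True keeps the value-less "pretty" flag in the result.
params = parse_qs(urlsplit(fixed_url).query, keep_blank_values=True)

print(fixed_url)
print(params)
```

With the escaped form, the server never sees a separate `pretty` parameter, which is presumably why the original request did not return the expected "hits".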

rchenatjpl commented 2 years ago

@tloubrieu-jpl I'm getting back into this, and the configuration confuses me again. From pdscloud-prod1, harvest's config file points to

  registry url="https://search-en-prod-di7dor7quy7qwv3husi2wt5tde.us-west-2.es.amazonaws.com:443" index="registry" auth="/home/pds4/test/auth.txt"

When I re-harvest something I harvested a month ago, I get many 'Skipping registered product...' warnings, as expected, e.g.

  [WARN] Skipping registered product urn:nasa:pds:orex.radioscience::1.0

Now, how do I manipulate the registry? And where? I tried

  % curl --get 'https://pds.nasa.gov/api/search/1.0/products/urn:nasa:pds:orex.radioscience::1.0' --header 'Accept: application/json'
  {"request":"/products/urn:nasa:pds:orex.radioscience::1.0","message":"The lidvid urn:nasa:pds:orex.radioscience was not found"}

So I replaced https://pds.nasa.gov with the URL in the harvest config file

  % curl --get 'https://search-en-prod-di7dor7quy7qwv3husi2wt5tde.us-west-2.es.amazonaws.com:443/api/search/1.0/products/urn:nasa:pds:orex.radioscience::1.0' --header 'Accept: application/json'
  {"error":"no handler found for uri [/api/search/1.0/products/urn:nasa:pds:orex.radioscience::1.0] and method [GET]"}

I'm going to want to view and delete entries. Should I be using these curl commands? Should I also be able to do so using registry-manager commands? If so, how do I point registry-manager to the OpenSearch that harvest uses? Thanks

viviant100 commented 2 years ago

Redirecting the above inquiry to @jimmie.

jimmie commented 2 years ago

You were correct in sending the api/search/1.0 request to pds.nasa.gov - OpenSearch itself (search-en-prod*) doesn't understand that format. I imagine the LIDVID is not found because the archive_status has not been promoted to 'archived', which you need to do using registry-manager.

Registry-manager should use the same endpoint that you used in Harvest and is specified on the command line using the -es switch. Run registry-manager -help to see the available command line options.
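
To make the difference between the two endpoints concrete, here is a rough Python sketch that builds one URL in the API's format and one in OpenSearch's own `_search` format. The function names are hypothetical. The `lidvid` field name is taken from the harvest config comment above ("lid, vid, lidvid, title and product class are exported"); whether a quoted query-string search on it behaves as expected on this cluster is an assumption. The OpenSearch request would still need credentials, as in the `curl -u` example earlier.

```python
from urllib.parse import urlencode

# Both base URLs are taken from the thread above.
API_BASE = "https://pds.nasa.gov/api/search/1.0"
OPENSEARCH_BASE = (
    "https://search-en-prod-di7dor7quy7qwv3husi2wt5tde.us-west-2.es.amazonaws.com"
)


def api_product_url(lidvid: str) -> str:
    """URL in the format the PDS API understands (same shape as the curl above)."""
    return f"{API_BASE}/products/{lidvid}"


def opensearch_lidvid_url(lidvid: str, index: str = "registry") -> str:
    """OpenSearch _search URL using a query-string search on the lidvid field."""
    # Quoting the value keeps its colons from being parsed as field separators.
    query = f'lidvid:"{lidvid}"'
    # urlencode percent-encodes the colons and quotes for use in a URL.
    return f"{OPENSEARCH_BASE}/{index}/_search?" + urlencode(
        {"q": query, "pretty": "true"}
    )


print(api_product_url("urn:nasa:pds:orex.radioscience::1.0"))
print(opensearch_lidvid_url("urn:nasa:pds:orex.radioscience::1.0"))
```

The first URL goes to the API, which applies its own filtering; the second talks to the index directly, which is useful for checking what harvest actually wrote.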

rchenatjpl commented 2 years ago

@jimmie Is https://pds.nasa.gov/api/search/ accessing other registries as well as the one I'm mucking with? I ingested LIDs urn:nasa:pds:asteroid_polarimetric_database*, and before promoting them to 'archived', they already showed up in the URL above. And if I delete all entries from my registry, those LIDs still show, even in a private browser.

jimmie commented 2 years ago

@rchenatjpl - yes, that endpoint accesses all of the registries via OpenSearch cross-cluster search (CCS). Unfortunately, with this architecture there is no way to isolate EN since EN's OpenSearch is the one with CCS enabled.
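
As background, OpenSearch addresses a remote index in a cross-cluster search as `<cluster-alias>:<index>`, and several targets can be combined with commas. The sketch below only illustrates that naming convention; the aliases `node-a` and `node-b` are hypothetical, and it does not show how the PDS API is actually configured.

```python
def ccs_index_expression(local_index: str, remote_aliases: list[str]) -> str:
    """Build a comma-separated search target covering the local index
    plus the same index name on each remote cluster alias."""
    targets = [local_index] + [f"{alias}:{local_index}" for alias in remote_aliases]
    return ",".join(targets)


# Hypothetical remote cluster aliases, for illustration only.
expression = ccs_index_expression("registry", ["node-a", "node-b"])
print(expression)  # registry,node-a:registry,node-b:registry
```

A search against an expression like this returns hits from every listed cluster in one response, which would be consistent with LIDs from other registries showing up in the API results.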

rchenatjpl commented 2 years ago

No problem, thanks, I'm just trying to account for what I'm seeing. @jordanpadams I'm going to create 1 more fake bundle with slightly weird stuff, then I'll call my stuff done unless you want more. ETA tonight. I would use real bundles, but the ones that are workable are already registered.

jordanpadams commented 2 years ago

@rchenatjpl that sounds great! once you get that test data, can you hand it over to @c-suh so she can upload it here: https://pds.nasa.gov/data/pds4/test-data/custom-datasets/

rchenatjpl commented 2 years ago

@c-suh Sorry to stick you with this. pdscloud-prod1://home/pds4/contextSubset.tgz has 2 similar toy bundles, both with PDS4 context products, a bundle, collections, and documents, as requested at the top of this issue. The tester should harvest the first bundle, check that OpenSearch does not contain the LIDs that appear only in the second bundle, then harvest the second bundle and confirm that those LIDs now appear.
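
The check described above boils down to a set difference. A minimal sketch, assuming the tester has already collected the LIDs of each bundle and the LIDs currently in the registry as plain lists; the function names and the LIDs below are hypothetical placeholders, not the contents of contextSubset.tgz.

```python
def lids_unique_to_second(first_bundle_lids, second_bundle_lids):
    """LIDs that appear in the second bundle but not in the first."""
    return set(second_bundle_lids) - set(first_bundle_lids)


def check_step(expected_new_lids, lids_in_registry, should_be_present):
    """After harvesting bundle 1, none of the new LIDs should be present;
    after harvesting bundle 2, all of them should be."""
    found = set(expected_new_lids) & set(lids_in_registry)
    if should_be_present:
        return found == set(expected_new_lids)
    return not found


# Hypothetical LIDs, for illustration only.
first = ["urn:nasa:pds:toy:doc_a", "urn:nasa:pds:toy:doc_b"]
second = ["urn:nasa:pds:toy:doc_a", "urn:nasa:pds:toy:doc_b", "urn:nasa:pds:toy:doc_c"]
new_lids = lids_unique_to_second(first, second)

# State of the registry after each harvest step.
after_first = check_step(new_lids, first, should_be_present=False)
after_second = check_step(new_lids, second, should_be_present=True)
print(new_lids, after_first, after_second)
```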

rchenatjpl commented 2 years ago

I just noticed the PDS3 request. I don't know the innards of how a PDS3 catalog file gets into the current database. I run 'catalog -mingest' to do that. Those calls operationally are buried within ~pds4/catalog/*/update-procedure.txt, which are scripts despite the file extension.

I did successfully ingest /data/pds4/context-pds3, the context products equivalent to the PDS3 catalog files, but that may not be what you want.

c-suh commented 2 years ago

@rchenatjpl, @jordanpadams, and @viviant100 sorry in turn, but it's unclear to me what I'm supposed to do.

jordanpadams commented 2 years ago

@rchenatjpl @c-suh sorry for the confusion here. I lost track of this ticket and it seems like we are headed down a rabbit hole we really didn't want to go.

the premise of this ticket and its parent was to deploy all the tools and actually ingest all of the EN-managed data into the registry, not to test the registry beyond a very simple initial smoke test of "can I communicate with and ingest something". From the comments above, it looks like the smoke test is completed, so the final steps should be for us to get all this data ingested and update our procedures in the future to include this ingestion.

@rchenatjpl can we please:

rchenatjpl commented 2 years ago

Ignore this now, as Jordan and I were typing simultaneously

@c-suh oh my god, I originally named it something else, then renamed it to contextSubset.tgz. Sorry about that. I don't know what's planned for that. Is there value in doing it again now?

jordanpadams commented 2 years ago

@rchenatjpl sorry. just realized you said this:

successfully ingest /data/pds4/context-pds3

great! that is good enough for now. we definitely need to figure out where all that catalog ingestion data goes. especially for updates to the data sets / PDS3 context products. i will create another ticket to investigate there.

jordanpadams commented 2 years ago

also, per @jimmie's comment above for searching the registry for EN data, we should be able to query via our node_id, but I have had some trouble figuring out how to search that field (or any field names really) via a URL endpoint. will get back to you.

jordanpadams commented 2 years ago

@jimmie any ideas on how we could query the registry by node id?

jimmie commented 2 years ago

node is given by the term ops:Harvest_Info/ops:node_name. I thought the API provides the ability to add a 'query' as a URI query parameter that could include a specific value for node_name

jordanpadams commented 2 years ago

@jimmie copy. the PDS API definitely supports this, I just haven't quite been able to figure out the appropriate syntax/escaping in order to make the query work via curl. I will keep poking.
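
One way to sidestep the escaping problem is to let a library build the URL. The sketch below targets OpenSearch's own query-string syntax on the `_count` endpoint, not the PDS API's `q` parameter, whose exact syntax is not covered here. In query-string syntax, `:` and `/` are reserved, so they need a backslash when they appear inside a field name; the whole query then needs percent-encoding. The function names are hypothetical, `PDS_ENG` is the nodeName from the harvest config above, and the request would still need credentials.

```python
from urllib.parse import urlencode

OPENSEARCH_BASE = (
    "https://search-en-prod-di7dor7quy7qwv3husi2wt5tde.us-west-2.es.amazonaws.com"
)


def escape_field_name(field: str) -> str:
    """Backslash-escape ':' and '/' so the field name is not parsed as query syntax."""
    return field.replace(":", "\\:").replace("/", "\\/")


def node_count_url(node_name: str, index: str = "registry") -> str:
    """Build a _count URL that filters on ops:Harvest_Info/ops:node_name."""
    field = escape_field_name("ops:Harvest_Info/ops:node_name")
    query = f'{field}:"{node_name}"'
    # urlencode percent-encodes backslashes, colons, slashes and quotes.
    return f"{OPENSEARCH_BASE}/{index}/_count?" + urlencode({"q": query})


print(node_count_url("PDS_ENG"))
```

Passing the resulting URL to curl inside single quotes avoids any further shell escaping.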

rchenatjpl commented 2 years ago

@jordanpadams @viviant100 Regarding "PDS3 Context Products" in the Description at the very top of this ticket, do you want the PDS3 context products ingested? Currently, they're hidden outside EN presumably because we don't want anyone to reference any of them. These are the ones that begin urn:nasa:pds:context_pds3:..., and they're sitting in /data/pds4/1700/PDS3_context_bundle_20161220/. Or do you have something else in mind for "PDS3 Context Products"? I'll assume I shouldn't ingest them, but if you want me to, I'll do it, though I'll have to create a bundle.xml and collection*.xml