NASA-PDS / operations

Tickets for the PDSEN Operations Team

Develop script to enhance sitemap with data set landing page URLs #378

Closed jordanpadams closed 11 months ago

jordanpadams commented 1 year ago

💡 Description

Once https://github.com/NASA-PDS/portal-tasks/issues/65 is completed, we should augment the output file with URLs for all PDS data set landing pages

Super simple script to paginate through all the results from https://pds.nasa.gov/services/search/search?wt=json&q=product_class:Product_Collection%20OR%20product_class:Product_Bundle%20OR%20product_class:Product_Data_Set_PDS3%20OR%20product_class:Product_Document&fl=resLocation,modification_date&start=0&rows=1000

  <url>
    <loc>https://pds.nasa.gov/ds-view/pds/viewDocument.jsp?identifier=urn%3Anasa%3Apds%3Asystem_bundle%3Adocument_pds4_standards%3Adph_1.17.0&version=1.0</loc>
    <lastmod>2021-10-14</lastmod>
  </url>
  <url>
    <loc>https://pds.nasa.gov/ds-view/pds/viewDocument.jsp?identifier=urn%3Anasa%3Apds%3Amisc%3Adocument_cassini%3Apds3_titan_ion_ug_itar_mar2012&version=1.0</loc>
    <lastmod>2021-04-20</lastmod>
  </url>
  ...
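A minimal sketch of that pagination, not the deployed script: it walks the search endpoint above in pages of 1000 and yields <url> entries like the example output. The first() helper and the assumption that resLocation / modification_date can come back as lists are illustrative, not confirmed by this thread.

import json
import urllib.request
from xml.sax.saxutils import escape

QUERY = ("https://pds.nasa.gov/services/search/search?wt=json"
         "&q=product_class:Product_Collection%20OR%20product_class:Product_Bundle"
         "%20OR%20product_class:Product_Data_Set_PDS3%20OR%20product_class:Product_Document"
         "&fl=resLocation,modification_date")
ROWS = 1000

def first(value):
    # Solr fields may be single- or multi-valued; normalize to one value
    return value[0] if isinstance(value, list) else value

def iter_docs():
    start = 0
    while True:
        with urllib.request.urlopen(f"{QUERY}&start={start}&rows={ROWS}") as resp:
            docs = json.load(resp)["response"]["docs"]
        if not docs:
            break
        yield from docs
        start += ROWS

def url_entries():
    for doc in iter_docs():
        loc = escape(first(doc.get("resLocation", "")))
        lastmod = first(doc.get("modification_date", ""))[:10]  # keep just YYYY-MM-DD
        yield f"  <url>\n    <loc>{loc}</loc>\n    <lastmod>{lastmod}</lastmod>\n  </url>"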
c-suh commented 1 year ago

Script is written and working locally. Need to confirm details of

  1. where to put this script
  2. when to run it (frequency and actual timedate)
  3. where to output this partial XML file (e.g. should it be emailed?)

and then tested.

tloubrieu-jpl commented 1 year ago

We will have the script deployed on AWS gamma and one of the prodX EC2 instances.

Run it as a cron job; we'll set the frequency, for example weekly. Note that Catherine experienced weird results when running the script on Friday at 6pm PDT.

The script should update an existing sitemap so that the output is a complete sitemap.

I propose to have the script managed in a dedicated repository or in a repository containing other scripts applied to the portal.

We will wait for @jordanpadams to validate that option.

tloubrieu-jpl commented 1 year ago

After breakout discussion today:

The script will be run manually by the operations team when needed, approximately every month.

The script can be archived in the current repository and documented on the internal wiki.

c-suh commented 1 year ago

Run with a cron job.* However, compare the new number of results to the current number of results; if new < current, postpone the job for 24 hours or so.

Additionally, store a copy of the current sitemap alongside the script and use that as a base; otherwise we will be appending this output to the previous output, and so on.
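A rough sketch of that safeguard, assuming the count is read as numFound from the same search endpoint; the rows=0 query and the last_numfound.txt file name are illustrative choices, not the actual script's.

import json
import pathlib
import urllib.request

COUNT_QUERY = ("https://pds.nasa.gov/services/search/search?wt=json"
               "&q=product_class:Product_Collection%20OR%20product_class:Product_Bundle"
               "%20OR%20product_class:Product_Data_Set_PDS3%20OR%20product_class:Product_Document"
               "&rows=0")
LAST_COUNT_FILE = pathlib.Path("last_numfound.txt")

def safe_to_run():
    with urllib.request.urlopen(COUNT_QUERY) as resp:
        new_count = json.load(resp)["response"]["numFound"]
    current_count = int(LAST_COUNT_FILE.read_text()) if LAST_COUNT_FILE.exists() else 0
    if new_count < current_count:
        # Fewer products than last time likely means the registry is mid-update;
        # skip now and let a retry (e.g. 24 hours later) pick it up.
        return False
    LAST_COUNT_FILE.write_text(str(new_count))
    return True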

jordanpadams commented 1 year ago

status: integration ongoing

jordanpadams commented 1 year ago

status: looking at adding GitHub integration portion and pds-github-util

c-suh commented 1 year ago

Have GitHub integration working locally (did not use pds-github-util). Setting up a virtual environment on gamma to test this.

tloubrieu-jpl commented 1 year ago

@c-suh is testing/deploying the script on pds-gamma

c-suh commented 1 year ago

Looking into an error with gitpython. Hoping to get this done tomorrow, but not optimistic.

nutjob4life commented 1 year ago

Question for everyone on this ticket: there's mention of "github integration" and in the current code on pdscloud-gamma I see a function do_git_stuff().

What's the goal of this? Do we want to check the sitemap.xml into a repository?

Also question for @c-suh: does this code exist only on pdscloud-gamma right now? Is there a separate repository for it? Should it be part of "operations"?

Thanks in advance!

nutjob4life commented 1 year ago

Update: currently blocked by DSIO-4051 (fixed!) and DSIO-4059 (fixed!)

c-suh commented 1 year ago

@nutjob4life hi! Yes, the "github integration" and the placeholder function do_git_stuff() are for checking the updated sitemap.xml into the website repository. And yes, the code is currently only on pdscloud-gamma. There is no separate repository for it, nor an intention to put it into the "operations" repository; rather, it was going to be posted to this section along with any other documentation for this script (which I can do; just let me know).
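For illustration only, a hedged sketch of what a do_git_stuff()-style step could look like using GitPython (the library mentioned earlier in this thread); the repository path, staged file name, and commit message are assumptions, not the code deployed on pdscloud-gamma.

from git import Repo

def commit_sitemap(repo_path, sitemap_rel_path="sitemap.xml"):
    repo = Repo(repo_path)
    repo.remotes.origin.pull()          # start from the latest website content
    repo.index.add([sitemap_rel_path])  # stage the regenerated sitemap
    if repo.is_dirty(index=True, working_tree=False):
        repo.index.commit("Update sitemap.xml with ds-view landing page URLs")
        repo.remotes.origin.push()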

nutjob4life commented 1 year ago

@c-suh okay couple more questions, bear with me! I'm trying to figure out some motivations here.

First: why does the sitemap.xml go into a git repository?

To me, this feels like committing the results of an ephemeral database query into something that's really meant for software or configuration, not data. Code changes deliberately over time, but data changes willy-nilly. I don't see the need to track this in git.

What about a mechanism like XInclude? The sitemap.xml could essentially be generated from a template (which is checked into git) which has a placeholder for "put the product URLs here". (Heck, sitemap.xml could be a named pipe that's generated on the fly! 😄)
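(A minimal sketch of that XInclude idea, assuming lxml is available: a template checked into git XIncludes a generated fragment holding the product URLs. The file names are hypothetical.

from lxml import etree

tree = etree.parse("sitemap-template.xml")  # template checked into git, with an xi:include placeholder
tree.xinclude()                             # pulls in the generated ds-view URL fragment
tree.write("sitemap.xml", xml_declaration=True, encoding="UTF-8", pretty_print=True)

)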

Second question: since this script is indeed code, I think it ought to be in git (JPL Enterprise GitHub or public GitHub). What's the motivation for having it as an attachment on the wiki instead?

I apologize if these seem like dumb questions. The "operations" side of PDS-EN is new to me, so I appreciate your helping me figure these out!

Answers posted privately at https://jpl.slack.com/archives/D05E0RKEL8K/p1687799043772159 👍

Further background at: https://jpl.slack.com/archives/C05DH31C95E/p1686874407444169

nutjob4life commented 1 year ago

Okay @c-suh, could you (and anyone else who's interested) review the changes I made in do_git_stuff? I don't want to make a pull request since it's potentially sensitive. The file is on gamma in the expected location.

If that all looks okay, we can then attach the code to the wiki page.

c-suh commented 1 year ago

@nutjob4life looks great; ty! I edited the existing code to match a couple of your practices (e.g. the logger), and something went wonky with the existing venv, so I recreated it. If you would verify these, please, and then I'll clean up the old one, and we can finish up with documentation and the weekly cron job. If you've no preference, would you handle the latter (cron job), and I will handle the former (documentation)?

nutjob4life commented 1 year ago

@c-suh works great!

(base) $ date
Wed Jun 28 13:53:47 PDT 2023
(base) $ cd sitemap-script
(base) $ . bin/activate
(sitemap-script) (base) $ python main.py
INFO:__main__:=== Starting update of sitemap.xml with ds-view pages on Wed Jun 28, 2023 at 13:53:58
INFO:__main__:This run's numFound is 15993
INFO:__main__:Last run's numFound is 15993
INFO:__main__:Repository's sitemap hasn't changed since last run. Continuing from the existing local, partial sitemap.
INFO:__main__:Appending results to make a complete sitemap
INFO:__main__:Moving complete sitemap to repository
INFO:__main__:The sitemap.xml hasn't been updated this time, so no changes to commit
(sitemap-script) (base) $ echo 🎉
🎉
(sitemap-script) (base) $ 

I updated the pds4 crontab to run the script weekly.

c-suh commented 1 year ago

@nutjob4life ty! I tweaked the job to source the venv and to save the output. If you wouldn't mind reviewing the documentation once I have that done (likely tomorrow morning), I will let you know once it's ready.

nutjob4life commented 1 year ago

Looks fine. You don't need to activate the venv if you're using its python executable directly, but it doesn't hurt.

c-suh commented 12 months ago

Note: moved this to the other user account because it created an issue with pulling from the repo (since settled by the SAs here). Otherwise it seems to be working well, but I will wait to see the next weekly run.

jordanpadams commented 11 months ago

to discuss with @c-suh about adding this script to the repo

nutjob4life commented 11 months ago

@c-suh @jordanpadams since there's sensitive info in this code, could we put it into a private repo? For more protection, it could be a private repo on Enterprise GitHub, rather than here. Thoughts?

jordanpadams commented 11 months ago

@nutjob4life @c-suh what kind of private information is this? and is there a way for us to use environment variables on the server instead for some of this information to avoid including it in the software?
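As one possible shape of that environment-variable approach, anything sensitive (a token, a server path) could be read at runtime instead of being hard-coded; the variable names below are hypothetical.

import os

GITHUB_TOKEN = os.environ["SITEMAP_GITHUB_TOKEN"]  # fails loudly if unset on the server
WEBSITE_REPO_PATH = os.environ.get("SITEMAP_REPO_PATH", "/path/to/website-repo")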

c-suh commented 11 months ago

Addressing points from slack conversation and a few more (will also add these to the PR):