Closed jordanpadams closed 11 months ago
Script is written and working locally. Need to confirm details of
and then tested.
We will have the script deployed on AWS gamma and one of the prodX EC2 instances.
Run it in a cron job; we'll set the frequency, for example weekly. Beware: Catherine experienced weird results when running the script on Friday at 6pm PDT.
The script should update an existing sitemap so that the output is a complete sitemap.
I propose to have the script managed in a specific repository or in a repository containing other scripts applied to the portal.
We will wait for @jordanpadams to validate that option.
After breakout discussion today:
The script will be run manually when needed by the operation team, every month.
The script can be archived in the current repository and documented on the internal wiki.
Run with a cron job. However, compare the new number of results to the current number of results. If new < current, postpone the job for 24 hours or some such.
Additionally, store a copy of the current sitemap alongside the script and use that as a base; otherwise we will be appending this output to the previous output and so on.
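A minimal sketch of the two safeguards above, assuming the base copy is a standard sitemaps.org file. The function names, `base_path`, and the `(url, lastmod)` tuple format are hypothetical and are not taken from the actual script.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def should_postpone(new_num_found, last_num_found):
    """Return True when the new result count dropped below the last one."""
    return new_num_found < last_num_found


def build_complete_sitemap(base_path, entries, out_path):
    """Write base sitemap plus entries to out_path, leaving the base untouched.

    entries is an iterable of (url, lastmod) tuples; lastmod may be None.
    """
    ET.register_namespace("", SITEMAP_NS)
    tree = ET.parse(base_path)  # always start from the stored base copy
    root = tree.getroot()
    for url, lastmod in entries:
        url_el = ET.SubElement(root, f"{{{SITEMAP_NS}}}url")
        ET.SubElement(url_el, f"{{{SITEMAP_NS}}}loc").text = url
        if lastmod:
            ET.SubElement(url_el, f"{{{SITEMAP_NS}}}lastmod").text = lastmod
    tree.write(out_path, encoding="utf-8", xml_declaration=True)
```

Because each run reads the base copy and writes to a separate output file, running it repeatedly should give the same result rather than a sitemap that keeps growing.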
status: integration ongoing
status: looking at adding GitHub integration portion and pds-github-util
have github integration working locally (did not use pds-github-util). setting up virtual environment on gamma to test this.
@c-suh is testing/deploying the script on pds-gamma
looking into error with gitpython. Hopeful to get this done tomorrow but not optimistic.
Question for everyone on this ticket: there's mention of "github integration" and in the current code on pdscloud-gamma I see a function do_git_stuff(). What's the goal of this? Do we want to check the sitemap.xml into a repository?
Also question for @c-suh: does this code exist only on pdscloud-gamma right now? Is there a separate repository for it? Should it be part of "operations"?
Thanks in advance!
Update: currently blocked by DSIO-4051 (fixed!) DSIO-4059 (fixed!)
@nutjob4life hi! Yes, the "github integration" and placeholder function do_git_stuff() is to check the updated sitemap.xml into the website repository. And yes, the code is currently only on pdscloud-gamma. There is no separate repository for it nor an intention to put into the "operations" repository; rather, it was going to be posted to this section along with any other documentation for this script (which I can do; just let me know).
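For anyone following along, a rough sketch of what "check the updated sitemap.xml into the website repository" could look like. This is not the actual do_git_stuff(): it shells out to the git CLI rather than using gitpython, and the function name, the commit message, and the injectable `run` parameter are all hypothetical.

```python
import subprocess


def commit_sitemap_if_changed(repo_dir, filename="sitemap.xml", run=subprocess.run):
    """Commit and push filename only when git reports it as changed."""
    status = run(
        ["git", "status", "--porcelain", "--", filename],
        cwd=repo_dir, capture_output=True, text=True, check=True,
    )
    if not status.stdout.strip():
        return False  # nothing changed, so nothing to commit
    run(["git", "add", "--", filename], cwd=repo_dir, check=True)
    run(["git", "commit", "-m", f"Update {filename}"], cwd=repo_dir, check=True)
    run(["git", "push"], cwd=repo_dir, check=True)
    return True
```

The `run` argument is only there so the logic can be exercised without touching a real repository.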
@c-suh okay couple more questions, bear with me! I'm trying to figure out some motivations here.
First: why does the sitemap.xml go into a git repository?
To me, this feels like committing the results of an ephemeral database query into something that's really meant for software or configuration, not data. Code changes deliberately over time, but data changes willy-nilly. I don't see the need to track this in git.
What about a mechanism like XInclude? The sitemap.xml could essentially be generated from a template (which is checked into git) which has a placeholder for "put the product URLs here". (Heck, sitemap.xml could be a named pipe that's generated on the fly! 😄)
Second question: since this script is indeed code, I think it ought to be in git (JPL Enterprise GitHub or public GitHub). What's the motivation for having it as an attachment on the wiki instead?
I apologize if these seem like dumb questions. The "operations" side of PDS-EN is new to me, so I appreciate your helping me figure these out!
Answers posted privately at https://jpl.slack.com/archives/D05E0RKEL8K/p1687799043772159 👍
Further background at: https://jpl.slack.com/archives/C05DH31C95E/p1686874407444169
Okay @c-suh, could you (and anyone else who's interested) review the changes I made in do_git_stuff? I don't want to make a pull request since it's potentially sensitive. The file is on gamma in the expected location.
If that all looks okay, we can then attach the code to the wiki page.
@nutjob4life looks great; ty! I edited the existing code to match a couple of your practices (e.g. the logger), and something went wonky with the existing venv, so I recreated it. If you would verify these, please, and then I'll clean up the old one, and we can finish up with documentation and the weekly cron job. If you've no preference, would you handle the latter (cron job), and I will handle the former (documentation)?
@c-suh works great!
(base) $ date
Wed Jun 28 13:53:47 PDT 2023
(base) $ cd sitemap-script
(base) $ . bin/activate
(sitemap-script) (base) $ python main.py
INFO:__main__:=== Starting update of sitemap.xml with ds-view pages on Wed Jun 28, 2023 at 13:53:58
INFO:__main__:This run's numFound is 15993
INFO:__main__:Last run's numFound is 15993
INFO:__main__:Repository's sitemap hasn't changed since last run. Continuing from the existing local, partial sitemap.
INFO:__main__:Appending results to make a complete sitemap
INFO:__main__:Moving complete sitemap to repository
INFO:__main__:The sitemap.xml hasn't been updated this time, so no changes to commit
(sitemap-script) (base) $ echo 🎉
🎉
(sitemap-script) (base) $
I updated the pds4 crontab to run the script weekly.
@nutjob4life ty! I tweaked the job to source the venv and to save the output. If you wouldn't mind reviewing the documentation once I have that done (likely tomorrow morning), I will let you know once it's ready.
Looks fine. You don't need to activate the venv if you're using its python executable directly, but it doesn't hurt.
Note: moved this to the other user account because it created an issue with pulling from the repo (since settled by the SAs here). Otherwise seems to be working well but will wait to see the next weekly run.
to discuss with @c-suh about adding this script to the repo
@c-suh @jordanpadams since there's sensitive info in this code, could we put it into a private repo? For more protection, it could be a private repo on Enterprise GitHub, rather than here. Thoughts?
@nutjob4life @c-suh what kind of private information is this? And is there a way for us to use environment variables on the server instead, so that this information doesn't have to be included in the software?
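To illustrate the environment-variable idea, here is a small sketch. The thread doesn't say what the sensitive values are, so the variable names below are placeholders and `load_settings` is a hypothetical helper.

```python
import os


def load_settings(env=os.environ):
    """Read sensitive settings from the environment, failing loudly if absent."""
    required = ["SITEMAP_REPO_URL", "SITEMAP_GIT_TOKEN"]  # hypothetical names
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: env[name] for name in required}
```

With something like this, the values would be set for the account that runs the cron job and the code itself would carry no secrets.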
Addressing points from slack conversation and a few more (will also add these to the PR):
💡 Description
Once https://github.com/NASA-PDS/portal-tasks/issues/65 is completed, we should augment the output file with URLs for all PDS data set landing pages
Super simple script to paginate through all the results from https://pds.nasa.gov/services/search/search?wt=json&q=product_class:Product_Collection%20OR%20product_class:Product_Bundle%20OR%20product_class:Product_Data_Set_PDS3%20OR%20product_class:Product_Document&fl=resLocation,modification_date&start=0&rows=1000
numFound to keep track of total number of results
start and rows to page through the results
resLocation for URL to data; use the latest modification_date (can be more than 1, and will require conversion to YYYY-MM-DD)
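The paging described above could be sketched roughly as follows. It assumes the service returns Solr-style JSON (a `response` object containing `numFound` and `docs`) and that `modification_date` values are ISO-8601 strings, so the first ten characters give YYYY-MM-DD; neither is confirmed here. The helper names are hypothetical.

```python
import json
import urllib.parse
import urllib.request

SEARCH_URL = "https://pds.nasa.gov/services/search/search"
QUERY = ("product_class:Product_Collection OR product_class:Product_Bundle "
         "OR product_class:Product_Data_Set_PDS3 OR product_class:Product_Document")


def fetch_page(start, rows):
    """Fetch one page of results as parsed JSON."""
    params = urllib.parse.urlencode({
        "wt": "json", "q": QUERY,
        "fl": "resLocation,modification_date",
        "start": start, "rows": rows,
    })
    with urllib.request.urlopen(f"{SEARCH_URL}?{params}") as resp:
        return json.load(resp)


def latest_date(value):
    """Return the latest modification date as YYYY-MM-DD, or None."""
    if not value:
        return None
    dates = value if isinstance(value, list) else [value]
    # Assumes ISO-8601 strings, which sort correctly as plain text
    return max(str(d)[:10] for d in dates)


def iter_entries(fetch=fetch_page, rows=1000):
    """Yield (resLocation, lastmod) for every result, page by page."""
    start = 0
    while True:
        response = fetch(start, rows)["response"]
        docs = response["docs"]
        for doc in docs:
            yield doc.get("resLocation"), latest_date(doc.get("modification_date"))
        start += rows
        if start >= response["numFound"] or not docs:
            break
```

Passing a different `fetch` function makes it possible to try the paging logic against canned data before pointing it at the live service.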