amnh-library / API-Portal

AMNH Library API
http://api-dev.library.amnh.org/api/v1/people

Weekly library data scraping #9

Open mik3caprio opened 7 years ago

mik3caprio commented 7 years ago

Set up cron in dev for scraping content - Scripts should fire off WEEKLY on weekends

mik3caprio commented 7 years ago

Now that the dev deployment is complete, we should make this the next thing to put in place. @pdelong42 would you like to take a crack at this? The requirements are basically to run 'python scrape.py' for each of the ElasticSearch indexes within crontab. I think the only modification required for each of the scraper scripts would be to check for an existing index first, remove it if it exists, then run the rest of the script as normal (I can add that code to the existing scripts, you just need to set up the crontab).
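The "check for an existing index first, remove it if it exists" step could be sketched roughly as below. This is only an illustration, not code from the existing scrapers: it assumes Python 3, an Elasticsearch node reachable over plain HTTP, and a hypothetical helper name (delete_index_if_exists). It uses only the standard REST endpoints HEAD /<index> and DELETE /<index>.

```python
import urllib.error
import urllib.request


def delete_index_if_exists(base_url, index, urlopen=urllib.request.urlopen):
    """Delete `index` over the Elasticsearch REST API; return True if one was removed.

    `base_url` and `index` are supplied by the caller; nothing here is taken
    from the real scraper scripts.
    """
    url = f"{base_url.rstrip('/')}/{index}"
    try:
        # HEAD /<index> answers 200 when the index exists and 404 when it does not.
        with urlopen(urllib.request.Request(url, method="HEAD")):
            pass
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return False  # nothing to remove, so the scrape starts from a clean slate
        raise  # any other HTTP error is unexpected and should stop the run
    # DELETE /<index> drops the index so the scraper can rebuild it.
    with urlopen(urllib.request.Request(url, method="DELETE")):
        pass
    return True
```

Each scraper could call this once at the top, for example delete_index_if_exists("http://localhost:9200", "dspace"), where both the URL and the index name are placeholders for whatever the dev box actually uses.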

pdelong42 commented 7 years ago

@mik3caprio Sure, just give me the path to the scraping script, as well as the way it ought to be called, and I'll drop it into a crontab.

mik3caprio commented 7 years ago

So there are four sets of two scripts, one set for each Library system. The path is /home/apiproject/API-Portal/scrape/ and then the directories containing the Python scripts are dspace, omeka, sierra, and xeac. In each directory there is a scrape.py and a search.py. You would just need to run python scrape.py and python search.py for each Library system, and have them run weekly.

The only other thing in question is what we would do to delete the indexes from ElasticSearch before scraping and re-indexing. I'm assuming the cron would have another CLI command to remove ElasticSearch indices relating to the system first. In other words:

[ES CLIs to delete dspace* indices]
python dspace/scrape.py
python dspace/search.py

And so on.

I think we could/should set this cron up but not turn it on just yet.
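One way to keep the crontab to a single line would be a small wrapper that walks the four directories in order. The sketch below is an assumption-laden illustration: the wrapper itself, its function names, and the choice to run each script from inside its own directory are not from the repo, and the index-deletion step is left out because the ES commands are not settled yet.

```python
import subprocess
from pathlib import Path

# Path and directory names as given in this thread.
SCRAPE_ROOT = Path("/home/apiproject/API-Portal/scrape")
SYSTEMS = ("dspace", "omeka", "sierra", "xeac")
SCRIPTS = ("scrape.py", "search.py")  # scrape first, then search


def build_commands(root=SCRAPE_ROOT, systems=SYSTEMS, python="python"):
    """Return one (working_directory, argv) pair per script, in run order."""
    return [(root / system, [python, script])
            for system in systems
            for script in SCRIPTS]


def run_all(commands, run=subprocess.run):
    """Run each command in its own directory and collect the ones that fail."""
    failures = []
    for cwd, argv in commands:
        result = run(argv, cwd=cwd)
        if result.returncode != 0:
            failures.append((cwd, argv))
    return failures


if __name__ == "__main__":
    failed = run_all(build_commands())
    for cwd, argv in failed:
        print("FAILED:", cwd, " ".join(argv))
    raise SystemExit(1 if failed else 0)
```

If this were saved under a hypothetical name such as run_weekly.py, a crontab line with the schedule fields 0 3 * * 6 would fire it at 03:00 every Saturday. Leaving that line commented out would match the "set up but not turned on" state.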

mik3caprio commented 7 years ago

Hey @pdelong42 just confirm with me that you've got this set up and I'll close out this ticket.

pdelong42 commented 7 years ago

@mik3caprio, I tried running those scripts while logged-in as the "apiproject" user, but it threw some errors about missing python modules. Try it in dev to see what I mean.

Are these the same scripts that were used to populate the initial data set into Elasticsearch in the first place?

pdelong42 commented 7 years ago

Sorry, I closed it by mistake. Wrong button, oops...

mik3caprio commented 7 years ago

Ah yes excellent point... I ran them originally from my mcaprio account so I must have only installed modules there to run it! I'll get that sorted out.

Mike

pdelong42 commented 7 years ago

Okay, but let's install as many of these Python modules as RPMs, whenever they're available, and only grab from pip as needed.

Let me know the names of the modules that are missing, and I'll make my best effort to find and install RPM packages of them from reputable sources.
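A quick way to collect that list of names, rather than chasing ImportErrors one at a time, might look like the sketch below. It is a hypothetical helper, not part of the repo, and it assumes the scripts parse under the same Python interpreter that runs the check. It reads a script's import statements and reports the top-level modules the current user's interpreter cannot find.

```python
import ast
import importlib.util


def imported_modules(source):
    """Return the top-level module names imported by a Python source string."""
    names = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            # level == 0 skips relative imports such as "from . import x"
            names.add(node.module.split(".")[0])
    return names


def missing_modules(source):
    """Return the imported modules that the current interpreter cannot find."""
    return sorted(name for name in imported_modules(source)
                  if importlib.util.find_spec(name) is None)
```

Run as the "apiproject" user against the text of each scrape.py and search.py, this would give the names to look up as RPMs. Modules that live next to the script may show up as missing unless the check is run from that directory.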