CDRH / chesnutt

Rails site code for the Charles W. Chesnutt Archive https://chesnuttarchive.org

pull rails view content into data_chesnutt repository #76

Open karindalziel opened 3 years ago

karindalziel commented 3 years ago

Similar to what was done with Family Letters. This will get these items into the search.

I am marking this as "post launch" for now but we can move it up if need be.

karindalziel commented 2 years ago

Example of web scrape script:

https://github.com/CDRH/data_family_letters/blob/dev/scripts/overrides/webs_to_es.rb

Also see the "scrape website" method: https://github.com/CDRH/data_family_letters/blob/dev/scripts/overrides/data_manager.rb#L88

The content is scraped into a folder in source like this: https://github.com/CDRH/data_family_letters/tree/dev/source/webs

There is some configuration in the public config https://github.com/CDRH/data_family_letters/blob/dev/config/public.yml#L9

What I can't figure out is how it knows what pages to scrape.

There is no documentation (or code?) for this in Datura yet. This is the only mention of "scrape" I can find:

https://github.com/CDRH/datura/blob/29f79ef29e414637a7c38b2991f56129c63f13d3/lib/datura/data_manager.rb#L216

List of pages we will want to scrape for Chesnutt:

karindalziel commented 2 years ago

Update: found out how it knows what to scrape. See https://cdrhdev1.unl.edu/family_letters/content_pages

Looking at this commit will probably be instructive:

https://github.com/CDRH/data_family_letters/commit/784789966fa78e09a1baa37549fd511cc03e8729

Here's the controller that creates the page the scraper gets its list of pages from: https://github.com/CDRH/family_letters/blob/4abe6198fd111f37da03e6b11d1c5abcee95add8/app/controllers/general_override.rb#L19
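
For anyone picking this up, here is a rough sketch of the idea as I understand it: the Rails site exposes a listing page, and the scraper walks the links on it and saves each page under source/webs. This is not the Datura or data_family_letters code. The helper names `page_urls` and `output_filename`, the example.org URLs, and the assumption that the listing is plain HTML with `<a>` links are all made up for illustration, so check the real output of content_pages before relying on any of it.

```ruby
require "uri"

# Hypothetical helper: collect the href of every <a> tag in a listing page
# and resolve each one against the site's base URL. A regex keeps the sketch
# dependency-free; a real script would more likely use an HTML parser.
def page_urls(listing_html, base_url)
  listing_html.scan(/<a\s[^>]*href=["']([^"']+)["']/i).flatten.map do |href|
    URI.join(base_url, href).to_s
  end.uniq
end

# Hypothetical helper: turn a page URL into a file name for source/webs.
# The naming scheme here is an assumption, not the one family_letters uses.
def output_filename(url)
  slug = URI.parse(url).path.split("/").reject(&:empty?).join("_")
  slug = "index" if slug.empty?
  "#{slug}.html"
end

# Stand-in for the HTML a content_pages action might render.
listing = <<~HTML
  <ul>
    <li><a href="/about">About</a></li>
    <li><a href="/about/editorial">Editorial Policy</a></li>
  </ul>
HTML

# Show which file each listed page would be written to.
page_urls(listing, "https://example.org").each do |url|
  puts "#{url} -> source/webs/#{output_filename(url)}"
end
```

The point of keeping the page list on the Rails side is that the data repo does not need its own hard-coded list: adding a view to the listing is enough for the next scrape to pick it up.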