WGBH-MLA / mla-tickets

Internal tickets from MLA staff seeking assistance from software development and database staff
Another Drupal #59

Closed ekemeyer closed 1 month ago

ekemeyer commented 1 month ago

Details

Welcome back, Kevin! Hope you had a great vacation. Guess what? Another station with a Drupal database that can't export its catalog. :melting_face: This one should be a lot easier, as there aren't that many records and the people at the station are well organized. As our resident expert on Drupal databases, would you be up for meeting with the team to brainstorm a solution, please? One of the team members suggested using a web crawler - but there has to be a better way.

Submitted by: Rochelle
CC in communications:
Priority: Medium (within this month)
URL:
Slack message thread: I have an email I can forward.

foo4thought commented 1 month ago

Many emails and DMs later, I decided to capture their metadata via screen scraping.

It's a Drupal site that didn't provide any of the JSON output I expected, but it did retain the basic /node/ URL structure, so I used that to collect the URLs of all their "articles" and then drilled into those pages to find the structures they have in common. In the end, I used lynx dumps to scrape the URLs and metadata into files, organized in the folders you'll see in the ZIP archive here. I explained everything to Rochelle and gave her the ZIP archive plus a list of 48 items that failed to yield the expected results; she supposed that those items weren't yet digitized...
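For anyone who hits the next Drupal site like this, here's a rough sketch of the same idea in Python. To be clear, this is not what I actually ran (that was lynx dumps), and every name in it (`PageParser`, `node_urls`, `split_by_title`, the example.org URLs) is made up for illustration. It assumes article links look like /node/<number> and treats a page with no `<title>` as a failed item, which is only a stand-in for whatever fields you actually expect to find.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Assumption: article pages live at /node/<numeric id>, optionally with a trailing slash.
NODE_PATH = re.compile(r"^/node/\d+/?$")


class PageParser(HTMLParser):
    """Collects link targets and the <title> text from one HTML page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def parse_page(html):
    """Run the parser over one page and return it with its collected data."""
    parser = PageParser()
    parser.feed(html)
    parser.close()
    return parser


def node_urls(listing_html, base_url):
    """Return absolute /node/<id> URLs found on a page, in order, without duplicates."""
    urls = []
    for href in parse_page(listing_html).hrefs:
        absolute = urljoin(base_url, href)
        if NODE_PATH.match(urlparse(absolute).path) and absolute not in urls:
            urls.append(absolute)
    return urls


def split_by_title(pages):
    """Split {url: html} into scraped records and a list of URLs that yielded nothing.

    A missing <title> is a hypothetical failure test; swap in the real fields.
    """
    records, failed = {}, []
    for url, html in pages.items():
        title = parse_page(html).title.strip()
        if title:
            records[url] = {"title": title}
        else:
            failed.append(url)
    return records, failed
```

You'd fetch the pages however you like (lynx, curl, urllib) and pass the HTML in; keeping the fetch separate makes it easy to re-run the parsing on saved dumps and to hand back the list of failures.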

wpr.zip