WGBH-MLA / mla-tickets

Internal tickets from MLA staff seeking assistance from software development and database staff
0 stars 0 forks source link

Peabody Review for Duplication: PBS, 70-89 #93

Open ekemeyer opened 2 weeks ago

ekemeyer commented 2 weeks ago

Details

Hi - I'm hoping I can get some help ID'ing items in an inventory we received from Peabody that are possibly either 1) already in the AAPB or 2) at the LOC. I've pulled out everything I was able to ID as already in the AAPB, as well as all WGBH programming. I've uploaded two inventories to a folder on the shared drive: the Peabody inventory UPDATED_PBS_in_Peabody_1970-1999_MKrev.xlsx) and the LOC's PBS-NET inventory (LC_NET-PBS_2in.xlsx). If you could ID possible matches, I'll then go in and do more manual checking. Let me know if you have any questions!

Submitted by: Michelle CC in communications: Priority: Medium (within this month) URL: https://drive.google.com/drive/folders/1GSmhnHrkkAfQQBdOenESGApYyVT086PD?usp=drive_link Slack message thread:

foo4thought commented 4 days ago

Ugh

foo4thought commented 4 days ago

I'm framing up comparison logic; I plan to read each title cell in the Peabody sheet, then process each text to:

  1. remove anything inside [] and ()
  2. use the characters for period and comma to denote paragraphs, (e.g., it seems to translate to a line for series and another for episode)
  3. search for each "title" line of the result within the column of LOC titles, ignoring case and reporting matches
  4. if none found, search and report again at word level (ignoring case)

Here is a list of unique words (after filtering already described); will you remove all but ones to IGNORE, please? i.e., I will search for all "titles" but not for words in the list to be ignored. words_unique.txt

foo4thought commented 4 days ago

Results illustrated in script editor; text values to be written to file :

Screenshot 2024-10-15 at 4 51 34 PM
foo4thought commented 2 days ago

ran script overnight, inspected output and killed it because iterator bug. Fixed that; re-arranged order of output columns for better VLOOKUP performance in Excel; added code to generate another file to report paired values of barcode and title word counts for use later in Excel (e.g., this a given Peabody barcode could match a given LOC title on 4/7 words, and/or filter matches by the decimal value of matches, etc)

foo4thought commented 1 day ago

killed script again because it was working at a problematic (slow) rate.
exported data from Excel to TSV text and rewrite everything to work in BASH. if this will be a regular need, I'll do it waaay more easily in FileMaker, I suspect. data will move back into Excel after new sheets are installed.