WGBH-MLA / AAPB2

American Archive of Public Broadcasting
https://americanarchive.org/

Add text versions of transcripts from Bill Moyers collection to AAPB #2757

Closed ekemeyer closed 1 month ago

ekemeyer commented 2 months ago

Details

Bill Moyers' team gave us (Miranda) hundreds of beautiful transcripts in a zip file. The transcripts lack any time codes, and are just Word or plain text.

In the AAPB meeting on April 30, Karen backed a decision to go ahead and post these transcripts on the public AAPB site instead of the time-synchronized machine-generated transcripts that are currently up.

This request is for Kevin to take the transcripts from Miranda's zip file, convert files to the appropriate plain text format, upload them to the right location in S3, update the transcript locations, and reindex those asset records.

Submitted by: Kevin
Priority: Medium (within this month)
URL:
Slack message thread:

foo4thought commented 2 months ago
  1. processed DOCX into TXT
  2. reprocessed TXT provided to remove \x{0D} and other text insanity that chokes the ingest on AAPB
  3. narrowed scope of uploads to only assets currently utilizing transcript JSON
  4. collected all stats.txt files from affected assets on S3 (for later reporting fun)
  5. removed from S3 all ASR-generated objects for affected assets
  6. uploaded reprocessed TXT to S3 for affected assets
  7. reindexed AAPB for affected assets
  8. celebrated

INFO [2024-05-02 19:12:23]: Starting one big commit...
INFO [2024-05-02 19:12:29]: Finished one big commit.
INFO [2024-05-02 19:12:29]: SUMMARY: DETAIL
INFO [2024-05-02 19:12:29]: SUMMARY: STATS
INFO [2024-05-02 19:12:29]: (Look just above for details on each error.)
INFO [2024-05-02 19:12:29]: 582 (100.0%) succeeded
INFO [2024-05-02 19:12:29]: DONE
############################ ENDING HOST 52.55.103.243 ############################
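For anyone repeating step 2, here is a minimal sketch of the kind of cleanup described above: removing `\x0D` and other stray control characters from a transcript that is already plain text. The function name `clean_transcript` and the exact set of characters removed are assumptions for illustration, not the script that was actually run.

```python
import re

# Control characters other than tab (\x09) and newline (\x0A).
# Which characters actually choked the ingest is not stated; this set is a guess.
_CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")


def clean_transcript(text: str) -> str:
    """Return text with LF-only line endings and no stray control characters."""
    # Normalize CRLF and bare CR (\x0D) line endings to LF.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Drop a leading byte order mark, if any.
    text = text.lstrip("\ufeff")
    # Remove the remaining control characters.
    text = _CONTROL_CHARS.sub("", text)
    # Strip trailing whitespace from every line.
    return "\n".join(line.rstrip() for line in text.split("\n"))
```

Something like `clean_transcript(open(path, encoding="utf-8").read())` would then give the text to upload; the input encoding is also an assumption and may need adjusting per file.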
foo4thought commented 2 months ago

oop -

ekemeyer commented 1 month ago

Miranda has sent some redone transcripts - see email from May 20, 3:32 pm for zipped docx files.

ekemeyer commented 1 month ago

Done:
converted to TXT
uploaded with backups to S3, metadata updated
reindexed on AAPB
updated all assignments in AAPB_Enhancements

Batch Ingest 3008

INFO [2024-05-21 19:09:47]: Starting one big commit...
INFO [2024-05-21 19:09:47]: Finished one big commit.
INFO [2024-05-21 19:09:47]: SUMMARY: DETAIL
INFO [2024-05-21 19:09:47]: SUMMARY: STATS
INFO [2024-05-21 19:09:47]: (Look just above for details on each error.)
INFO [2024-05-21 19:09:47]: 42 (100.0%) succeeded
INFO [2024-05-21 19:09:47]: DONE
############################ ENDING HOST 52.55.103.243 ############################
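
If it helps for later reporting, a rough sketch for pulling the success count out of an ingest log like the one above. The pattern is based only on the "N (P%) succeeded" line shown in this thread; the function name `ingest_summary` is hypothetical, and other log variants are not handled.

```python
import re

# Matches e.g. "42 (100.0%) succeeded" as it appears in the logs in this thread.
_SUMMARY = re.compile(r"(\d+) \((\d+(?:\.\d+)?)%\) succeeded")


def ingest_summary(log: str):
    """Return (count, percent) from the 'succeeded' line, or None if absent."""
    match = _SUMMARY.search(log)
    if match is None:
        return None
    return int(match.group(1)), float(match.group(2))
```

A percent below 100.0 would be the cue to look just above the summary in the log for the per-record errors.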