atdrago / pod-monster

A front-end to the Podcast Index
https://pod.monster

Convert text/html transcripts into text/vtt (WebVTT) #179

Closed atdrago closed 2 years ago

atdrago commented 2 years ago

Problem

A text/html transcript can be any valid HTML document, but it often has time codes embedded in plain text throughout the HTML.

Example that's entirely text with time codes https://share.transistor.fm/s/0ba4b425/transcript.txt:

[00:00:00] **Cassidy Williams:** Previously on Remotely Interesting... 

[00:00:03] **Jason Lengstorf:** For everyone who's not watching this live Tara just opened her mouth at the end of that joke. 

[00:00:10] **Cassidy Williams:** Hello and welcome to Remotely Interesting. 

Example that has nested HTML with time codes https://feeds.buzzsprout.com/1538779/10139859/transcript:

<body>
  <h1><!--block-->Podland 24/02</h1>
  <p><!--block-->[00:00:00]</p>
  <p><!--block-->[00:00:00] Welcome to Podland the last word in podcasting use. It's the 24th of February, 2022. I'm James. Cridlin the editor of pod news.net.</p>
  <p><!--block-->[00:00:08] And I'm Sam Sethi, the MD of river radio, the podcast, first radio station going live on dab on the 1st of March.</p>
</body>

Example that is all text with no time codes (can't do anything special with these) https://share.transistor.fm/s/c4ee7fb9/transcript.txt:

Jeremy:
Hi, everyone. I'm Jeremy Daly.

Rebecca:
And I'm Rebecca Marshburn.

Jeremy:
And this is Serverless Chats. Hey, Rebecca.

Solution

We should try to look for these time codes and create a valid text/vtt document from them. If for whatever reason that isn't possible, the transcript should be rendered as it is today.
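A rough sketch of how that conversion could work, assuming time codes always look like `[hh:mm:ss]` as in the first two examples. The names `transcriptToVtt`, `formatTimestamp`, and `lastCueSeconds` are hypothetical and not taken from the pod-monster codebase. Tags are stripped with a naive regex rather than a real HTML parser, each cue is assumed to end where the next one starts, and the last cue gets an arbitrary fallback duration. The function returns `null` when no time codes are found, so the caller can keep rendering the transcript as it does today.

```typescript
// Format whole seconds as a WebVTT timestamp (hh:mm:ss.ttt).
function formatTimestamp(totalSeconds: number): string {
  const pad = (n: number): string => (n < 10 ? "0" + n : String(n));
  const h = Math.floor(totalSeconds / 3600);
  const m = Math.floor((totalSeconds % 3600) / 60);
  const s = totalSeconds % 60;
  return `${pad(h)}:${pad(m)}:${pad(s)}.000`;
}

// Hypothetical helper: returns a WebVTT document, or null if the input has
// no usable [hh:mm:ss] time codes.
function transcriptToVtt(input: string, lastCueSeconds = 5): string | null {
  // Naive tag stripping (assumption: good enough for simple transcripts).
  const text = input
    .replace(/<!--[\s\S]*?-->/g, "")
    .replace(/<[^>]+>/g, "\n");

  // Record where each time code sits and the time it represents.
  const timeCode = /\[(\d{2}):(\d{2}):(\d{2})\]/g;
  const marks: { seconds: number; index: number; end: number }[] = [];
  let match: RegExpExecArray | null;
  while ((match = timeCode.exec(text)) !== null) {
    const seconds =
      Number(match[1]) * 3600 + Number(match[2]) * 60 + Number(match[3]);
    marks.push({ seconds, index: match.index, end: match.index + match[0].length });
  }

  // The text between one time code and the next becomes that cue's payload.
  const cues: { start: number; text: string }[] = [];
  marks.forEach((mark, i) => {
    const next = marks[i + 1];
    const to = next ? next.index : text.length;
    const cueText = text.slice(mark.end, to).replace(/\s+/g, " ").trim();
    if (cueText !== "") {
      cues.push({ start: mark.seconds, text: cueText });
    }
  });
  if (cues.length === 0) {
    return null;
  }

  // A cue must end after it starts, so fall back to a fixed duration when
  // the next cue does not start later (or when there is no next cue).
  const blocks = cues.map((cue, i) => {
    const next = cues[i + 1];
    const end = next && next.start > cue.start ? next.start : cue.start + lastCueSeconds;
    return `${formatTimestamp(cue.start)} --> ${formatTimestamp(end)}\n${cue.text}`;
  });
  return `WEBVTT\n\n${blocks.join("\n\n")}\n`;
}
```

This leaves speaker names (like `**Cassidy Williams:**`) in the cue text as-is and does not escape cue text that happens to contain `-->`, so a real implementation would need to handle both.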

djdmbrwsk commented 2 years ago

I was looking into this a bit last night. Do you have any idea how common this [00:00:00] format is? Coming in with no prior knowledge, I'm not seeing much about a standard for TXT transcripts. I do see some mentions of an SRT format, and lots of services offer TXT exports too, but there's no clear indication of how those TXT exports are formatted.

Do we know what the popular transcription services are today? Might be able to code for their export formats.