Fudge / infowars

Transcripts of the Alex Jones Show
No punctuations for some transcripts? #1

Open ReichYang opened 1 year ago

ReichYang commented 1 year ago

Hi, first of all, thank you for this amazing work! I'm about to run some analysis on this corpus, but there is one issue: some transcripts have no punctuation at all, for example 20150722_Wed_Alex.txt, and it seems to happen randomly. This might break some of my NLP pipelines. Do you have any insight into the cause, or could the affected files be fixed? Thanks in advance!
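
One way to find the affected files is to count punctuation marks per word and flag transcripts where the ratio is close to zero. The following is a minimal sketch, assuming the transcripts are plain-text UTF-8 files sitting in one directory; the function names, the set of marks, and the 0.01 threshold are illustrative choices, not anything taken from the repository.

```python
from pathlib import Path

# Marks counted as evidence of punctuation (an illustrative choice).
MARKS = ".?!,"


def punctuation_ratio(text: str) -> float:
    """Return punctuation marks per whitespace-separated word (0.0 for empty text)."""
    words = text.split()
    if not words:
        return 0.0
    marks = sum(text.count(ch) for ch in MARKS)
    return marks / len(words)


def find_unpunctuated(directory: str, threshold: float = 0.01) -> list[str]:
    """List the .txt files in `directory` whose punctuation ratio is below `threshold`."""
    flagged = []
    for path in sorted(Path(directory).glob("*.txt")):
        text = path.read_text(encoding="utf-8", errors="replace")
        if punctuation_ratio(text) < threshold:
            flagged.append(path.name)
    return flagged
```

Calling `find_unpunctuated("path/to/transcripts")` would return the names of the files to skip or re-transcribe. The threshold probably needs tuning against a few known-good and known-bad files.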

milahu commented 12 months ago

YouTube subtitles generated by Google's speech recognition never have punctuation, and they contain typos like the ones in #2.

If you see punctuation, it comes from manual postprocessing.

ReichYang commented 12 months ago

I don't think the transcripts came from YouTube. I thought they were produced with OpenAI's Whisper. The issue has more to do with tuning and with how the model is initialized. See https://github.com/openai/whisper/discussions/194

Fudge commented 12 months ago

I'm using Whisper from OpenAI, yes. And it sometimes doesn't get things right. Not sure what to do about it.

ReichYang commented 12 months ago

Hi, thanks for the reply. It looks like passing an initial prompt that contains punctuation might fix it (https://github.com/openai/whisper/discussions/194), but I'm not sure whether that's doable for you or whether it would cost too much.
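
For reference, the idea from that discussion can be sketched with the `openai-whisper` Python package, whose `transcribe` method accepts an `initial_prompt` argument. This is only a sketch under assumptions: the prompt text, the function name, and the default model size are made up for illustration, and the prompt nudges the model toward punctuated output rather than guaranteeing it.

```python
# Hypothetical prompt: any well-punctuated text in the style of the audio should do.
PUNCTUATED_PROMPT = (
    "Hello, welcome to the show. Today, we have a lot to cover. Are you ready?"
)


def transcribe_with_prompt(audio_path: str, model_name: str = "small") -> str:
    """Transcribe one audio file, seeding the decoder with a punctuated prompt."""
    # Imported lazily so this module can be loaded without openai-whisper installed.
    import whisper

    model = whisper.load_model(model_name)
    result = model.transcribe(audio_path, initial_prompt=PUNCTUATED_PROMPT)
    return result["text"]
```

The cost is the same as a normal transcription run, since the prompt only changes what the decoder is conditioned on.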

Fudge commented 11 months ago

If it's just a few episodes, I can re-do them. I've used a prompt with punctuation for the last few months, so more recent stuff should be more reliable.

Fudge commented 11 months ago

I've started re-transcribing everything using a more recent and much faster Whisper implementation, letting me use the large-v2 model and an initial prompt with punctuation. It's going to take a bit over a month of uninterrupted work to get it done.

ReichYang commented 11 months ago

Thank you so much! I was going to post here about the episodes with missing punctuation, but I somehow forgot. IIRC there are more than just a few, and they are scattered randomly over time, so it was hard to pick a cut-off date. I look forward to the new dataset!
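
To check whether a cut-off date exists at all, the flagged filenames can be bucketed by month. A minimal sketch, assuming every transcript name starts with a `YYYYMMDD` date as in 20150722_Wed_Alex.txt; the function name is hypothetical and the list of flagged files has to come from a separate check.

```python
from collections import Counter
from datetime import datetime


def flagged_per_month(filenames: list[str]) -> Counter:
    """Count flagged transcripts per 'YYYY-MM', skipping names without a date prefix."""
    counts: Counter = Counter()
    for name in filenames:
        try:
            date = datetime.strptime(name[:8], "%Y%m%d")
        except ValueError:
            # Not a dated transcript file; ignore it.
            continue
        counts[date.strftime("%Y-%m")] += 1
    return counts
```

If the counts are spread across many months instead of clustering before one date, a simple cut-off will not separate good transcripts from bad ones.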

ReichYang commented 9 months ago

@Fudge Hello, just wanted to check back in on this issue. Based on the commits, it looks like all transcripts have been updated. Is that the case?

Fudge commented 9 months ago

I've fed everything through WhisperX with punctuation prompts, so it's about as good as I can get it without manual editing.