benwiley4000 / youtube-vtt

▶️ Extract and save WebVTT closed caption tracks from YouTube videos
MIT License
46 stars 13 forks

Speaker Identification/Recognition in youtube-vtt ? #10

Open scooter7 opened 2 years ago

scooter7 commented 2 years ago

Hi, can this software identify multiple speakers and label them in the transcript output? Thanks!

benwiley4000 commented 2 years ago

Is this something supported by YouTube captions? Can you link to an example video?

scooter7 commented 2 years ago

I don't believe so. But, I'll keep digging. Thanks!

benwiley4000 commented 2 years ago

If there aren't labels in the YouTube captions I highly doubt we can do it here. But maybe there's hidden metadata in there?

scooter7 commented 2 years ago

Interesting...any ideas on how to determine the possibility of hidden metadata? Thanks!

benwiley4000 commented 2 years ago

Go to your YouTube video, open the JavaScript console and paste this:

console.log(ytplayer.config.args.raw_player_response.captions)

You can explore what's in there and see if there's any info corresponding to speaker identification.

If not, then you're probably out of luck, sorry.
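Exploring a large nested object by hand is tedious, so here is a minimal sketch of a helper that lists every key or string value mentioning "speaker" or "voice". The function name `findSpeakerHints`, the search pattern, and the sample object are all assumptions for illustration; the real structure of the captions object is undocumented and may contain nothing of the sort.

```javascript
// Recursively collect the paths of keys or string values that mention
// "speaker" or "voice". The pattern is a guess, not a documented field name.
function findSpeakerHints(value, path = 'captions', hits = []) {
  const pattern = /speaker|voice/i;
  if (typeof value === 'string') {
    if (pattern.test(value)) hits.push(`${path} = ${JSON.stringify(value)}`);
  } else if (value !== null && typeof value === 'object') {
    const isArray = Array.isArray(value);
    for (const [key, child] of Object.entries(value)) {
      const childPath = isArray ? `${path}[${key}]` : `${path}.${key}`;
      if (!isArray && pattern.test(key)) hits.push(childPath);
      findSpeakerHints(child, childPath, hits);
    }
  }
  return hits;
}

// Hypothetical sample object, only to show how the helper is called.
const sample = {
  tracks: [
    { languageCode: 'en', label: 'English' },
    { languageCode: 'en', speakerId: 2 },
  ],
};
console.log(findSpeakerHints(sample));
```

In the browser console you would pass the object from the snippet above instead of `sample`. An empty result only means nothing matched this particular pattern.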

tomByrer commented 2 years ago

I don't think there is a standard. (edit: there is; see below) http203 & some news orgs will put each speaker's name in FULL-CAPS. I've also seen different colors & positions (left side & right side) used to tell speakers apart.

If you want to make a soft-standard, please let me know!

benwiley4000 commented 2 years ago

I believe what @tomByrer is talking about is putting SPEAKER NAME: at the beginning of the line every time a new speaker starts talking. This works very well and I think it is often used, but in that case I'm not sure what additional task youtube-vtt should perform, since the label is already there in the output. I thought @scooter7 was asking whether YouTube uses some unsupervised learning tech to identify unlabeled discrete speakers so their lines can be distinguished from one another, or perhaps I totally misunderstood the original question. I'm assuming the use case is for videos where you had no control over the creation of the captions but you want to download them.
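For captions that already follow the SPEAKER NAME: convention, separating the label from the text can be done after download. This is a minimal sketch, not part of youtube-vtt; the function name `splitSpeakerLabel` is hypothetical, and it assumes labels are written in capitals and end with a colon.

```javascript
// Split a caption line such as "JAKE: Hello there" into speaker and text.
// Assumes the label is all capitals (digits, spaces, dots, apostrophes and
// hyphens allowed) and ends with a colon. Lines that do not match come
// back with speaker set to null.
function splitSpeakerLabel(line) {
  const match = /^\s*([A-Z][A-Z0-9 .'-]*):\s*(.*)$/.exec(line);
  if (!match) return { speaker: null, text: line.trim() };
  return { speaker: match[1].trim(), text: match[2] };
}

console.log(splitSpeakerLabel('JAKE: Hello there'));
console.log(splitSpeakerLabel('just a normal caption line'));
```

A stricter or looser pattern may be needed depending on how a given channel writes its labels.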

scooter7 commented 2 years ago

Hi,

That's exactly right. I was wondering if it would be possible to distinguish between/among speakers automatically and to have those distinctions added to the transcripts in terms of "Speaker 1," "Speaker 2," etc.

Make sense?

Thanks,


benwiley4000 commented 2 years ago

It does make sense, but unless you're aware of some way YouTube already distinguishes between speakers in its UI, it's basically impossible to accomplish on our end. It would require fetching the audio, slicing it up based on caption timestamps, sending the slices to an external machine learning service that someone would have to pay for, possibly waiting a while for a result, and then sending back a label, which doesn't even have a meaningful name yet.

But if you've seen some feature where, let's say, YouTube groups captions on either end of the screen based on who's speaking, then we might be able to get something.
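To make the "slicing by caption timestamp" step concrete, here is a minimal sketch that turns the text of a .vtt file into start and end times in seconds. The function names `parseTimestamp` and `cueSegments` are hypothetical. Fetching the audio and the speaker-recognition service itself are not shown.

```javascript
// Convert "HH:MM:SS.mmm" or "MM:SS.mmm" into seconds.
function parseTimestamp(stamp) {
  return stamp.split(':').map(Number).reduce((total, part) => total * 60 + part, 0);
}

// Turn the text of a .vtt file into a list of { start, end, text } segments.
function cueSegments(vtt) {
  const timing = /^((?:\d+:)?\d{2}:\d{2}\.\d{3})\s+-->\s+((?:\d+:)?\d{2}:\d{2}\.\d{3})/;
  const segments = [];
  // Cues are separated by blank lines; normalise line endings first.
  for (const block of vtt.replace(/\r\n?/g, '\n').split(/\n{2,}/)) {
    const lines = block.split('\n');
    const index = lines.findIndex((line) => timing.test(line));
    if (index === -1) continue; // header, NOTE or STYLE block
    const [, start, end] = timing.exec(lines[index]);
    segments.push({
      start: parseTimestamp(start),
      end: parseTimestamp(end),
      text: lines.slice(index + 1).join('\n').trim(),
    });
  }
  return segments;
}

const example = 'WEBVTT\n\n00:00:01.000 --> 00:00:03.500\nHello\n';
console.log(cueSegments(example));
```

Each segment's `start` and `end` could then be handed to whatever tool cuts the audio.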

Ben


scooter7 commented 2 years ago

Got it. Thanks for clarifying!


tomByrer commented 2 years ago

Actually, there is a standard for voices: https://w3c.github.io/webvtt/#example-03fc63a3

<v speaker>I am speaking

& CSS to style would look like:

::cue(v[voice="speaker"]) { color: cyan }

Unsure if it would work in YouTube or other video players, or if YouTube would strip it.
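If a downloaded track does contain voice tags, they could be turned into plain "Name: text" labels. This is a rough sketch: the function name `voiceTagsToLabels` is hypothetical, and the regular expression ignores tags nested inside a voice span (such as `<i>`), which a real WebVTT parser would handle.

```javascript
// Rewrite WebVTT voice spans such as "<v Esme>Hi</v>" as "Esme: Hi".
// The closing </v> is optional when the span covers the whole cue.
// Optional classes on the tag ("<v.loud Esme>") are skipped.
function voiceTagsToLabels(cueText) {
  const voice = /<v(?:\.[^\s>]+)?\s+([^>]+)>([^<]*)(?:<\/v>)?/g;
  const labelled = [];
  for (const [, name, text] of cueText.matchAll(voice)) {
    labelled.push(`${name.trim()}: ${text.trim()}`);
  }
  return labelled;
}

console.log(voiceTagsToLabels('<v speaker>I am speaking'));
```

Cue text without any voice tag yields an empty array, so callers can fall back to the raw text.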

I think there is a better format to upload than VTT for YouTube specifically, tested here: https://youtu.be/9W0Dy1nM-zU There is a Reddit thread about it.