Open scooter7 opened 2 years ago
Is this something supported by YouTube captions? Can you link to an example video?
I don't believe so. But, I'll keep digging. Thanks!
If there aren't labels in the YouTube captions I highly doubt we can do it here. But maybe there's hidden metadata in there?
Interesting...any ideas on how to determine the possibility of hidden metadata? Thanks!
Go to your YouTube video, open the JavaScript console and paste this:
console.log(ytplayer.config.args.raw_player_response.captions)
You can explore what's in there and see if there's any info corresponding to speaker identification.
If not, then you're probably out of luck, sorry.
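If clicking through the object by hand gets tedious, a small helper can list every key whose name hints at speaker info. This is only a sketch: `findKeyPaths` is a made-up helper, the sample object and its field names are invented for the demo, and nothing here assumes anything about what YouTube actually stores in that object.

```javascript
// Hypothetical helper: walk any nested object and collect the paths of keys
// whose names match a pattern (for example /speaker|voice/i).
function findKeyPaths(value, pattern, path = '', seen = new Set()) {
  // Skip primitives and anything already visited (guards against cycles).
  if (value === null || typeof value !== 'object' || seen.has(value)) return [];
  seen.add(value);
  const hits = [];
  for (const [key, child] of Object.entries(value)) {
    const childPath = Array.isArray(value)
      ? `${path}[${key}]`
      : (path ? `${path}.${key}` : key);
    // Only test real property names, not array indices.
    if (!Array.isArray(value) && pattern.test(key)) hits.push(childPath);
    hits.push(...findKeyPaths(child, pattern, childPath, seen));
  }
  return hits;
}

// Invented sample data, just to show the shape of the result.
const sample = { tracks: [{ languageCode: 'en', speakerLabel: 'A' }], voiceInfo: {} };
console.log(findKeyPaths(sample, /speaker|voice/i));
// → [ 'tracks[0].speakerLabel', 'voiceInfo' ]
```

In the browser console you could pass the captions object from the snippet above instead of `sample`. An empty array means no key names matched, which is a hint (not proof) that no speaker metadata is exposed there.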
I don't think there is a standard. (edit: there is; see below) http203 & some news orgs will put each speaker in FULL-CAPS. I've also seen different colors & positions (left side & right side).
If you want to make a soft-standard, please let me know!
I believe what @tomByrer is talking about is putting SPEAKER NAME: at the beginning of the line every time a new speaker is talking. This works very well and I think it is often used, but at this point I'm not sure what additional task youtube-vtt should perform, since the label is already there in the output. I thought @scooter7 was asking whether YouTube uses some unsupervised learning tech to identify unlabeled discrete speakers so their lines can be distinguished from one another, or perhaps I totally misunderstood the original question. I'm assuming the use case is for videos where you had no control over the creation of the captions but you want to download them.
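For captions that already follow the SPEAKER NAME: convention, pulling the label out afterwards is straightforward. Here is a rough sketch, not something youtube-vtt does today: `splitSpeaker` and `carrySpeakers` are hypothetical names, and the regex assumes labels are written in capitals, are at most 40 characters long, and end with a colon.

```javascript
// Assumed convention: an all-caps label followed by a colon starts the line.
const SPEAKER_PREFIX = /^([A-Z][A-Z0-9 .'-]{0,39}):\s*([\s\S]*)$/;

// Split one caption line into an optional speaker label and the spoken text.
function splitSpeaker(line) {
  const trimmed = line.trim();
  const match = SPEAKER_PREFIX.exec(trimmed);
  return match
    ? { speaker: match[1].trim(), text: match[2] }
    : { speaker: null, text: trimmed };
}

// Lines without a label inherit the most recent speaker.
function carrySpeakers(lines) {
  let current = null;
  return lines.map((line) => {
    const { speaker, text } = splitSpeaker(line);
    if (speaker) current = speaker;
    return { speaker: current, text };
  });
}

console.log(carrySpeakers(['JAKE: Hello.', 'How are you?', 'SURMA: Fine.']));
```

Lines such as "Note: see below" are left alone because the label must be entirely upper case. Captions that use mixed-case names would need a looser pattern and would produce more false positives.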
Hi,
That's exactly right. I was wondering if it would be possible to distinguish between/among speakers automatically and to have those distinctions added to the transcripts in terms of "Speaker 1," "Speaker 2," etc.
Make sense?
Thanks,
James Vineburgh, Jr., PhD
It does make sense, although unless you're aware of some way YouTube already distinguishes between speakers in its UI, it's basically impossible to accomplish on our end. It would require fetching the audio, slicing it up based on the caption timestamps, sending it to an external machine learning service that someone would have to pay for, possibly waiting a while for a result, and then sending back the label, which doesn't even have a meaningful name yet.
But if you've seen some feature where, let's say, YouTube groups captions on either end of the screen based on who's speaking, then we might be able to get something.
Ben
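To make the last step of that pipeline concrete, here is a minimal sketch of the labeling part only, assuming some external diarization service has already returned time segments tagged with opaque speaker ids. The segment shape (`start`, `end`, `speaker`) and the function names are hypothetical; fetching audio and calling a service are deliberately left out.

```javascript
// Seconds of overlap between two { start, end } intervals.
function overlapSeconds(a, b) {
  return Math.max(0, Math.min(a.end, b.end) - Math.max(a.start, b.start));
}

// Give each cue the speaker whose segment overlaps it the most, renaming
// opaque ids to "Speaker 1", "Speaker 2", ... in order of first appearance.
function labelCues(cues, segments) {
  const names = new Map();
  return cues.map((cue) => {
    let bestId = null;
    let bestOverlap = 0;
    for (const segment of segments) {
      const shared = overlapSeconds(cue, segment);
      if (shared > bestOverlap) {
        bestOverlap = shared;
        bestId = segment.speaker;
      }
    }
    if (bestId === null) return { ...cue, speaker: null }; // no segment overlaps this cue
    if (!names.has(bestId)) names.set(bestId, `Speaker ${names.size + 1}`);
    return { ...cue, speaker: names.get(bestId) };
  });
}

// Invented data for illustration.
const cues = [
  { start: 0, end: 2, text: 'Hi there.' },
  { start: 2, end: 5, text: 'Hello, welcome back.' },
];
const segments = [
  { start: 0, end: 2.5, speaker: 'a' },
  { start: 2.5, end: 5.5, speaker: 'b' },
];
console.log(labelCues(cues, segments).map((cue) => cue.speaker));
// → [ 'Speaker 1', 'Speaker 2' ]
```

The labels are only "Speaker 1", "Speaker 2", and so on, which is the point made above: even with diarization, there is no meaningful name to attach.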
Got it. Thanks for clarifying!
Actually, there is a standard for voices: https://w3c.github.io/webvtt/#example-03fc63a3
<v speaker>I am speaking
& the CSS to style it would look like: ::cue(v[voice="speaker"]) { color: cyan }
Unsure if it would work in YouTube or other video players, or if YouTube would strip it.
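If a caption file does contain voice tags, reading them back out is simple. The following is a rough sketch rather than a full WebVTT parser: `parseVoices` is a hypothetical name, it only handles the `<v Name>` form (optionally with classes, as in `<v.loud Name>`), and it strips any other inline tags from the text.

```javascript
// Matches a voice tag and the text after it, up to the next voice tag,
// a closing </v>, or the end of the cue.
const VOICE_SPAN = /<v(?:\.[^\s>]*)?[ \t]+([^>]+)>([\s\S]*?)(?=<v[\s.]|<\/v>|$)/g;

// Return one { speaker, text } entry per voice span in a cue's text.
function parseVoices(cueText) {
  return Array.from(cueText.matchAll(VOICE_SPAN), (match) => ({
    speaker: match[1].trim(),
    text: match[2].replace(/<[^>]+>/g, '').trim(), // drop tags such as <i> or <b>
  }));
}

console.log(parseVoices('<v speaker>I am speaking'));
// → [ { speaker: 'speaker', text: 'I am speaking' } ]
```

A cue with no voice tags yields an empty array, so a caller can tell "no speaker info" apart from a speaker with empty text.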
I think there is a better format to upload than VTT for YouTube specifically, tested here: https://youtu.be/9W0Dy1nM-zU There is a Reddit thread about it.
Hi, Can this software identify multiple speakers and label them in the transcript output? Thanks!