Closed by tohuynh, 4 years ago
This is bad on my part for not being more transparent about project planning. Our fork of PyWebVTT is exactly why we are trying to get this working so thank you for adding the issue here. (I will try to be better about issue tracking and project management in the future)
In short, you've got the steps needed to set this up right, with a few more considerations.
Step Comments:
Modify SeattleEventScraper: Yep, that's exactly correct. I am not personally seeing a caption URI in the onclick portion of the tag. A link or more details on how you found that would be greatly appreciated.
Correct.
Totally correct. A new WebVTTConversionModel (or similar name) would be needed. Side comment: the documentation page on transcript formats should be updated in general, but especially after this.
Agree. However, as closed caption files were only added in mid-2019 (and to make the system more robust for other cities as well), you would need a try-catch or an if-else. This would propagate a change to the pipeline configuration JSON files as well, which would need to say something like:
"speech_recognition_model": {
    "try": {
        "module_path": "cdptools.sr_models.web_vtt_conversion_model",
        "object_name": "WebVTTConversionModel",
        "object_kwargs": {}
    },
    "catch": {
        "module_path": "cdptools.sr_models.google_cloud_sr_model",
        "object_name": "GoogleCloudSRModel",
        "object_kwargs": {
            "credentials_path": "/home/cdptools/credentials/cdp_seattle_cloud_platform_all.json"
        }
    }
}
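As a sketch of how the pipeline could consume such a try/catch block, something like the following would work. Note that `load_sr_model` and `load_with_fallback` are hypothetical helper names, not existing cdptools functions:

```python
import importlib


def load_sr_model(config: dict):
    """Instantiate an SR model from a {module_path, object_name, object_kwargs} block."""
    module = importlib.import_module(config["module_path"])
    model_class = getattr(module, config["object_name"])
    return model_class(**config["object_kwargs"])


def load_with_fallback(sr_config: dict):
    """Try the primary ("try") model; fall back to the "catch" model on failure.

    In practice the fallback likely also belongs at transcribe time
    (e.g. when no valid caption_uri exists), not only at load time.
    """
    try:
        return load_sr_model(sr_config["try"])
    except Exception:
        return load_sr_model(sr_config["catch"])
```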
Other general comments:
I don't personally like that closed caption files are all capitals, so some work to make them more "normal" would be great. I think the Python library nltk has some solutions for that.
From @tohuynh:
The .vtt file is also found in the onclick but does not have a fully resolved URI.
@tohuynh how did you figure out the fully resolved URI in this case: https://seattlechannel.org/documents/seattlechannel/closedcaption/2019/brief_121619_2011995.vtt ?
Discovered for myself. In the script for the banner video, a partially resolved URI exists. Example Banner Video Script:
var embedCode = '<iframe src="http://seattlechannel.org/embedvideoplayer?videoid=x109072" frameborder="0" scrolling="no" style="min-height: 270px; min-width: 480px"></iframe>';
var shareLink = "http://seattlechannel.org/FullCouncil?videoid=x109072";
try {
jwplayer('vidPlayer').remove();
}
catch(emsg){};
var playerInstance = jwplayer('vidPlayer');
playerInstance.setup({
sources: [
{
file: "//video.seattle.gov/media/council/council_121619_2021995V.mp4",
label: "Auto"
}
],
image: "images/seattlechannel/videos/2019/Q4/council_121619.jpg",
primary: "html5",
tracks: [{
file: "documents/seattlechannel/closedcaption/2019/council_121619_2021995.vtt",
label: "English",
kind: "captions",
"default": true
}
],
sharing: {
code: encodeURI(embedCode),
link: shareLink
},
ga: {
idstring:'City Council '
}
});
playerInstance.on('complete', function () {
$('#vidPlayer').append($('.overlayBox'));
});
playerInstance.on('beforePlay', function () {
});
playerInstance.on('play', function () {
$('.overlayBox').hide();
});
playerInstance.on('error', function (message) {
if (!$('img#videoError').length) {
$('#vidPlayer').after("<img id='videoError' class='img-responsive' src='images/seattlechannel/videoimages/channelGeneric.jpg' alt='There was an error' />");
}
else {
$('img#videoError').show();
}
$('#vidPlayer').hide();
});
playerInstance.on('setupError', function () {
if (!$('img#videoError').length) {
$('#vidPlayer').after("<img id='videoError' class='img-responsive' src='images/seattlechannel/videoimages/channelGeneric.jpg' alt='There was an error' />");
}
else {
$('img#videoError').show();
}
$('#vidPlayer').hide();
});
playerInstance.on('ready', function () {
$(".programImage .overlayBox").show();
$(".VideoComponent .overlayBox").show();
});
jwplayer('vidPlayer').addButton(
//This portion is what designates the graphic used for the button
"images/seattlechannel/download.png",
//This portion determines the text that appears as a tooltip
"Download Video",
//This portion designates the functionality of the button itself
function () {
//With the below code, we're grabbing the file that's currently playing
window.location.href = jwplayer('vidPlayer').getPlaylistItem()['file'];
},
//And finally, here we set the unique ID of the button itself.
'downloadvidPlayer'
);
$(".podcastContainer").hide();
Pulled From: Full Council
Yes, the more resolved caption URI is also there in the script. But I thought I could just prepend https://seattlechannel.org/documents/seattlechannel/closedcaption/ to the found URI.
I also don't like the all-caps captions. I think the NLP term is truecasing.
We could do something simple like sentence segmentation and part-of-speech tagging from nltk: the first letter of each sentence and proper nouns get capitalized.
Or we could use Stanford CoreNLP, which has a truecasing annotator.
But I think we should split truecasing out into another issue.
Oh, totally agree on just prepending https://seattlechannel.org/documents/seattlechannel/closedcaption/ to the URI. I was just very confused as to how you found that prefix, haha.
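For the record, the prepend step can be sketched like this. `resolve_caption_uri` is a hypothetical helper, and it assumes the tracks path in the player script is relative to the site root (as in the Example Banner Video Script above):

```python
import re
from typing import Optional
from urllib.parse import urljoin

SEATTLE_CHANNEL_BASE = "https://seattlechannel.org/"


def resolve_caption_uri(script_text: str) -> Optional[str]:
    """Pull the partially resolved .vtt path out of the JWPlayer setup
    script and prepend the site root to fully resolve it."""
    match = re.search(r'file:\s*"([^"]+\.vtt)"', script_text)
    if match is None:
        return None
    return urljoin(SEATTLE_CHANNEL_BASE, match.group(1))
```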
I would prefer to keep using nltk instead of introducing yet another heavy dependency. Here is a stackoverflow post on how to do it with nltk: Link
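As a toy illustration of the simple approach (sentence segmentation plus capitalizing sentence starts and proper nouns), here is a dependency-free sketch. In a real version, nltk's sent_tokenize and pos_tag would replace the naive regex splitting and the hard-coded proper-noun set used here:

```python
import re

# Hypothetical stand-in for the proper nouns that nltk POS tagging would identify
PROPER_NOUNS = {"bagshaw", "pacheco", "juarez", "gonzalez", "july", "seattle"}


def naive_truecase(text: str) -> str:
    """Lowercase an all-caps caption, then re-capitalize the first word of
    each sentence and any known proper nouns."""
    sentences = re.split(r"(?<=[.!?])\s+", text.lower())
    fixed = []
    for sentence in sentences:
        words = sentence.split()
        if not words:
            continue
        # Capitalize known proper nouns (ignoring trailing punctuation)
        words = [w.capitalize() if w.strip(",.") in PROPER_NOUNS else w for w in words]
        # Capitalize the sentence start
        words[0] = words[0][0].upper() + words[0][1:]
        fixed.append(" ".join(words))
    return " ".join(fixed)
```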
I think the desire for our fork of webvtt was to make a function that converts into our transcript format. But I think you are correct in that the vtt_parse_cdp function (or whatever it will be called) should simply live on the SRModel instead of in our forked library. There was some starting work on that under the WebVTT.read_cdp function.
I was able to find the full URI by looking under the Network tab and then finding the vtt file.
Example of how to convert to the potential format (from To), here. (Copy and paste that file into a .ipynb file and then open it in JupyterLab.)
Produced JSON:
{
    "format": "timestamped-speaker-turns",
    "annotations": [],
    "confidence": 1,
    "data": [
        {
            "start_time": 13.213,
            "end_time": 14.948,
            "text": "OKAY, GOOD MORNING."
        },
        {
            "start_time": 14.948,
            "end_time": 15.348,
            "text": "GOOD MORNING."
        },
        {
            "start_time": 15.348,
            "end_time": 47.514,
            "text": "THANKS FOR BEING HERE FOR OUR REGULAR SCHEDULED BRIEFING ON JULY 15. A FEW THINGS JUST TO MENTION BEFORE WE GO AROUND THE TABLE. WE WERE JOINED BY COUNCIL MEMBER BAGSHAW, PACHECO, JUAREZ, AND GONZALEZ. IF THERE'S NO OBJECTION TO THE MINUTES OF THE JULY 8, 2019 MEETING, IT'LL BE APPROVED. SEEING NO OBJECTIONS, THOSE MINUTES ARE BEING APPROVED. I JUST WANT TO MENTION -- I'M SORRY?"
        },
        {
            "start_time": 47.514,
            "end_time": 49.516,
            "text": "I MAY HAVE AN OBJECTION."
        }
    ]
}
@tohuynh A couple comments on the produced example JSON.
It would be great to keep sentences split. So something like:
{
    "format": "timestamped-speaker-turns",
    "annotations": [],
    "confidence": 1,
    "data": [
        [
            {
                "start_time": 13.213,
                "end_time": 14.948,
                "text": "OKAY, GOOD MORNING."
            }
        ],
        [
            {
                "start_time": 14.948,
                "end_time": 15.348,
                "text": "GOOD MORNING."
            }
        ],
        [
            {
                "start_time": 15.348,
                "end_time": 22.314,
                "text": "THANKS FOR BEING HERE FOR OUR REGULAR SCHEDULED BRIEFING ON JULY 15."
            },
            {
                "start_time": 23.542,
                "end_time": 29.888,
                "text": "WE WERE JOINED BY COUNCIL MEMBER BAGSHAW, PACHECO, JUAREZ, AND GONZALEZ."
            },
            {
                "start_time": 31.133,
                "end_time": 36.215,
                "text": "IF THERE'S NO OBJECTION TO THE MINUTES OF THE JULY 8, 2019 MEETING, IT'LL BE APPROVED."
            },
            {
                "start_time": 37.813,
                "end_time": 42.418,
                "text": "SEEING NO OBJECTIONS, THOSE MINUTES ARE BEING APPROVED."
            },
            {
                "start_time": 43.926,
                "end_time": 47.187,
                "text": "I JUST WANT TO MENTION -- I'M SORRY?"
            }
        ],
        [
            {
                "start_time": 47.514,
                "end_time": 49.516,
                "text": "I MAY HAVE AN OBJECTION."
            }
        ]
    ]
}
The above is semi-fake data, but the gist is basically: the data portion of the transcript JSON becomes a list of lists, where each inner list represents the sentences each speaker said during their turn. This means we can still jump to specific portions of the video regardless of speaker, but we have the speaker blocks as well.
Actually, thinking on this further: to future-proof us, we should have a single timestamped-speaker-sentences format. It would be a list of dictionaries that each have a data block. This allows us to annotate speakers in the future.
{
    "format": "timestamped-speaker-turns",
    "annotations": [],
    "confidence": 1,
    "data": [
        {
            "speaker": "",
            "data": [
                {
                    "start_time": 13.213,
                    "end_time": 14.948,
                    "text": "OKAY, GOOD MORNING."
                }
            ]
        },
        {
            "speaker": "",
            "data": [
                {
                    "start_time": 14.948,
                    "end_time": 15.348,
                    "text": "GOOD MORNING."
                }
            ]
        },
        {
            "speaker": "",
            "data": [
                {
                    "start_time": 15.348,
                    "end_time": 22.314,
                    "text": "THANKS FOR BEING HERE FOR OUR REGULAR SCHEDULED BRIEFING ON JULY 15."
                },
                {
                    "start_time": 23.542,
                    "end_time": 29.888,
                    "text": "WE WERE JOINED BY COUNCIL MEMBER BAGSHAW, PACHECO, JUAREZ, AND GONZALEZ."
                },
                {
                    "start_time": 31.133,
                    "end_time": 36.215,
                    "text": "IF THERE'S NO OBJECTION TO THE MINUTES OF THE JULY 8, 2019 MEETING, IT'LL BE APPROVED."
                },
                {
                    "start_time": 37.813,
                    "end_time": 42.418,
                    "text": "SEEING NO OBJECTIONS, THOSE MINUTES ARE BEING APPROVED."
                },
                {
                    "start_time": 43.926,
                    "end_time": 47.187,
                    "text": "I JUST WANT TO MENTION -- I'M SORRY?"
                }
            ]
        },
        {
            "speaker": "",
            "data": [
                {
                    "start_time": 47.514,
                    "end_time": 49.516,
                    "text": "I MAY HAVE AN OBJECTION."
                }
            ]
        }
    ]
}
Use Case
It would be nice if our transcript format included a timestamped-speaker-turns format, where each timestamped item in transcript.data is a speaker turn.
Solution
seattlechannel.org sometimes provides vtt caption files that we can use to create timestamped speaker turns.
To do this we need to:
1. Modify SeattleEventScraper to get the vtt file. Specifically, information about the vtt file can be found in the same place as information about the video_uri. I'm looking at this particular line in SeattleEventScraper: video = video_and_thumbnail.find("a").get("onclick"). Example caption file: https://seattlechannel.org/documents/seattlechannel/closedcaption/2019/brief_071519_2011955.vtt
2. Create a new SRModel whose transcribe function will take the caption_uri and create a timestamped-speaker-turns transcript. The transcribe function will use webvtt-py to parse the vtt file and then leverage the fact that a new speaker turn begins with ">>".
3. Modify EventGatherPipeline to use the caption_uri and the new SRModel to create a timestamped-speaker-turns transcript if the caption_uri is valid. If the caption_uri is invalid, use the other SRModel to create transcripts.
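The SRModel step above can be sketched as follows. `group_speaker_turns` is a hypothetical helper: it takes (start_time, end_time, text) tuples, which webvtt-py's parsed captions would supply, and groups cues into the proposed timestamped-speaker-turns shape using the ">>" convention:

```python
from typing import Any, Dict, List, Tuple


def group_speaker_turns(cues: List[Tuple[float, float, str]]) -> Dict[str, Any]:
    """Group caption cues into speaker turns; a cue starting with ">>"
    opens a new turn. Speaker names are left blank for future annotation."""
    turns: List[Dict[str, Any]] = []
    for start, end, text in cues:
        if text.startswith(">>") or not turns:
            turns.append({"speaker": "", "data": []})
            text = text.lstrip("> ").strip()
        turns[-1]["data"].append(
            {"start_time": start, "end_time": end, "text": text}
        )
    return {
        "format": "timestamped-speaker-turns",
        "annotations": [],
        "confidence": 1,
        "data": turns,
    }
```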