CouncilDataProject / cdptools_v2

Tools you can use to interact with and run Council Data Project instances.

Add timestamped speaker turns to transcript format #100

Closed. tohuynh closed this issue 4 years ago.

tohuynh commented 4 years ago

Use Case

It would be nice if our transcript format included timestamped speaker turns, where each timestamped item in transcript.data is a speaker turn.

Solution

seattlechannel.org sometimes provides vtt caption files that we can use to create timestamped speaker turns.

To do this we need to:

  1. Modify SeattleEventScraper to get the vtt file. Specifically, information about the vtt file can be found in the same place as information about the video_uri. I'm looking at this particular line in SeattleEventScraper: video = video_and_thumbnail.find("a").get("onclick")
  2. Add a caption_uri field to each scraped event. For example, caption_uri could be: https://seattlechannel.org/documents/seattlechannel/closedcaption/2019/brief_071519_2011955.vtt
  3. Add a new module that implements SRModel and whose transcribe function takes the caption_uri and creates a timestamped-speaker-turns transcript. The transcribe function would use webvtt-py to parse the vtt file and leverage the fact that a new speaker turn begins with >> (see the sketch after this list).
  4. Modify EventGatherPipeline to use caption_uri and the new SRModel to create a timestamped-speaker-turns transcript if caption_uri is valid. If caption_uri is invalid, fall back to the other SRModel to create transcripts.
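
A hedged sketch of what the parsing in step 3 could look like (the helper names are hypothetical, not existing cdptools code; timestamps are converted by hand rather than relying on a particular webvtt-py version):

import webvtt

def _timestamp_to_seconds(ts: str) -> float:
    # webvtt timestamps look like "00:00:13.213"
    hours, minutes, seconds = ts.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)

def vtt_to_speaker_turns(vtt_path: str):
    """Group captions into speaker turns; a new turn begins with '>>'."""
    turns = []
    current = None
    for caption in webvtt.read(vtt_path):
        text = caption.text.strip()
        if text.startswith(">>"):
            # A '>>' marker means a new speaker turn; close out the old one
            if current is not None:
                turns.append(current)
            current = {
                "start_time": _timestamp_to_seconds(caption.start),
                "end_time": _timestamp_to_seconds(caption.end),
                "text": text.lstrip("> ").strip(),
            }
        elif current is not None:
            # Continuation of the current speaker's turn
            current["end_time"] = _timestamp_to_seconds(caption.end)
            current["text"] += " " + text
    if current is not None:
        turns.append(current)
    return turns
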
evamaxfield commented 4 years ago

This is bad on my part for not being more transparent about project planning. Our fork of PyWebVTT is exactly why we are trying to get this working so thank you for adding the issue here. (I will try to be better about issue tracking and project management in the future)

In short, you totally get the steps needed to set this up, with a few more considerations.

Step Comments:

  1. Modify SeattleEventScraper. Yep, that's exactly correct. I am not personally seeing a caption URI in the onclick portion of the tag, though. A link or more details on how you found that would be greatly appreciated.

  2. Correct.

  3. Totally correct. A new WebVTTConversionModel (or similar name) would be needed. Side comment: the documentation page on transcript formats should be updated in general, but especially after this.

  4. Agree. However, as closed caption files were only added in mid-2019 (and to make the system more robust for other cities as well), you would need a try-catch or an if-else. This would propagate a change to the pipeline configuration JSON files as well, which would need to say something like:

    "speech_recognition_model": {
    "try": {
        "module_path": "cdptools.sr_models.web_vtt_conversion_model",
        "object_name": "WebVTTConversionModel",
        "object_kwargs": {}  
    },
    "catch": {
        "module_path": "cdptools.sr_models.google_cloud_sr_model",
        "object_name": "GoogleCloudSRModel",
        "object_kwargs": {
            "credentials_path": "/home/cdptools/credentials/cdp_seattle_cloud_platform_all.json"
        }  
    }
    }
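
A hedged sketch of how the pipeline might interpret that "try"/"catch" config (load_sr_model and transcribe_with_fallback are hypothetical names, and the transcribe signatures are simplified, not the exact cdptools API):

import importlib

def load_sr_model(block: dict):
    # Dynamically import the configured SR model class and instantiate it
    module = importlib.import_module(block["module_path"])
    model_cls = getattr(module, block["object_name"])
    return model_cls(**block.get("object_kwargs", {}))

def transcribe_with_fallback(config: dict, caption_uri: str, audio_uri: str):
    sr_config = config["speech_recognition_model"]
    try:
        # Caption files only exist for mid-2019 onward, so this can fail
        model = load_sr_model(sr_config["try"])
        return model.transcribe(caption_uri)
    except Exception:
        # Fall back to standard speech recognition on the audio
        model = load_sr_model(sr_config["catch"])
        return model.transcribe(audio_uri)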

Other general comments: I don't personally like that closed caption files are in all capitals, so some work to make them more "normal" would be great. I think the Python library nltk has some solutions for that.

evamaxfield commented 4 years ago

From @tohuynh: [screenshot]

The .vtt file is also found in the onclick but does not have a fully resolved URI.

@tohuynh how did you figure out the fully resolved URI in this case: https://seattlechannel.org/documents/seattlechannel/closedcaption/2019/brief_121619_2011995.vtt ?

evamaxfield commented 4 years ago

Discovered it for myself. In the script for the banner video, a more fully resolved URI exists. Example banner video script:


            var embedCode = '<iframe src="http://seattlechannel.org/embedvideoplayer?videoid=x109072" frameborder="0" scrolling="no" style="min-height: 270px; min-width: 480px"></iframe>';
            var shareLink = "http://seattlechannel.org/FullCouncil?videoid=x109072";

            try {
                jwplayer('vidPlayer').remove();
            }
            catch(emsg){};

            var playerInstance = jwplayer('vidPlayer');
            playerInstance.setup({
            sources: [
                {
                    file: "//video.seattle.gov/media/council/council_121619_2021995V.mp4",
                    label: "Auto"
                }
            ],
            image: "images/seattlechannel/videos/2019/Q4/council_121619.jpg",
            primary: "html5",

                tracks: [{
                    file: "documents/seattlechannel/closedcaption/2019/council_121619_2021995.vtt",
                    label: "English",
                    kind: "captions",
                    "default": true
                }

                 ], 
                sharing: {
                        code: encodeURI(embedCode),
                        link: shareLink
                    },
                ga: {
                    idstring:'City Council '
                }
            });
            playerInstance.on('complete', function () {
                $('#vidPlayer').append($('.overlayBox'));
            });
            playerInstance.on('beforePlay', function () {

            });
            playerInstance.on('play', function () {
                $('.overlayBox').hide();
            });
            playerInstance.on('error', function (message) {
                if (!$('img#videoError').length) {
                    $('#vidPlayer').after("<img id='videoError' class='img-responsive' src='images/seattlechannel/videoimages/channelGeneric.jpg' alt='There was an error' />");
                }
                else {
                    $('img#videoError').show();
                }
                $('#vidPlayer').hide();

            });
            playerInstance.on('setupError', function () {
                if (!$('img#videoError').length) {
                    $('#vidPlayer').after("<img id='videoError' class='img-responsive' src='images/seattlechannel/videoimages/channelGeneric.jpg' alt='There was an error' />");
                }
                else {
                    $('img#videoError').show();
                }
                $('#vidPlayer').hide();
            });
            playerInstance.on('ready', function () {
                $(".programImage .overlayBox").show();
                $(".VideoComponent .overlayBox").show();
            });

            jwplayer('vidPlayer').addButton(
            //This portion is what designates the graphic used for the button
            "images/seattlechannel/download.png",
            //This portion determines the text that appears as a tooltip
            "Download Video",
            //This portion designates the functionality of the button itself
            function () {
                //With the below code, we're grabbing the file that's currently playing
                window.location.href = jwplayer('vidPlayer').getPlaylistItem()['file'];
            },
            //And finally, here we set the unique ID of the button itself.
              'downloadvidPlayer'

            );

            $(".podcastContainer").hide();

Pulled From: Full Council
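
A rough sketch of pulling the caption track path out of that page script and resolving it against the site root (the pattern and helper name are illustrative only):

import re
from urllib.parse import urljoin

TRACK_PATTERN = re.compile(
    r'file:\s*"(documents/seattlechannel/closedcaption/[^"]+\.vtt)"'
)

def find_caption_uri(page_script: str):
    match = TRACK_PATTERN.search(page_script)
    if match is None:
        # No caption track on this page (e.g. pre-mid-2019 videos)
        return None
    # The tracks entry is site-relative, so resolve it against the site root
    return urljoin("https://seattlechannel.org/", match.group(1))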

tohuynh commented 4 years ago

Yes, the more resolved caption URI is also there in the script.

But I thought I could just prepend https://seattlechannel.org/documents/seattlechannel/closedcaption/ to the found URI.

I also don't like the all-caps captions. I think the NLP term for fixing that is truecasing.

We could do something simple like sentence segmentation and part-of-speech tagging from nltk: the first letter of each sentence and proper nouns get capitalized. A minimal sketch of that idea is below.
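
This sketch assumes nltk with the "punkt" and "averaged_perceptron_tagger" data downloaded. Note that tagging all-caps input is unreliable, so the text is lowercased first, and re-joining tokens with spaces is lossy around punctuation; a real implementation would detokenize properly:

import nltk

def simple_truecase(text: str) -> str:
    sentences = []
    for sent in nltk.sent_tokenize(text.lower()):
        tagged = nltk.pos_tag(nltk.word_tokenize(sent))
        # Capitalize proper nouns (NNP/NNPS); lowercased input weakens the
        # tagger here, which is why this is only a rough baseline
        words = [
            token.capitalize() if tag in ("NNP", "NNPS") else token
            for token, tag in tagged
        ]
        # Always capitalize the first word of the sentence
        words[0] = words[0].capitalize()
        sentences.append(" ".join(words))
    return " ".join(sentences)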

Or we could use Stanford CoreNLP, which has a truecasing annotator.

But I think we should split truecasing out into another issue.

evamaxfield commented 4 years ago

Oh, totally agree on just prepending https://seattlechannel.org/documents/seattlechannel/closedcaption/ to the URI. I was just very confused as to how you found that prefix, haha.

I would prefer to keep using nltk instead of introducing yet another heavy dependency. Here is a stackoverflow post on how to do it with nltk: Link

I think the desire for our fork of webvtt was to make a function that converts into our transcript format. But I think you are correct in that the vtt_parse_cdp function (or whatever it will be called) should simply live on the SRModel instead of in our forked library. There was some starting work on that under the WebVTT.read_cdp function.

tohuynh commented 4 years ago

[screenshot]

I was able to find the full URI by looking under the Network tab and then finding the vtt file.

evamaxfield commented 4 years ago

Example of how to convert to the potential format (from To), here. (Copy and paste that file into a .ipynb file and then open it in Jupyter Lab.)

Produced JSON:

{
  "format": "timestamped-speaker-turns",
  "annotations": [],
  "confidence": 1,
  "data": [
    {
      "start_time": 13.213,
      "end_time": 14.948,
      "text": "OKAY, GOOD MORNING."
    },
    {
      "start_time": 14.948,
      "end_time": 15.348,
      "text": "GOOD MORNING."
    },
    {
      "start_time": 15.348,
      "end_time": 47.514,
      "text": "THANKS FOR BEING HERE FOR OUR REGULAR SCHEDULED BRIEFING ON JULY 15. A FEW THINGS JUST TO MENTION BEFORE WE GO AROUND THE TABLE. WE WERE JOINED BY COUNCIL MEMBER BAGSHAW, PACHECO, JUAREZ, AND GONZALEZ. IF THERE'S NO OBJECTION TO THE MINUTES OF THE JULY 8, 2019 MEETING, IT'LL BE APPROVED. SEEING NO OBJECTIONS, THOSE MINUTES ARE BEING APPROVED. I JUST WANT TO MENTION  -- I'M SORRY?"
    },
    {
      "start_time": 47.514,
      "end_time": 49.516,
      "text": "I MAY HAVE AN OBJECTION."
    }
  ]
}
evamaxfield commented 4 years ago

@tohuynh A couple comments on the produced example JSON.

It would be great to keep sentences split. So something like:

{
  "format": "timestamped-speaker-turns",
  "annotations": [],
  "confidence": 1,
  "data": [
    [
        {
          "start_time": 13.213,
          "end_time": 14.948,
          "text": "OKAY, GOOD MORNING."
        }
    ],
    [
        {
          "start_time": 14.948,
          "end_time": 15.348,
          "text": "GOOD MORNING."
        }
    ],
    [
        {
          "start_time": 15.348,
          "end_time": 22.314,
          "text": "THANKS FOR BEING HERE FOR OUR REGULAR SCHEDULED BRIEFING ON JULY 15."
        },
        {
          "start_time": 23.542,
          "end_time": 29.888,
          "text": "WE WERE JOINED BY COUNCIL MEMBER BAGSHAW, PACHECO, JUAREZ, AND GONZALEZ."
        },
        {
          "start_time": 31.133,
          "end_time": 36.215,
          "text": "IF THERE'S NO OBJECTION TO THE MINUTES OF THE JULY 8, 2019 MEETING, IT'LL BE APPROVED."
        },
        {
          "start_time": 37.813,
          "end_time": 42.418,
          "text": "SEEING NO OBJECTIONS, THOSE MINUTES ARE BEING APPROVED."
        },
        {
          "start_time": 43.926,
          "end_time": 47.187,
          "text": "I JUST WANT TO MENTION -- I'M SORRY?"
        }
    ],
    [
        {
          "start_time": 47.514,
          "end_time": 49.516,
          "text": "I MAY HAVE AN OBJECTION."
        }
    ]
  ]
}

The above is semi-fake data, but the gist is that the data portion of the transcript JSON becomes a list of lists, where each inner list holds the sentences a speaker said during their turn. This means we can still jump to specific portions of the video regardless of speaker, but we have the speaker blocks as well.
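
A sketch of shaping flat speaker turns into that list-of-lists layout. It naively treats a sentence-ending caption boundary as a sentence break (splitting sentences inside a single caption is left out), and turn_captions_to_sentences is an illustrative name, not a cdptools API:

def turn_captions_to_sentences(captions):
    """captions: list of {"start_time", "end_time", "text"} in one turn."""
    sentences = []
    current = None
    for cap in captions:
        if current is None:
            current = dict(cap)
        else:
            # Extend the in-progress sentence with this caption
            current["end_time"] = cap["end_time"]
            current["text"] += " " + cap["text"]
        # Close the sentence when the accumulated text ends in punctuation
        if current["text"].rstrip().endswith((".", "?", "!")):
            sentences.append(current)
            current = None
    if current is not None:
        sentences.append(current)
    return sentences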

evamaxfield commented 4 years ago

Actually, thinking on this further: to future-proof us, we should have a single timestamped-speaker-sentences format where data is a list of dictionaries that each have their own data block. This is to allow us to annotate speakers in the future.

{
  "format": "timestamped-speaker-turns",
  "annotations": [],
  "confidence": 1,
  "data": [
    {
      "speaker": "",
      "data": [
        {
          "start_time": 13.213,
          "end_time": 14.948,
          "text": "OKAY, GOOD MORNING."
        }
      ]
    },
    {
      "speaker": "",
      "data": [
        {
          "start_time": 14.948,
          "end_time": 15.348,
          "text": "GOOD MORNING."
        }
      ]
    },
    {
      "speaker": "",
      "data": [
        {
          "start_time": 15.348,
          "end_time": 22.314,
          "text": "THANKS FOR BEING HERE FOR OUR REGULAR SCHEDULED BRIEFING ON JULY 15."
        },
        {
          "start_time": 23.542,
          "end_time": 29.888,
          "text": "WE WERE JOINED BY COUNCIL MEMBER BAGSHAW, PACHECO, JUAREZ, AND GONZALEZ."
        },
        {
          "start_time": 31.133,
          "end_time": 36.215,
          "text": "IF THERE'S NO OBJECTION TO THE MINUTES OF THE JULY 8, 2019 MEETING, IT'LL BE APPROVED."
        },
        {
          "start_time": 37.813,
          "end_time": 42.418,
          "text": "SEEING NO OBJECTIONS, THOSE MINUTES ARE BEING APPROVED."
        },
        {
          "start_time": 43.926,
          "end_time": 47.187,
          "text": "I JUST WANT TO MENTION -- I'M SORRY?"
        }
      ]
    },
    {
      "speaker": "",
      "data": [
        {
          "start_time": 47.514,
          "end_time": 49.516,
          "text": "I MAY HAVE AN OBJECTION."
        }
      ]
    }
  ]
}
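
A tiny sketch of wrapping the per-turn sentence lists into this speaker-annotated layout (to_speaker_blocks is an illustrative name; speakers stay empty until annotation is possible):

def to_speaker_blocks(sentence_lists):
    # Each turn's sentences become one block with a placeholder speaker
    return [{"speaker": "", "data": sentences} for sentences in sentence_lists]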