CouncilDataProject / cdp-backend

Data storage utilities and processing pipelines used by CDP instances.
https://councildataproject.org/cdp-backend
Mozilla Public License 2.0

Parse closed caption files for Oakland better #203

Closed evamaxfield closed 1 year ago

evamaxfield commented 2 years ago

I fixed the cookiecutter / infra build and deployed Oakland!

There are still minor problems with permissions / CORS settings so ignore the broken video (I will fix that tomorrow)

That all said, Oakland has closed caption files which we parse and convert into our transcript format, just like we do for Seattle and Boston. Once again, though, there are some minor differences from Boston and Seattle that I would love to be able to fix.

example of oakland event: https://councildataproject.org/oakland/#/events/e0912a619d69

The transcript includes a bunch of ">> " markings where there are new speakers. To me, these should be filtered out properly.

Similarly, there are also symbols I have never seen in a closed caption file before, like "->> ", which I assume also denote a new speaker.

cc @isaacna @Shak2000 @dphoria any of you free to take this one?

dphoria commented 2 years ago

I can work on this on weekends. Having said that, between work and school I have little free time, unfortunately. So @Shak2000 if you do have time and are willing, that would be better. I will wait until this weekend and then go from there.

Thank you Eva!

Shak2000 commented 2 years ago

Let me take a look

Shak2000 commented 2 years ago

I am not very familiar with transcripts, so I would like to first write a unit test for this case. To make this possible, could you please send me a link to some example pages on the Oakland website, as well as an example transcript or audio file?

evamaxfield commented 2 years ago

@Shak2000 example webpage with transcript: https://councildataproject.org/oakland/#/events/e0912a619d69

script for downloading oakland resources:

from cdp_backend.database import models as db_models
from cdp_backend.utils import file_utils
from cdp_backend.pipeline.transcript_model import Transcript
import fireo
from gcsfs import GCSFileSystem
from google.auth.credentials import AnonymousCredentials
from google.cloud.firestore import Client

# Connect to the database
fireo.connection(client=Client(
    project="cdp-oakland-ba81c097",
    credentials=AnonymousCredentials()
))

# Get event
event = db_models.Event.collection.get("event/e0912a619d69")

# Get session
session = list(
    db_models.Session.collection.filter("event_ref", "==", event.key).fetch()
)[0]

# Get transcript
transcript = list(
    db_models.Transcript.collection.filter("session_ref", "==", session.key).fetch()
)[0]
transcript_file = transcript.file_ref.get()

# Connect to filestore
fs = GCSFileSystem(project="cdp-oakland-ba81c097", token="anon")

# Download transcript
fs.get(transcript_file.uri, "oakland-transcript.json")

# Download captions
file_utils.resource_copy(
    "http://oakland.granicus.com//videos/5042/captions.vtt",
    "oakland-captions.vtt",
    overwrite=True,
)

# If you want to read the transcript as a Python object
with open("oakland-transcript.json", "r") as open_f:
    read_transcript = Transcript.from_json(open_f.read())

print(read_transcript)

Copy-paste that into a file and run it with Python; it will download two files, oakland-captions.vtt and oakland-transcript.json.
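
If you want to quickly see the ">> " markers that end up in the converted transcript, you can peek at the first few sentences of the downloaded file. A rough sketch, assuming the Transcript model exposes a sentences list with per-sentence text and speaker fields (as the transcript JSON does):

from cdp_backend.pipeline.transcript_model import Transcript

# Rough sketch: print the first few sentences from the downloaded transcript.
with open("oakland-transcript.json", "r") as open_f:
    transcript = Transcript.from_json(open_f.read())

for sentence in transcript.sentences[:10]:
    # Sentences converted from the Oakland captions currently still contain
    # the ">> " / "->> " speaker-turn markers in their text.
    print(sentence.speaker_index, sentence.speaker_name, sentence.text)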

Shak2000 commented 2 years ago

I managed to add the unit tests to test_webvtt_sr_model.py, and they run successfully.

I can now debug the transcribe function in webvtt_sr_model.py. Before I debug, I would like to better understand why we implemented a different model. If there is any documentation, I would be happy to read it.
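
For reference, the kind of unit test I mean is roughly the following. This is a simplified sketch, not the actual test added to test_webvtt_sr_model.py; the exact WebVTTSRModel / transcribe signatures and the project's fixtures may differ:

from cdp_backend.sr_models.webvtt_sr_model import WebVTTSRModel

# Tiny, made-up caption file with the ">>" / "->>" speaker-turn markers.
EXAMPLE_VTT = """WEBVTT

00:00:00.000 --> 00:00:03.000
>> GOOD EVENING, THIS MEETING IS CALLED TO ORDER.

00:00:03.000 --> 00:00:06.000
->> THANK YOU. FIRST ITEM ON THE AGENDA.
"""

def test_speaker_turn_markers_are_stripped(tmp_path):
    caption_file = tmp_path / "captions.vtt"
    caption_file.write_text(EXAMPLE_VTT)

    # Assumes transcribe() accepts a local file URI / path and returns a Transcript.
    transcript = WebVTTSRModel().transcribe(str(caption_file))

    for sentence in transcript.sentences:
        assert ">>" not in sentence.text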

evamaxfield commented 2 years ago

Before I debug, I would like to better understand why we implemented a different model. If there is any documentation, I would be happy to read it.

I don't follow?

We haven't implemented a different model? What model are you talking about?

Shak2000 commented 2 years ago

The format/model I mean: we get VTT from the municipality and we convert it into our own model. What is the logic behind this?

evamaxfield commented 2 years ago

For multiple reasons:

  1. VTT files are for closed caption services, so they are chunked up in sometimes really odd ways. Because we want full sentences, we store the data in sentence format.
  2. VTT files rarely have good "casing", i.e. they are usually all CAPITAL CASE. We want them to look more "transcript-y".
  3. We can write a single function to convert from VTT to our format here instead of:
     a. writing a function to render both VTT and transcript formats on the frontend
     b. writing a function to process both VTT and transcript formats in any of our analysis and processing functions

Short answer: "do the conversion during the pipeline so we don't have to have MORE code and duplicate processing downstream."

Our format is much more extensive: we can add annotations and more analysis-driven stuff to it. VTT is really just a text + timestamp format.
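
To make that concrete, here is an illustrative (entirely made-up) comparison of roughly what the same speech looks like as a VTT cue versus as a sentence in our transcript JSON; the field set shown here is not exhaustive:

# Illustrative only -- made-up caption text, and not the full set of fields
# in the real transcript model.

# A VTT cue: chunked text plus a timestamp, usually ALL CAPS.
vtt_cue = """00:01:12.000 --> 00:01:15.000
>> GOOD EVENING COUNCIL MEMBERS, THE MEETING WILL
COME TO ORDER."""

# Roughly the same content as one sentence in our transcript format:
# a full sentence, normal casing, timing, and room for speaker / annotation data.
transcript_sentence = {
    "text": "Good evening council members, the meeting will come to order.",
    "start_time": 72.0,
    "end_time": 75.0,
    "speaker_index": 0,
    "speaker_name": None,
}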

Shak2000 commented 2 years ago

Analysis:

  1. >> does indeed represent a change of speaker. I listened to the video and compared it to the .vtt file.
  2. In the .vtt file, there is no hint as to who the speakers are. Thus, the transcript includes all the participants in the conversation with "speaker_index": 0, "speaker_name": null.
  3. There are a few places where the .vtt file has a -. It seems that this occurs whenever there is a disconnect in the audio or the sound is unclear. The transcriber keeps the - as is.
  4. There are 4 places (indices 33, 47, 49, and 97) where the .vtt file has a - immediately before a change of speaker (the next word starts with >>). In these cases, the transcriber represents the change of speaker using ->>.

Solution: I wrote a very simple fix for this problem in the '_normalize_text' function, which replaces >> with a blank (a minimal sketch of the idea is below). However:

  1. It is too specific to the Oakland instance. I want to generalize it or enable each CDP instance to specify unwanted words.
  2. I won't rush to push this specific solution because I do not have a way to test that it did not break any other CDP instance (or are we alright with removing >> for every instance?).
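
Minimal sketch of the idea, not the actual cdp-backend code (the real '_normalize_text' logic may look different):

import re

# Minimal sketch, not the actual cdp-backend implementation: strip the
# "->> " / ">> " speaker-turn markers from caption text before it is
# stored as transcript sentences.
UNWANTED_MARKERS = re.compile(r"-?>{2,}\s*")

def _normalize_text(text: str) -> str:
    return UNWANTED_MARKERS.sub("", text).strip()

# e.g. _normalize_text("->> THANK YOU.") == "THANK YOU."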

WDYT?

dphoria commented 2 years ago

First, thank you Shak. Looks like you have figured out what those >> and - (thus ->>) are.

I can see pros and cons for your idea of allowing an instance to specify unwanted characters in transcripts. Do we do something like that already? i.e. do we do some instance-specific "thing" during data cleanup/import into our models? If the answer is no (and it's OK if you don't know; I don't :sweat_smile: ), then I think the proposal may be over-engineering, i.e. just clean those out of every transcript for any instance, like you suggested, in _normalize_text() or wherever appropriate.

That's my 2 cents for now...

tohuynh commented 2 years ago

Similarly there are also symbols I have never seen in a closed caption file like: "->> " which I assume also denote a new speaker.

If this is true, you'd just need to add an optional - to the regex here: https://github.com/CouncilDataProject/cdp-backend/blob/main/cdp_backend/sr_models/webvtt_sr_model.py#L59?

Btw, in webvtt_sr_model.py, there is a default new turn pattern &gt; (which is >).
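
Something like this, assuming the existing pattern essentially matches repeated > characters (the exact default pattern in webvtt_sr_model.py may differ):

import re

# Illustrative only: if the current new-turn pattern matches ">>"-style
# markers, an optional leading "-" would also catch the "->>" variant.
new_turn_pattern = re.compile(r"-?(>{2,})")

assert new_turn_pattern.match(">> HELLO")
assert new_turn_pattern.match("->> HELLO")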

Shak2000 commented 1 year ago

To summarize the current status: I found out that we don't need to change anything in cdp-backend itself. There is already a config option on the CDP instance side, caption_new_speaker_turn_pattern, for setting the change-of-speaker pattern. As a result, I closed the old pull request and created a new one in the Oakland instance.
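
Illustrative only: the instance-side fix boils down to setting that pattern in the Oakland instance's event gather configuration. The file name, surrounding keys, and the exact pattern value below are assumptions; only the caption_new_speaker_turn_pattern name comes from this thread:

# Illustrative sketch of the instance-side setting (shown as a Python dict).
# Only the caption_new_speaker_turn_pattern key name comes from this thread;
# the pattern value and other keys are assumptions.
event_gather_config = {
    # ... other instance settings ...
    "caption_new_speaker_turn_pattern": r"-?(&gt;|>){2,}",
}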

evamaxfield commented 1 year ago

Handled on the Oakland instance side. Merged and released. Will check all is well after a new event is published.