Closed evamaxfield closed 1 year ago
I can work on this on weekends. Having said that, between work and school I have little free time, unfortunately. So @Shak2000 if you do have time and are willing, that would be better. I will wait until this weekend and then go from there.
Thank you Eva!
Let me take a look
I am not very familiar with transcripts, so I want to first write a unit test for this case. To make this possible, could you please send me a link to example pages on the Oakland website, as well as an example transcript or audio file?
@Shak2000 example webpage with transcript: https://councildataproject.org/oakland/#/events/e0912a619d69
script for downloading oakland resources:
from cdp_backend.database import models as db_models
from cdp_backend.utils import file_utils
from cdp_backend.pipeline.transcript_model import Transcript
import fireo
from gcsfs import GCSFileSystem
from google.auth.credentials import AnonymousCredentials
from google.cloud.firestore import Client

# Connect to the database
fireo.connection(client=Client(
    project="cdp-oakland-ba81c097",
    credentials=AnonymousCredentials()
))

# Get event
event = db_models.Event.collection.get("event/e0912a619d69")

# Get session
session = list(
    db_models.Session.collection.filter("event_ref", "==", event.key).fetch()
)[0]

# Get transcript
transcript = list(
    db_models.Transcript.collection.filter("session_ref", "==", session.key).fetch()
)[0]
transcript_file = transcript.file_ref.get()

# Connect to filestore
fs = GCSFileSystem(project="cdp-oakland-ba81c097", token="anon")

# Download transcript
fs.get(transcript_file.uri, "oakland-transcript.json")

# Download captions
file_utils.resource_copy(
    "http://oakland.granicus.com//videos/5042/captions.vtt",
    "oakland-captions.vtt",
    overwrite=True,
)

# If you want to read the transcript as a Python object
with open("oakland-transcript.json", "r") as open_f:
    read_transcript = Transcript.from_json(open_f.read())

print(read_transcript)
Copy and paste that into a file and run it with Python; it will download two files, oakland-captions.vtt and oakland-transcript.json.
I managed to add the unit tests to test_webvtt_sr_model.py, and they ran successfully.
I can now debug the transcribe function in webvtt_sr_model.py. Before I debug, I would like to better understand why we implemented a different model. If there is any documentation, I would be happy to read it.
Before I debug, I would like to better understand why we implemented a different model. If there is any documentation, I would be happy to read
I don't follow?
We haven't implemented a different model? What model are you talking about?
One format/model is the vtt we get from the municipality, which we then convert into our own model. What is the logic behind this?
For multiple reasons:
Short answer: "do the conversion during the pipeline so we don't have to have MORE code and duplicate processing downstream."
Our format is much more extensive; we can add annotations and more analysis-driven data to it. VTT is really just a text + timestamp format.
Analysis:
1) >> does indeed represent a change in the speaker. I listened to the video and compared it to the .vtt file.
2) In the .vtt file, there is no hint as to who the speakers are. Thus, the transcript records every participant in the conversation with:
"speaker_index":0,
"speaker_name":null,
3) There are a few places where the .vtt file has a -. It seems that this occurs whenever there is a disconnect in the audio or the sound is unclear. The transcriber keeps the - as it is.
4) There are 4 places (indices 33, 47, 49 and 97) where the .vtt file has a - followed by a change in the speaker (the next word starts with >>). In these cases, the transcriber represents the change in speaker using ->>.
Solution: I wrote a very simple solution for this problem in the '_normalize_text' function, which replaces >> with a blank. However:
1) It is too specific to the Oakland instance. I want to generalize it or enable each CDP instance to specify unwanted words.
2) I won't rush to push the specific solution because I do not have a way to test that it does not break any other CDP instance (or are we alright with removing >> for every instance?).
WDYT?
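For illustration only, here is a rough sketch of what a generalized version of that cleanup could look like. This is not the actual _normalize_text code from cdp-backend; the function name strip_turn_markers, the marker_pattern parameter, and the default pattern are all hypothetical. It assumes the markers to drop are >> and ->>, and it leaves a lone - (unclear audio) untouched.

```python
import re

# Hypothetical default: an optional "-" followed by one or more ">" and any
# trailing whitespace. An instance could pass its own pattern instead.
DEFAULT_MARKER_PATTERN = r"-?>+\s*"


def strip_turn_markers(text: str, marker_pattern: str = DEFAULT_MARKER_PATTERN) -> str:
    """Remove speaker-turn markers from text and collapse leftover whitespace."""
    # Replace each marker with a single space so neighboring words stay separated
    without_markers = re.sub(marker_pattern, " ", text)
    # Collapse any runs of whitespace created by the replacement
    return re.sub(r"\s+", " ", without_markers).strip()


print(strip_turn_markers(">> Good morning. ->> Thank you, chair."))
print(strip_turn_markers("the audio - was unclear"))
```

Note that this sketch also drops the - that precedes >>; whether that is desirable depends on whether the dash should survive as an unclear-audio marker.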
First, thank you Shak. Looks like you have figured out what those >> and - (thus ->>) are.
I can see pros and cons for your idea of allowing an instance to specify unwanted characters in transcripts. Do we do something like that already? i.e. Do we do some instance-specific "thing" during data cleanup/import into our models? If the answer is no (and it's OK if you don't know; I don't :sweat_smile:), then I think the proposal may be overengineering, i.e. just clean those out of every transcript for any instance, like you suggested, in _normalize_text() or wherever appropriate.
That's my 2 cents for now...
Similarly there are also symbols I have never seen in a closed caption file like: "->> " which I assume also denote a new speaker.
If this is true, you'd just need to add an optional - to the regex here: https://github.com/CouncilDataProject/cdp-backend/blob/main/cdp_backend/sr_models/webvtt_sr_model.py#L59?
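To make the "optional -" idea concrete, here is a small self-contained sketch. It does not reproduce the regex at the linked line; the TURN_START pattern and the group_into_turns helper are hypothetical, and the sketch assumes each caption's text begins with the marker when a new speaker starts.

```python
import re
from typing import List

# Assumed marker: an optional "-" and then ">>" at the start of a caption.
# DOTALL lets the captured text span multi-line captions.
TURN_START = re.compile(r"^-?>>\s*(.*)$", flags=re.DOTALL)


def group_into_turns(captions: List[str]) -> List[List[str]]:
    """Group caption texts into speaker turns, dropping the turn markers."""
    turns: List[List[str]] = []
    for caption in captions:
        text = caption.strip()
        match = TURN_START.match(text)
        if match:
            # Marker found: start a new turn with the text after the marker
            turns.append([match.group(1)])
        elif turns:
            # No marker: the current speaker is still talking
            turns[-1].append(text)
        else:
            # First caption has no marker: open the first turn anyway
            turns.append([text])
    return turns


captions = [">> Good morning.", "Welcome everyone.", "->> Thank you.", "Let's begin."]
for turn in group_into_turns(captions):
    print(turn)
```

With the optional -, both >> and ->> open a new turn, while a caption without a marker is appended to the current speaker's turn.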
Btw, in webvtt_sr_model.py, there is a default new turn pattern &gt; (which is >).
I want to summarize the current status: I found out that we don't need to change anything in the CDP program. The CDP instance already has a config option, caption_new_speaker_turn_pattern, that sets the pattern marking a change of speaker. As a result, I closed the old pull request and created a new one in the Oakland instance.
Handled on the Oakland instance side. Merged and released. Will check all is well after a new event is published.
I fixed the cookiecutter / infra build and deployed Oakland!
There are still minor problems with permissions / CORS settings so ignore the broken video (I will fix that tomorrow)
That all said, Oakland has closed caption files, which we parse and convert into our transcript format like we do for Seattle and Boston. Once again, there are some minor differences from Boston and Seattle that I would love to be able to fix.
example of oakland event: https://councildataproject.org/oakland/#/events/e0912a619d69
The transcript includes a bunch of ">> " markings where there are new speakers. To me these should be filtered out properly. Similarly there are also symbols I have never seen in a closed caption file like: "->> " which I assume also denote a new speaker.
cc @isaacna @Shak2000 @dphoria any of you free to take this one?