isaacna opened this issue 3 years ago
@JacksonMaxfield feel free to add any thoughts/ideas!
I think the most difficult part will be attaching identities to transcripts that are already speaker-differentiated.
One option is to use GCP's built-in speaker diarization to separate the speakers. We could also create our own audio classification model, or use something like Prodigy to annotate the data, though I believe they have their own diarization/transcription models as well.
We can't use speaker diarization. It has a limit of 8 speakers because it uses some clustering under the hood or they are just arbitrarily imposing some limit.
The best method in my mind would be to fine-tune a speech classification model with our own labels (gonzalez, sawant, mosqueda, etc.).
https://huggingface.co/superb/wav2vec2-base-superb-sid
Here is a decent notebook on using the Hugging Face / Transformers API to fine-tune an existing transformer: https://colab.research.google.com/github/huggingface/notebooks/blob/master/examples/audio_classification.ipynb
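To make that concrete, a minimal sketch of the fine-tuning with the Transformers `Trainer` API might look like the following. The dataset layout, label names, and hyperparameters are placeholders, not decided values:

```python
# Minimal fine-tuning sketch following the linked audio-classification notebook.
# Dataset layout, label names, and hyperparameters are placeholders.
from datasets import load_dataset, Audio
from transformers import (
    AutoFeatureExtractor,
    AutoModelForAudioClassification,
    TrainingArguments,
    Trainer,
)

labels = ["gonzalez", "sawant", "mosqueda"]  # example council-member labels
label2id = {l: i for i, l in enumerate(labels)}
id2label = {i: l for l, i in label2id.items()}

# Hypothetical dataset: an "audiofolder" with one subdirectory per speaker.
ds = load_dataset("audiofolder", data_dir="training-data/").cast_column(
    "audio", Audio(sampling_rate=16_000)
)

extractor = AutoFeatureExtractor.from_pretrained("facebook/wav2vec2-base")
model = AutoModelForAudioClassification.from_pretrained(
    "facebook/wav2vec2-base",
    num_labels=len(labels),
    label2id=label2id,
    id2label=id2label,
)

def preprocess(batch):
    # Pad/truncate every clip to 5 seconds of 16 kHz audio.
    audio = [x["array"] for x in batch["audio"]]
    return extractor(
        audio,
        sampling_rate=16_000,
        max_length=16_000 * 5,
        truncation=True,
        padding="max_length",
    )

ds = ds.map(preprocess, batched=True, remove_columns=["audio"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="speaker-clf", num_train_epochs=5),
    train_dataset=ds["train"],
)
trainer.train()
```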
The idea in my head is:
(I am arbitrarily choosing 5 seconds here because I don't know how long an audio clip these pretrained audio classifiers allow in / how much memory is needed to train -- if we can fit a 30-second audio clip, then just do that and replace any reference I make to 5 seconds with 30 seconds -- find the audio clip size that fits in memory and performs well.)
Evaluate the model. Until we have a workable model that accurately predicts which council member is speaking at any given time (say 98%+ of the time?), we don't really care about how it's applied.
If we don't hit the mark, we may need more (and more varied) data.
BUT.
if we were to talk about how to apply this model in production, I would say:
Take a transcript and loop over each sentence. If the sentence timespan is under 5 seconds, just run it through the trained model.
If the sentence timespan is over 5 seconds, chunk the sentence into multiple 5-second spans (or close to that), predict all chunks, and return the most common predicted value from the group as the sentence speaker. If our sentence splits are accurate (which they generally are from what I have seen), this should just act as a safety measure.
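A rough sketch of that chunk-and-vote idea, assuming a Hugging Face `audio-classification` pipeline and pydub for cutting clips (the model path, audio file, and transcript plumbing are all placeholders):

```python
# Sketch of "chunk long sentences and take the majority vote".
# The model path and file names here are hypothetical.
from collections import Counter

from pydub import AudioSegment
from transformers import pipeline

CHUNK_S = 5  # seconds per chunk fed to the classifier

classifier = pipeline("audio-classification", model="trained-speaker-model/")
audio = AudioSegment.from_file("event-audio.wav")

def classify_sentence(start_s: float, end_s: float) -> str:
    """Predict the speaker for one sentence's time span."""
    chunk_labels = []
    t = start_s
    while t < end_s:
        clip = audio[int(t * 1000):int(min(t + CHUNK_S, end_s) * 1000)]
        clip.export("/tmp/clip.wav", format="wav")
        # pipeline returns a list of {"label", "score"} sorted by score
        chunk_labels.append(classifier("/tmp/clip.wav")[0]["label"])
        t += CHUNK_S
    # Majority vote across chunks; a single chunk just returns its own label.
    return Counter(chunk_labels).most_common(1)[0][0]
```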
thoughts?
> We can't use speaker diarization. It has a limit of 8 speakers because it uses some clustering under the hood or they are just arbitrarily imposing some limit.
From the code example here they set the speaker count to 10, so the limit may be higher? But if there is any limit at all, then yeah, I agree it's probably just better to create our own model.
> Here is a decent notebook on using the Hugging Face / Transformers API to fine-tune an existing transformer: https://colab.research.google.com/github/huggingface/notebooks/blob/master/examples/audio_classification.ipynb
This seems like a really good resource! Just from skimming it seems pretty detailed and straightforward.
> If the sentence timespan is over 5 seconds, chunk the sentence into multiple 5-second spans (or close to that), predict all chunks, and return the most common predicted value from the group as the sentence speaker. If our sentence splits are accurate (which they generally are from what I have seen), this should just act as a safety measure.
I'm assuming there's some library that makes it easy to split the audio file into chunks based on timestamps? Also, I do think this puts faith in sentence split accuracy. One thing we could consider is also enabling GCP speaker diarization and using that to cross-check the sentence splits? We may not need it, and we wouldn't be using the diarization for identification necessarily, but it could help make the audio clips we feed to the model higher quality.
Also, once we find the speaker, are we writing it to the existing transcript as well (I see that our current transcript has `speaker_name` as null from the diarization we opt out of)? Or did you only want to create a separate output that stores a speaker-to-clip relation like you mention in the roadmap issue?
> From the code example here they set the speaker count to 10, so the limit may be higher? But if there is any limit at all, then yeah, I agree it's probably just better to create our own model.
My main concern is that using Google will cost way more than training our own model and applying it during a GH action.
> I'm assuming there's some library that makes it easy to split the audio file into chunks based on timestamps? Also, I do think this puts faith in sentence split accuracy. One thing we could consider is also enabling GCP speaker diarization and using that to cross-check the sentence splits? We may not need it, and we wouldn't be using the diarization for identification necessarily, but it could help make the audio clips we feed to the model higher quality.
I agree that it puts faith in our sentence split accuracy. I think the interesting test will be to see how many of the sentences longer than 5 seconds (the ones we chunk and then predict chunk by chunk) end up with multiple people classified for the same sentence.
> Also, once we find the speaker, are we writing it to the existing transcript as well (I see that our current transcript has `speaker_name` as null from the diarization we opt out of)? Or did you only want to create a separate output that stores a speaker-to-clip relation like you mention in the roadmap issue?
I would store it to the transcript at `speaker_name`, but again, this is all "application" after we evaluate whatever model we have.
> My main concern is that using Google will cost way more than training our own model and applying it during a GH action.
That makes sense; we should try to keep the cost as low as possible.
> I think the interesting test will be to see how many of the sentences longer than 5 seconds (the ones we chunk and then predict chunk by chunk) end up with multiple people classified for the same sentence.
Yeah, we could do some testing and adjust the time interval too if 5 seconds doesn't seem to be accurate.
Adding more here.
Don't worry about it now, but just think about it:
I think the problem will become: "how little training data can we use to fine-tune a pre-existing model to get 98% accuracy or better" and "how can we create this model fine-tuning system in the cookiecutter"
In my head, the cookiecutter should have a folder in the Python dir called something like `models`, and then subdirs in there, one for each custom model we have. So:
```
instance-name/
    web/
    python/
        models/
            speaker-classification/
                data/
                    {person-id-from-db}/
                        0.mp3
                        1.mp3
                    {person-id-from-db}/
                        0.mp3
                        1.mp3
```
Then we have a cron job (or on-push trigger) that trains from this data? But this depends on how much data is needed. If a ton of data is needed for training, then I don't think we should push all of these mp3s into the repo.
We could construct a config file in the `speaker-classification` folder, i.e.:
```json
{
    "{person-id-from-db}": {
        "{some-meeting-id}": [
            {"start-time": 0.0, "end-time": 3.2},
            {"start-time": 143.6, "end-time": 147.3}
        ],
        "{different-meeting-id}": [
            {"start-time": 1.4, "end-time": 3.2},
            {"start-time": 578.9, "end-time": 580.0}
        ]
    },
    "{different-person-id-from-db}": {
        "{some-meeting-id}": [
            {"start-time": 44.3, "end-time": 47.1},
            {"start-time": 222.5, "end-time": 227.2}
        ],
        "{different-meeting-id}": [
            {"start-time": 12.3, "end-time": 14.5},
            {"start-time": 781.1, "end-time": 784.2}
        ]
    }
}
```
And the training process pulls the audio files, finds those snippets, cuts them, then trains from them.
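Something like this sketch, where `get_audio_path` stands in for however an instance resolves a meeting id to its audio file (that lookup, the config path, and the output layout are all placeholders):

```python
# Sketch of turning the config above into training clips under data/.
import json
from pathlib import Path

from pydub import AudioSegment

def get_audio_path(meeting_id: str) -> str:
    raise NotImplementedError  # instance-specific lookup, placeholder

config = json.loads(Path("speaker-classification/config.json").read_text())

for person_id, meetings in config.items():
    out_dir = Path("speaker-classification/data") / person_id
    out_dir.mkdir(parents=True, exist_ok=True)
    clip_idx = 0
    for meeting_id, spans in meetings.items():
        audio = AudioSegment.from_file(get_audio_path(meeting_id))
        for span in spans:
            start_ms = int(span["start-time"] * 1000)
            end_ms = int(span["end-time"] * 1000)
            # Cut the snippet and store it as {clip_idx}.mp3 for this person
            audio[start_ms:end_ms].export(
                str(out_dir / f"{clip_idx}.mp3"), format="mp3"
            )
            clip_idx += 1
```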
But what I am getting at is that while we can figure out and fine-tune a model all we want, figuring out how to deploy this system to each instance will be interesting.
Additionally, maybe we store the "audio classification" accuracy metric in the database, in the metrics collection / metadata collection that we discussed in #117.
So that we can have a page of stats on each instance at `/models` and see the accuracy of all of our models for that instance?
Still... just try to get any fine-tuning of the pre-trained model working and evaluated. Application doesn't matter if the system just doesn't work in the first place.
Going to update this issue because there has been a lot of progress on this and I want to start getting ideas about how to move to production.
I am going to keep working on the model; I definitely need to annotate a few more meetings to get a more diverse dataset for training and eval. But there are other things we need to do / consider.
As @tohuynh brought up: "how do we train these models for people?" I have briefly mentioned that I think one method for doing this will be to leave directions in the "admin-docs" section of an instance's repo: https://github.com/CouncilDataProject/seattle-staging/tree/main/admin-docs
The directions would explain how to store the Gecko annotation files in some directory in the repo. There needs to be a GitHub Action that the maintainer can manually kick off to run a "pre-dataset eval"; if the dataset passes pre-eval, it runs training, and after training, it stores the model.
What I mean by pre-eval is basically answering @tohuynh's comments regarding:
If all of those questions "pass", the GitHub Action can move on to training a model. I am not sure if we can train a model within the 6 hours of CPU time given to us by GitHub Actions, especially with loading the model, loading the audio files into memory, storing the model, and all that stuff, so to speed up training and actually make it possible, we may want to figure out how to kick off this training on Google Cloud.
Once the model is trained, store it to the instance's file store, but also report the evaluation as a part of the GitHub Action logs so the maintainer doesn't need to check some other resource for that info.
If the model meets some accuracy threshold, we can maybe automatically flag it as "production" and then store the model info in the metadata collection on Firestore to use during the event gather pipeline?
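If we go that route, the Firestore write could be as simple as this sketch (the collection, document, and field names are just guesses, not an agreed-on schema):

```python
# Rough idea of flagging a trained model as "production" in Firestore.
# Collection/document/field names below are hypothetical.
from google.cloud import firestore

VALIDATION_ACCURACY = 0.984  # reported by the training run
PRODUCTION_THRESHOLD = 0.98

db = firestore.Client()
db.collection("metadata").document("speaker_classification_model").set(
    {
        "model_uri": "gs://{instance-bucket}/speaker-model/",  # placeholder
        "validation_accuracy": VALIDATION_ACCURACY,
        "is_production": VALIDATION_ACCURACY >= PRODUCTION_THRESHOLD,
    }
)
```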
Also, in my function to apply the trained model across a transcript, I already have a "threshold" value that says the predicted label for each sentence must have at least 98.5% confidence. If it's below that threshold, we simply store `speaker_name` as `None`. I may drop this to just 98% but we can argue about that :shrug:
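In other words, something like this small helper (assuming the classifier output follows the Hugging Face audio-classification pipeline format of `{"label", "score"}` dicts sorted by score):

```python
from typing import Optional

THRESHOLD = 0.985  # minimum confidence to accept a predicted speaker

def choose_speaker(predictions: list[dict]) -> tuple[Optional[str], float]:
    """Return (speaker_name, confidence) for one sentence's predictions."""
    top = predictions[0]
    name = top["label"] if top["score"] >= THRESHOLD else None
    return name, top["score"]
```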
The transcript format should also have an annotation added that stores the speaker classification model metadata: Speakerbox version, validation accuracy, and maybe each sentence should have a "speaker_confidence" stored as well :shrug:
Some frontend thoughts:
For the transcript format, I would personally like to store the `speaker_name` as the person's actual name, not the speaker id. Is that okay @tohuynh @BrianL3? On the frontend I know we have the person picture, but it would also be good to show their name imo.
Similarly, I am wondering if it's possible to have like a "report incorrect speaker label" button or something? Not sure how that would work... but it's something to consider, and it ties into: https://github.com/CouncilDataProject/cdp-frontend/issues/141
(related to above, I am somewhat missing my "show dev stats" button that was available way back on v1 of CDP... it was useful for showing things like transcript confidence and now would be useful for showing each sentence's speaker confidence and so on haha)
> is there simply enough data
If the required number of training examples is too large, we can try audio augmentation by slightly changing the audio while preserving the speaker label in order to create new training examples from collected examples. Sorta like changing the lighting on an image to create more images. Not sure yet about what transformations could be done to audio.
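For illustration only, a few generic waveform-level transformations that preserve the speaker label might look like this (none of these are decided; they're just the audio analog of changing the lighting on an image):

```python
# A few generic waveform augmentations (gain change, added noise, small time
# shift) that preserve the speaker label. Purely illustrative; nothing decided.
import numpy as np

def augment(waveform: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    out = waveform.copy()
    # random gain between -6 dB and +6 dB
    out *= 10 ** (rng.uniform(-6, 6) / 20)
    # low-level Gaussian noise
    out += rng.normal(0, 0.005, size=out.shape)
    # small circular time shift (up to 0.25 s at 16 kHz)
    out = np.roll(out, rng.integers(-4000, 4000))
    return np.clip(out, -1.0, 1.0)
```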
> For the transcript format, I would personally like to store the `speaker_name` as the person's actual name, not the speaker id. Is that okay @tohuynh @BrianL3? On the frontend I know we have the person picture, but it would also be good to show their name IMO.
It would be nice to have all three: id, name, picture. We need the id if we want to link the person (who is also in the DB) to the person page.
Another thing to add to the documentation is that the training examples should be typical and/or representative examples (like all members present, typical recording environment), which is obvious to us, but just in case.
> If the required number of training examples is too large, we can try audio augmentation by slightly changing the audio while preserving the speaker label in order to create new training examples from collected examples. Sorta like changing the lighting on an image to create more images. Not sure yet about what transformations could be done to audio.
Yep, I think this can easily be done during the training process. Most easily, I can just vary the chunking: different lengths of chunks that get padded or truncated, etc.
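For example, the Wav2Vec2 feature extractor can already pad/truncate chunks of different lengths to one fixed model input size (the chunk lengths below are arbitrary examples):

```python
# Chunks of different lengths padded/truncated to the same model input size.
import numpy as np
from transformers import AutoFeatureExtractor

extractor = AutoFeatureExtractor.from_pretrained("facebook/wav2vec2-base")

# fake 2 s, 5 s, and 8 s chunks of 16 kHz audio
chunks = [np.random.randn(16_000 * s).astype(np.float32) for s in (2, 5, 8)]
batch = extractor(
    chunks,
    sampling_rate=16_000,
    max_length=16_000 * 5,      # pad/truncate everything to 5 s
    truncation=True,
    padding="max_length",
    return_tensors="np",
)
print(batch["input_values"].shape)  # (3, 80000)
```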
> It would be nice to have all three: id, name, picture. We need the id if we want to link the person (who is also in the DB) to the person page.
I can add a field for the `speaker_id` to the transcript model, but I feel like adding a field for the `speaker_image_uri` is a bit overkill imo.
That's fine if we don't want the person's picture next to their name. It would save some time to have the URL right away instead of having to query for the `person`, populate the `file_ref`, and download the URL from the gs URI to display the image.
> That's fine if we don't want the person's picture next to their name. It would save some time to have the URL right away instead of having to query for the `person`, populate the `file_ref`, and download the URL from the gs URI to display the image.
Yea, I don't think we need the person's photo. But can we test out the load times when we are ready to test this? A part of me says that because we are already pulling person info as a part of the voting record for the meeting, it shouldn't be adding too much time, but adding more time on top of what it already takes could be a problem.
The load times, if the renderable URL is stored, would be the load times of fetching all council members' photos (for that transcript). And if the council member speaks again later in the transcript, the browser would use the cache.
The load times, if the renderable URL is not stored, would be too long.
I agree that a person's photo is not absolutely necessary (it just makes it look a little nicer). Edit: I'm OK with not displaying the photo.
Coming back to this to leave notes of what I plan on doing in the coming weeks:
### Feature Description
Backend issue for the relevant roadmap issue
Adding speaker classification to CDP transcripts. This could be through a script/class that retroactively attaches the speaker name to a transcript that already has speaker diarization enabled. Prodigy can be used for annotating the training data.
### Use Case
With speaker classification we can provide transcripts annotated with the speaker. This can be used in many ways, such as through a script or GitHub Action.
### Solution
A very high-level idea would be to:
A bigger picture breakdown of all the major components can be found on the roadmap issue under "Major components".