Closed evamaxfield closed 2 years ago
Merging #7 (685f824) into main (90011e4) will increase coverage by 80.66%. The diff coverage is 91.42%.
```diff
@@            Coverage Diff            @@
##             main       #7       +/-   ##
===========================================
+ Coverage   10.74%   91.41%   +80.66%
===========================================
  Files           5       13        +8
  Lines         121      361      +240
===========================================
+ Hits           13      330      +317
+ Misses        108       31       -77
```
| Impacted Files | Coverage Δ | |
|---|---|---|
| speakerbox/tests/data/__init__.py | 0.00% <0.00%> (ø) | |
| speakerbox/utils.py | 75.00% <75.00%> (ø) | |
| speakerbox/datasets/seattle_2021_proto.py | 84.05% <84.05%> (ø) | |
| speakerbox/preprocess.py | 91.85% <91.85%> (ø) | |
| speakerbox/main.py | 98.52% <98.52%> (ø) | |
| speakerbox/__init__.py | 85.71% <100.00%> (+2.38%) | :arrow_up: |
| speakerbox/datasets/__init__.py | 100.00% <100.00%> (ø) | |
| speakerbox/tests/conftest.py | 100.00% <100.00%> (+16.66%) | :arrow_up: |
| speakerbox/tests/test_datasets.py | 100.00% <100.00%> (ø) | |
| speakerbox/tests/test_preprocess.py | 100.00% <100.00%> (ø) | |
| ... and 2 more | | |
Continue to review full report at Codecov.
Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 90011e4...685f824. Read the comment docs.
I can't mark you as reviewer for some reason @kristopher-smith but if you want to leave comments please do!
Not sure how well I can contribute to this :sweat_smile: but will do my best and take a close look this weekend.
Will try to take a look at this sometime this week/weekend!
> Not sure how well I can contribute to this :sweat_smile: but will do my best and take a close look this weekend.
I think looking over the README and saying "this does not make sense" is good enough for me
Fully trained model. I'm going to add more Juarez examples
Good questions!
- What's the general accuracy/precision you've seen from runs on larger data sets? The confusion matrix from the local run seems like the labels were generally pretty accurate
I have only tried training the full model twice. The first time, training succeeded but the model didn't save to S3 properly. That model reported 98% accuracy but I don't have the train/test/validation split counts or any more details other than this line from the log: `Loading best model from trained-speakerbox/checkpoint-4128 (score: 0.9816171573198348)`.
Since split designation is random (but with a few rules), I will add some more data for Juarez and Lewis and expect ~94%–99% accuracy.
- What's the split between test/training/validation data sets that you used? Is this hardcoded somewhere or can this be customized?
The above full model was trained, tested, and evaluated on:

| | train_counts | test_counts | valid_counts |
|---|---|---|---|
| gonzalez | 3503 | 713 | 928 |
| herbold | 2969 | 470 | 947 |
| juarez | 387 | 216 | 311 |
| lewis | 964 | 237 | 266 |
| morales | 700 | 306 | 447 |
| mosqueda | 2818 | 530 | 1007 |
| pedersen | 1132 | 176 | 351 |
| sawant | 584 | 682 | 564 |
| strauss | 1213 | 51 | 264 |
See here for more dataset prep info. Currently you can provide a seed to the splitting function but no explicit conversation ids. I will add an issue to the repo for such a feature, but right now I don't think it's needed because that function has been working pretty well.
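To make the seeded-split idea above concrete, here is a rough sketch of what a reproducible train/test/valid split looks like. Note that `split_dataset`, its ratios, and its signature are all hypothetical for illustration, not the actual speakerbox splitting function:

```python
import random

def split_dataset(examples, seed=182318512, ratios=(0.7, 0.15, 0.15)):
    """Shuffle examples with a fixed seed, then cut into train/test/valid.

    `examples` is any list of annotated audio chunks. Fixing the seed makes
    the (otherwise random) split reproducible between training runs, which is
    what lets you compare models trained on "the same" split.
    """
    rng = random.Random(seed)
    shuffled = examples[:]
    rng.shuffle(shuffled)
    n_train = int(len(shuffled) * ratios[0])
    n_test = int(len(shuffled) * ratios[1])
    return (
        shuffled[:n_train],                  # train
        shuffled[n_train:n_train + n_test],  # test
        shuffled[n_train + n_test:],         # valid
    )

train, test, valid = split_dataset(list(range(100)))
```

The real function additionally applies "a few rules" (e.g. keeping related chunks together), which is why explicit conversation-id splitting hasn't been needed yet.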
@tohuynh
I trained a bunch more models because machine learning is deterministic :upside_down_face:
best of the bunch:
equalized labels across splits:
random good performer without equalization
random good performer without equalization num 2
I feel like we could use any of these :shrug:
> I feel like we could use any of these 🤷
Sick!
Curious to see the predicted probabilities of each data point for the valid set or just the cross-entropy loss of each model for the valid set.
> I feel like we could use any of these 🤷
>
> Sick!
>
> Curious to see the predicted probabilities of each data point for the valid set or just the cross-entropy loss of each model for the valid set.
Those confusion matrices are from predictions on the holdout validation set 😀
The validation loss should be reported in the logs? I think... If not I think we can add an issue to the repo to add later?
> The validation loss should be reported in the logs? I think... If not I think we can add an issue to the repo to add later?
Not sure if it is. Since the accuracy, precision, and recall scores were so similar across these models, I was curious if the loss was similar too.
> Not sure if it is. Since the accuracy, precision, and recall scores were so similar across these models, I was curious if the loss was similar too.
Ahhhh yea we don't have loss for the validation but if you open those logs and use the search logs box for "eval_loss" you can jump to different epoch loss reports
> if you open those logs and use the search logs box for "eval_loss" you can jump to different epoch loss reports
That's for the test set right? The best I saw was ~0.4.
I think we want at least 0.2 on the validation set?
- Cross-Entropy = 0.00: Perfect probabilities.
- Cross-Entropy < 0.02: Great probabilities.
- Cross-Entropy < 0.05: On the right track.
- Cross-Entropy < 0.20: Fine.
- Cross-Entropy > 0.30: Not great.
- Cross-Entropy > 1.00: Terrible.
- Cross-Entropy > 2.00: Something is broken.

(https://machinelearningmastery.com/cross-entropy-for-machine-learning/)
Since the model is going to have to predict Unknown if the highest predicted probability for a real-world data point is too low (low confidence on that data point), I think it would be good to choose a model with low cross-entropy loss on the validation set -- it means the predicted probability distribution is very similar to the true probability distribution for each data point.
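The connection between the Unknown threshold and cross-entropy can be sketched in a few lines. The threshold value, speaker names, and function names here are illustrative, not speakerbox internals:

```python
import math

def cross_entropy(probs, true_label):
    """Per-example cross-entropy: -log(probability assigned to the true label)."""
    return -math.log(probs[true_label])

def predict_with_unknown(probs, threshold=0.85):
    """Return the most likely speaker, or "Unknown" when confidence is low."""
    label = max(probs, key=probs.get)
    return label if probs[label] >= threshold else "Unknown"

# A confident, correct prediction has near-zero loss ...
good = {"gonzalez": 0.97, "herbold": 0.02, "juarez": 0.01}
# ... while a confident, wrong one has a very large loss; a handful of
# these can dominate the mean eval_loss even at 97%+ accuracy.
bad = {"gonzalez": 0.95, "herbold": 0.04, "juarez": 0.01}

print(cross_entropy(good, "gonzalez"))  # ≈ 0.03
print(cross_entropy(bad, "herbold"))    # ≈ 3.22
print(predict_with_unknown({"gonzalez": 0.5, "herbold": 0.3, "juarez": 0.2}))  # Unknown
```

This is why a low-cross-entropy model is preferable here: it keeps confident predictions honest, so the Unknown cutoff actually separates good predictions from bad ones.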
I want to change the accuracy, precision, and recall to weighted average instead of "macro" average as I think it is right now. I will also add in computing the loss as well :+1:
Will rerun a few models to see what happens.
Will also try to generate and store the training curves as well
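For anyone unfamiliar with the macro vs weighted distinction mentioned above (in scikit-learn terms, `average="macro"` vs `average="weighted"`): macro averages per-class scores equally, while weighted averaging weights each class by its support. A minimal pure-Python illustration using recall (the toy data is invented):

```python
from collections import Counter

def recall_scores(y_true, y_pred):
    """Per-class recall: fraction of each class's true examples predicted correctly."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {c: hits[c] / totals[c] for c in totals}

def macro_recall(y_true, y_pred):
    """Unweighted mean of per-class recalls: every speaker counts equally."""
    scores = recall_scores(y_true, y_pred)
    return sum(scores.values()) / len(scores)

def weighted_recall(y_true, y_pred):
    """Support-weighted mean: speakers with more examples count more."""
    scores = recall_scores(y_true, y_pred)
    totals = Counter(y_true)
    n = len(y_true)
    return sum(scores[c] * totals[c] / n for c in totals)

# An imbalanced toy set: 8 "gonzalez" chunks (all correct), 2 "juarez" (1 wrong).
y_true = ["gonzalez"] * 8 + ["juarez"] * 2
y_pred = ["gonzalez"] * 8 + ["juarez", "gonzalez"]
print(round(macro_recall(y_true, y_pred), 3))     # 0.75 (juarez's 0.5 drags it down)
print(round(weighted_recall(y_true, y_pred), 3))  # 0.9
```

With speaker counts as skewed as the split table above (gonzalez at 3503 train chunks vs juarez at 387), the two averages can tell noticeably different stories.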
Something of note:
> Your loss might be hijacked by a few outliers (very wrong predictions); check the distribution of your loss function on individual samples of your validation set. If there is a cluster of values around the mean then you are overfitting. If there are just a few values very high above a low majority group then your loss is being affected by outliers :)
Hopefully, it's just some bad outliers that made `eval_loss` so high.
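The outlier check described above can be done directly from per-example probabilities. This is a sketch on invented data (the real check would use the model's actual validation-set outputs):

```python
import math
from statistics import mean, median

def per_sample_losses(valid_probs_and_labels):
    """Cross-entropy of each validation example, for eyeballing outliers."""
    return [-math.log(probs[label]) for probs, label in valid_probs_and_labels]

# Hypothetical validation set: 48 confident-correct examples, 2 confident-wrong.
valid = (
    [({"herbold": 0.98, "gonzalez": 0.02}, "herbold")] * 48
    + [({"gonzalez": 0.95, "herbold": 0.05}, "herbold")] * 2
)
losses = per_sample_losses(valid)
print(round(mean(losses), 3))             # pulled way up by the two outliers
print(round(median(losses), 3))           # ≈ 0.02: the typical example is fine
print(sum(1 for l in losses if l > 1.0))  # 2 outliers
```

A large mean/median gap like this is the "few values very high above a low majority group" pattern, i.e. outliers rather than overfitting.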
Well now I am confused...
Here is the most recent model including the validation loss:
Logs: https://github.com/JacksonMaxfield/phd-infrastructures/runs/6478246169?check_suite_focus=true
I am wondering if I should just run more epochs?
I don't think that would improve anything; training loss is already approaching zero -- we pretty much can't minimize training loss anymore. After that, training would just be picking up idiosyncrasies/noise of the data instead of its general pattern.
Could you examine the predicted probabilities of each datapoint in the validation set?
I think we need to see plots for training loss and eval_loss as training happens.
Overall, not sure about what conclusions to draw regarding what training learned from the data, if the accuracy, precision, and recall scores are good, but cross-entropy loss is terrible.
From a friend:
> you could have some bonkers outliers
:joy: seems accurate
After a bit more discussion and thinking I have one idea as to what is happening and one decision.
I agree with my friend Greg that the reason our cross-entropy is high is that some predictions of Gonzalez, when the true speaker was Herbold, are just really confident. I.e. it predicts the speaker as Gonzalez with 95% confidence for all 14 of those misclassifications.

I have a weird hypothesis as to why that may be happening, however, which relates not to the voice behind the audio, but the words. When I was annotating this data, I noticed and remembered that Herbold (and Juarez briefly) acted as interim Council President while Gonzalez was on maternity leave. A couple of the meetings I used in the annotation set included Herbold as the interim Council President, and I have a hypothesis that the model partially learned which words are associated with which individuals on the council, as well as which waveforms. The larger, more general hypothesis would be something like "speech recognition models can be trained more accurately by punishing / weighting predictions by ngram diversity" or something similar. (Imagine if our training set had both Gonzalez and Herbold saying: "Will the clerk please call the roll?")

For now, my advisors and I are going to shelve that idea simply due to time, but we will likely come back to it when we have models trained for more instances, in an attempt to understand whether we can predict "roll" on the council by speech alone.
If you recall back to how I am applying these models to the transcript: I am chunking the audio of each sentence of the transcript, predicting the speaker for each chunk, and then averaging the predictions over the whole sentence. With that in mind, I am going to semi-roll back some of the last three commits and ship this for now. With 97.65% accuracy I hope everyone is okay with that.
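The chunk-then-average step described above looks roughly like this. This is a sketch of the averaging idea on made-up probabilities, not the actual speakerbox implementation:

```python
from collections import Counter

def predict_sentence_speaker(chunk_predictions):
    """Combine per-chunk speaker probabilities into one label per sentence.

    `chunk_predictions` is a list of {speaker: probability} dicts, one per
    audio chunk of the sentence; averaging smooths over a few bad chunks.
    """
    totals = Counter()
    for probs in chunk_predictions:
        totals.update(probs)
    n = len(chunk_predictions)
    averaged = {speaker: total / n for speaker, total in totals.items()}
    return max(averaged, key=averaged.get)

# One confidently-wrong chunk out of three is outvoted by the average.
chunks = [
    {"herbold": 0.9, "gonzalez": 0.1},
    {"herbold": 0.8, "gonzalez": 0.2},
    {"gonzalez": 0.7, "herbold": 0.3},  # the misclassified chunk
]
print(predict_sentence_speaker(chunks))  # herbold
```

This is why shipping at 97.65% chunk-level accuracy is reasonable: occasional confidently-wrong chunks tend to get washed out at the sentence level.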
Besides the confidently wrong classifications you mentioned above, the rest are confidently (or good enough for us) right?
> I have a weird hypothesis as to why that may be happening, however, which relates not to the voice behind the audio, but the words.
These confidently wrong classifications for audio segments of Herbold include words also spoken by Gonzalez (in the training set)? If Herbold and Gonzalez both said XYZ in the training set, wouldn't the model be able to pick up the difference?
> With that in mind, I am going to semi-roll back some of the last three commits and ship this for now. With 97.65% accuracy I hope everyone is okay with that.
👍
> These confidently wrong classifications for audio segments of Herbold include words also spoken by Gonzalez (in the training set)? If Herbold and Gonzalez both said XYZ in the training set, wouldn't the model be able to pick up the difference?
Right, but there may be more examples of Gonzalez using such terminology than Herbold, and so the model sees the terminology as the defining factor and not the voice.
But to be clear, I don't know if that is truly the case or not. Purely speculating. The fact that everything else is incredibly accurate says to me that it's picking up voice, especially because I know there were some meetings in the dataset where Juarez and Mosqueda were interim president.
Link to Relevant Issue
This pull request resolves #5, resolves #6, resolves #3
Description of Changes
A continuation of #1
Opening this up for comments as I am going to start training a full model for Seattle 2021 tomorrow or Friday. This PR basically implements everything needed for larger prototyping and training. I am sure some refactoring will happen down the road, some utility functions will be added, and more, but as it is right now, I can use this library to both quickly annotate and train models!
Because of how large the file diff is I might recommend simply viewing the repo from the branch view: https://github.com/CouncilDataProject/speakerbox/tree/feature/dataset-expansion
The workflow image doesn't render in the README but that's because the image isn't on the main branch yet, here is the workflow image: https://github.com/CouncilDataProject/speakerbox/blob/feature/dataset-expansion/docs/_static/images/workflow.png
Note the massive caveat that this library currently only works on Ubuntu due to upstream dependencies not yet building for other platforms.
You can somewhat ignore the confusion matrix on this repo / PR -- the training loss drops to 0 on CPU sometimes and it feels like a bug in the huggingface trainer API but I am not entirely sure. When I train on GPU I never run into problems with the model completely failing.
(For example here is the confusion matrix from running the tests locally with GPU enabled) (Yes, that is really with as little data as is in the stored test resources zipfile :tada: )
In this case, I am using the confusion matrix here to simply check that training ran / as a proof of concept for how to report back to instance maintainers the results of training. (I am thinking they will open a PR on their instance repos to add their annotation files and we can write a bot they can message to kick off a job to train and then report back something similar to this PR but with more helpful non-technical comments)
Please leave any and all comments!