Closed evamaxfield closed 2 years ago
Merging #7 (685f824) into main (90011e4) will increase coverage by 80.66%. The diff coverage is 91.42%.
```diff
@@            Coverage Diff            @@
##             main       #7       +/-   ##
===========================================
+ Coverage   10.74%   91.41%   +80.66%
===========================================
  Files           5       13        +8
  Lines         121      361      +240
===========================================
+ Hits           13      330      +317
+ Misses        108       31       -77
```
| Impacted Files | Coverage Δ | |
|---|---|---|
| speakerbox/tests/data/__init__.py | 0.00% <0.00%> (ø) | |
| speakerbox/utils.py | 75.00% <75.00%> (ø) | |
| speakerbox/datasets/seattle_2021_proto.py | 84.05% <84.05%> (ø) | |
| speakerbox/preprocess.py | 91.85% <91.85%> (ø) | |
| speakerbox/main.py | 98.52% <98.52%> (ø) | |
| speakerbox/__init__.py | 85.71% <100.00%> (+2.38%) | :arrow_up: |
| speakerbox/datasets/__init__.py | 100.00% <100.00%> (ø) | |
| speakerbox/tests/conftest.py | 100.00% <100.00%> (+16.66%) | :arrow_up: |
| speakerbox/tests/test_datasets.py | 100.00% <100.00%> (ø) | |
| speakerbox/tests/test_preprocess.py | 100.00% <100.00%> (ø) | |
| ... and 2 more | | |
Continue to review full report at Codecov.
Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 90011e4...685f824. Read the comment docs.
I can't mark you as reviewer for some reason @kristopher-smith but if you want to leave comments please do!
Not sure how well I can contribute to this :sweat_smile: but will do my best and take a close look this weekend.
Will try to take a look at this sometime this week/weekend!
> Not sure how well I can contribute to this :sweat_smile: but will do my best and take a close look this weekend.
I think looking over the README and saying "this does not make sense" is good enough for me
Fully trained model. I'm going to add more Juarez examples
Good questions!
- What's the general accuracy/precision you've seen from runs on larger data sets? The confusion matrix from the local run seems like the labels were generally pretty accurate
I have only tried training the full model twice. The first time, training succeeded but the model didn't save to S3 properly. That model reported 98% accuracy but I don't have the train/test/validation split counts or any more details other than this line from the log: `Loading best model from trained-speakerbox/checkpoint-4128 (score: 0.9816171573198348)`.
Since split designation is random (but with a few rules), I will add some more data for Juarez and Lewis and expect ~94%–99% accuracy.
- What's the split between test/training/validation data sets that you used? Is this hardcoded somewhere or can this be customized?
The above full model was trained, tested, and evaluated on:

| | train_counts | test_counts | valid_counts |
|---|---|---|---|
| gonzalez | 3503 | 713 | 928 |
| herbold | 2969 | 470 | 947 |
| juarez | 387 | 216 | 311 |
| lewis | 964 | 237 | 266 |
| morales | 700 | 306 | 447 |
| mosqueda | 2818 | 530 | 1007 |
| pedersen | 1132 | 176 | 351 |
| sawant | 584 | 682 | 564 |
| strauss | 1213 | 51 | 264 |
See here for more dataset prep info. Currently you can provide a seed to the splitting function but no explicit conversation ids. I will add an issue to the repo for such a feature, but right now I don't think it's needed because that function has been working pretty well.
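To make the seeded-split idea above concrete, here is a rough sketch of what a reproducible train/test/valid split looks like. Note that `split_dataset`, its ratios, and its signature are all hypothetical for illustration, not the actual speakerbox splitting function:

```python
import random

def split_dataset(examples, seed=182318512, ratios=(0.7, 0.15, 0.15)):
    """Shuffle examples with a fixed seed, then cut into train/test/valid.

    `examples` is any list of annotated audio chunks. Fixing the seed makes
    the (otherwise random) split reproducible between training runs, which is
    what lets you compare models trained on "the same" split.
    """
    rng = random.Random(seed)
    shuffled = examples[:]
    rng.shuffle(shuffled)
    n_train = int(len(shuffled) * ratios[0])
    n_test = int(len(shuffled) * ratios[1])
    return (
        shuffled[:n_train],                  # train
        shuffled[n_train:n_train + n_test],  # test
        shuffled[n_train + n_test:],         # valid
    )

train, test, valid = split_dataset(list(range(100)))
```

The real function additionally applies "a few rules" (e.g. keeping related chunks together), which is why explicit conversation-id splitting hasn't been needed yet.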
@tohuynh
I trained a bunch more models because machine learning is deterministic :upside_down_face:
best of the bunch:
equalized labels across splits:
random good performer without equalization
random good performer without equalization num 2
I feel like we could use any of these :shrug:
> I feel like we could use any of these 🤷
Sick!
Curious to see the predicted probabilities of each data point for the valid set or just the cross-entropy loss of each model for the valid set.
> I feel like we could use any of these 🤷
>
> Sick!
>
> Curious to see the predicted probabilities of each data point for the valid set or just the cross-entropy loss of each model for the valid set.
Those confusion matrices are from predictions on the holdout validation set 😀
The validation loss should be reported in the logs? I think... If not I think we can add an issue to the repo to add later?
> The validation loss should be reported in the logs? I think... If not I think we can add an issue to the repo to add later?
Not sure if it is. Since the accuracy, precision, and recall scores were so similar across these models, I was curious if the loss was similar too.
> Not sure if it is. Since the accuracy, precision, and recall scores were so similar across these models, I was curious if the loss was similar too.
Ahhhh yea we don't have loss for the validation but if you open those logs and use the search logs box for "eval_loss" you can jump to different epoch loss reports
> if you open those logs and use the search logs box for "eval_loss" you can jump to different epoch loss reports
That's for the test set right? The best I saw was ~0.4.
I think we want at least 0.2 on the validation set?
- Cross-Entropy = 0.00: Perfect probabilities.
- Cross-Entropy < 0.02: Great probabilities.
- Cross-Entropy < 0.05: On the right track.
- Cross-Entropy < 0.20: Fine.
- Cross-Entropy > 0.30: Not great.
- Cross-Entropy > 1.00: Terrible.
- Cross-Entropy > 2.00: Something is broken.

(https://machinelearningmastery.com/cross-entropy-for-machine-learning/)
Since the model is going to have to predict Unknown if the highest predicted probability for a real-world data point is too low (low confidence on that data point), I think it would be good to choose a model with low cross-entropy loss on the validation set -- it means the predicted probability distribution is very similar to the true probability distribution for each data point.
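The connection between the Unknown threshold and cross-entropy can be sketched in a few lines. The threshold value, speaker names, and function names here are illustrative, not speakerbox internals:

```python
import math

def cross_entropy(probs, true_label):
    """Per-example cross-entropy: -log(probability assigned to the true label)."""
    return -math.log(probs[true_label])

def predict_with_unknown(probs, threshold=0.85):
    """Return the most likely speaker, or "Unknown" when confidence is low."""
    label = max(probs, key=probs.get)
    return label if probs[label] >= threshold else "Unknown"

# A confident, correct prediction has near-zero loss ...
good = {"gonzalez": 0.97, "herbold": 0.02, "juarez": 0.01}
# ... while a confident, wrong one has a very large loss; a handful of
# these can dominate the mean eval_loss even at 97%+ accuracy.
bad = {"gonzalez": 0.95, "herbold": 0.04, "juarez": 0.01}

print(cross_entropy(good, "gonzalez"))  # ≈ 0.03
print(cross_entropy(bad, "herbold"))    # ≈ 3.22
print(predict_with_unknown({"gonzalez": 0.5, "herbold": 0.3, "juarez": 0.2}))  # Unknown
```

This is why a low-cross-entropy model is preferable here: it keeps confident predictions honest, so the Unknown cutoff actually separates good predictions from bad ones.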
I want to change the accuracy, precision, and recall to weighted average instead of "macro" average as I think it is right now. I will also add in computing the loss as well :+1:
Will rerun a few models to see what happens.
Will also try to generate and store the training curves as well
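For anyone unfamiliar with the macro vs weighted distinction mentioned above (in scikit-learn terms, `average="macro"` vs `average="weighted"`): macro averages per-class scores equally, while weighted averaging weights each class by its support. A minimal pure-Python illustration using recall (the toy data is invented):

```python
from collections import Counter

def recall_scores(y_true, y_pred):
    """Per-class recall: fraction of each class's true examples predicted correctly."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {c: hits[c] / totals[c] for c in totals}

def macro_recall(y_true, y_pred):
    """Unweighted mean of per-class recalls: every speaker counts equally."""
    scores = recall_scores(y_true, y_pred)
    return sum(scores.values()) / len(scores)

def weighted_recall(y_true, y_pred):
    """Support-weighted mean: speakers with more examples count more."""
    scores = recall_scores(y_true, y_pred)
    totals = Counter(y_true)
    n = len(y_true)
    return sum(scores[c] * totals[c] / n for c in totals)

# An imbalanced toy set: 8 "gonzalez" chunks (all correct), 2 "juarez" (1 wrong).
y_true = ["gonzalez"] * 8 + ["juarez"] * 2
y_pred = ["gonzalez"] * 8 + ["juarez", "gonzalez"]
print(round(macro_recall(y_true, y_pred), 3))     # 0.75 (juarez's 0.5 drags it down)
print(round(weighted_recall(y_true, y_pred), 3))  # 0.9
```

With speaker counts as skewed as the split table above (gonzalez at 3503 train chunks vs juarez at 387), the two averages can tell noticeably different stories.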
Something of note:
> Your loss might be hijacked by a few outliers (very wrong predictions); check the distribution of your loss function on individual samples of your validation set. If there is a cluster of values around the mean then you are overfitting. If there are just a few values very high above a low majority group then your loss is being affected by outliers :)
Hopefully, it's just some bad outliers that made `eval_loss` so high.
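The outlier check described above can be done directly from per-example probabilities. This is a sketch on invented data (the real check would use the model's actual validation-set outputs):

```python
import math
from statistics import mean, median

def per_sample_losses(valid_probs_and_labels):
    """Cross-entropy of each validation example, for eyeballing outliers."""
    return [-math.log(probs[label]) for probs, label in valid_probs_and_labels]

# Hypothetical validation set: 48 confident-correct examples, 2 confident-wrong.
valid = (
    [({"herbold": 0.98, "gonzalez": 0.02}, "herbold")] * 48
    + [({"gonzalez": 0.95, "herbold": 0.05}, "herbold")] * 2
)
losses = per_sample_losses(valid)
print(round(mean(losses), 3))             # pulled way up by the two outliers
print(round(median(losses), 3))           # ≈ 0.02: the typical example is fine
print(sum(1 for l in losses if l > 1.0))  # 2 outliers
```

A large mean/median gap like this is the "few values very high above a low majority group" pattern, i.e. outliers rather than overfitting.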
Well now I am confused...
Here is the most recent model including the validation loss:
Logs: https://github.com/JacksonMaxfield/phd-infrastructures/runs/6478246169?check_suite_focus=true
I am wondering if I should just run more epochs?
I don't think that would improve anything; training loss is already approaching zero -- we pretty much can't minimize training loss anymore. After that, training would just be picking up idiosyncrasies/noise of the data instead of its general pattern.
Could you examine the predicted probabilities of each datapoint in the validation set?
I think we need to see plots for training loss and eval_loss as training happens.
Overall, not sure about what conclusions to draw regarding what training learned from the data, if the accuracy, precision, and recall scores are good, but cross-entropy loss is terrible.
From a friend:
> you could have some bonkers outliers
:joy: seems accurate
After a bit more discussion and thinking I have one idea as to what is happening and one decision.
I agree with my friend Greg that the reason our cross-entropy is high is that some predictions of Gonzalez, when the true speaker was Herbold, are just really confident. I.e. it predicts the speaker as Gonzalez with 95% confidence for all 14 of those misclassifications.

I have a weird hypothesis as to why that may be happening, however, which relates not to the voice behind the audio, but the words. When I was annotating this data, I noticed and remembered that Herbold (and Juarez briefly) acted as interim Council President while Gonzalez was on maternity leave. A couple of the meetings I used in the annotation set included Herbold as the interim Council President, and I have a hypothesis that the model partially learned which words are associated with which individuals on the council, as well as which waveforms. The larger, more general hypothesis would be something like "speech recognition models can be trained more accurately by punishing / weighting predictions by ngram diversity" or something similar. (Imagine if our training set had both Gonzalez and Herbold saying: "Will the clerk please call the roll?")

For now, my advisors and I are going to shelve that idea simply due to time, but we will likely come back to it when we have models trained for more instances, in an attempt to understand whether we can predict "roll" on the council by speech alone.
If you recall back to how I am applying these models to the transcript: I am chunking the audio of each sentence of the transcript, predicting the speaker for each chunk, and then averaging the predictions over the whole sentence. With that in mind, I am going to semi-roll back some of the last three commits and ship this for now. With 97.65% accuracy I hope everyone is okay with that.
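The chunk-then-average step described above looks roughly like this. This is a sketch of the averaging idea on made-up probabilities, not the actual speakerbox implementation:

```python
from collections import Counter

def predict_sentence_speaker(chunk_predictions):
    """Combine per-chunk speaker probabilities into one label per sentence.

    `chunk_predictions` is a list of {speaker: probability} dicts, one per
    audio chunk of the sentence; averaging smooths over a few bad chunks.
    """
    totals = Counter()
    for probs in chunk_predictions:
        totals.update(probs)
    n = len(chunk_predictions)
    averaged = {speaker: total / n for speaker, total in totals.items()}
    return max(averaged, key=averaged.get)

# One confidently-wrong chunk out of three is outvoted by the average.
chunks = [
    {"herbold": 0.9, "gonzalez": 0.1},
    {"herbold": 0.8, "gonzalez": 0.2},
    {"gonzalez": 0.7, "herbold": 0.3},  # the misclassified chunk
]
print(predict_sentence_speaker(chunks))  # herbold
```

This is why shipping at 97.65% chunk-level accuracy is reasonable: occasional confidently-wrong chunks tend to get washed out at the sentence level.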
Besides the confidently wrong classifications you mentioned above, the rest are confidently (or good enough for us) right?
> I have a weird hypothesis as to why that may be happening, however, which relates not to the voice behind the audio, but the words.
These confidently wrong classifications for audio segments of Herbold include words also spoken by Gonzalez (in the training set)? If Herbold and Gonzalez both said XYZ in the training set, wouldn't the model be able to pick up the difference?
> With that in mind, I am going to semi-roll back some of the last three commits and ship this for now. With 97.65% accuracy I hope everyone is okay with that.
👍
> These confidently wrong classifications for audio segments of Herbold include words also spoken by Gonzalez (in the training set)? If Herbold and Gonzalez both said XYZ in the training set, wouldn't the model be able to pick up the difference?
Right, but there may be more examples of Gonzalez using such terminology than Herbold, and so the model sees the terminology as the defining factor and not the voice.
But to be clear, I don't know if that is truly the case or not. Purely speculating. The fact that everything else is incredibly accurate says to me that it's picking up voice, especially because I know there were some meetings in the dataset where Juarez and Mosqueda were interim president.
Link to Relevant Issue
This pull request resolves #5, resolves #6, resolves #3
Description of Changes
A continuation of #1
Opening this up for comments as I am going to start training a full model for Seattle 2021 tomorrow or Friday. This PR basically implements everything needed for larger prototyping and training. I am sure some refactoring will happen down the road, some utility functions will be added, and more, but as it is right now, I can use this library to both quickly annotate and train models!
Because of how large the file diff is I might recommend simply viewing the repo from the branch view: https://github.com/CouncilDataProject/speakerbox/tree/feature/dataset-expansion
The workflow image doesn't render in the README but that's because the image isn't on the main branch yet, here is the workflow image: https://github.com/CouncilDataProject/speakerbox/blob/feature/dataset-expansion/docs/_static/images/workflow.png
Note the massive caveat that this library currently only works on Ubuntu due to upstream dependencies not yet building for other platforms.
You can somewhat ignore the confusion matrix on this repo / PR -- the training loss drops to 0 on CPU sometimes and it feels like a bug in the huggingface trainer API but I am not entirely sure. When I train on GPU I never run into problems with the model completely failing.
(For example here is the confusion matrix from running the tests locally with GPU enabled) (Yes, that is really with as little data as is in the stored test resources zipfile :tada: )
In this case, I am using the confusion matrix here to simply check that training ran / as a proof of concept for how to report back to instance maintainers the results of training. (I am thinking they will open a PR on their instance repos to add their annotation files and we can write a bot they can message to kick off a job to train and then report back something similar to this PR but with more helpful non-technical comments)
Please leave any and all comments!