CouncilDataProject / speakerbox

Speakerbox: Fine-tune Audio Transformers for speaker identification.
https://councildataproject.org/speakerbox
MIT License

Portuguese Training #17

Closed: DieguJota closed this issue 1 year ago

DieguJota commented 1 year ago

Hello, and thank you for the tool you have developed.

I just wanted to share that I'm running some tests with your tool on Portuguese audio. So far I have trained on 170 different speakers with approximately one hour of audio, and I am getting 86% accuracy. The audio quality is extremely poor (call-center recordings sampled at 8 kHz), but even so the accuracy of your engine is very good.

I did have to change your code, mainly in the prepare_dataset step, because it depends on a conversation ID. My scenario is a little different: I only have a separate folder per person. Other than that, it was pretty simple to train. As a suggestion for improvement, I didn't see much need to split the dataset by conversation.

I opened an issue because this repository has no discussion board; I just wanted to share my results with you.
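For readers with the same "one folder per speaker" layout, the sketch below shows one way to build labeled records and a per-speaker train/test split using only the Python standard library. It is not speakerbox's prepare_dataset and does not call any speakerbox API; the function names `collect_speaker_files` and `split_per_speaker`, the record keys, and the `*.wav` pattern are all assumptions made for illustration.

```python
import random
from collections import defaultdict
from pathlib import Path


def collect_speaker_files(root, pattern="*.wav"):
    """Return one {"label", "audio"} record per file found in root/<speaker>/.

    Hypothetical helper: the folder name is used as the speaker label.
    """
    records = []
    for speaker_dir in sorted(Path(root).iterdir()):
        if not speaker_dir.is_dir():
            continue
        for audio_path in sorted(speaker_dir.glob(pattern)):
            records.append({"label": speaker_dir.name, "audio": str(audio_path)})
    return records


def split_per_speaker(records, test_fraction=0.2, seed=0):
    """Hold out part of each speaker's files so every speaker stays in training."""
    by_speaker = defaultdict(list)
    for record in records:
        by_speaker[record["label"]].append(record)

    rng = random.Random(seed)
    train, test = [], []
    for speaker in sorted(by_speaker):
        files = by_speaker[speaker][:]
        rng.shuffle(files)
        # At least one test file when the speaker has two or more files,
        # and always at least one file left for training.
        wanted = max(1, round(len(files) * test_fraction))
        n_test = min(len(files) - 1, wanted)
        test.extend(files[:n_test])
        train.extend(files[n_test:])
    return train, test
```

Splitting per speaker rather than per conversation keeps every speaker in the training set. The trade-off is that clips from the same recording session can land in both splits, which may make held-out accuracy look better than it would on a new session.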

Epoch | Training Loss | Validation Loss | Accuracy
1 | 5.137100 | 5.217264 | 0.007317
2 | 4.836000 | 4.813613 | 0.046341
3 | 4.222300 | 4.175541 | 0.132927
4 | 3.446100 | 3.478938 | 0.236585
5 | 2.382000 | 2.762814 | 0.360976
6 | 2.422300 | 1.931921 | 0.502439
7 | 1.907200 | 1.598777 | 0.600000
8 | 1.262300 | 1.402867 | 0.613415
9 | 1.160200 | 1.040319 | 0.713415
10 | 1.402000 | 1.434696 | 0.632927
11 | 0.818900 | 1.030849 | 0.719512
12 | 0.636200 | 0.888412 | 0.774390
13 | 0.775500 | 0.888210 | 0.770732
14 | 0.180300 | 0.820457 | 0.785366
15 | 0.389300 | 0.771072 | 0.820732
16 | 0.305800 | 0.843936 | 0.791463
17 | 0.194600 | 0.796410 | 0.800000
18 | 0.372000 | 0.765828 | 0.820732
19 | 0.247800 | 0.736840 | 0.836585
20 | 0.196900 | 0.744377 | 0.834146
21 | 0.172800 | 0.661944 | 0.830488
22 | 0.145900 | 0.617652 | 0.852439
23 | 0.171100 | 0.709798 | 0.841463
24 | 0.238700 | 0.722369 | 0.852439
25 | 0.158100 | 0.705889 | 0.842683
26 | 0.124300 | 0.611421 | 0.867073
27 | 0.270200 | 0.647974 | 0.860976
28 | 0.178900 | 0.634960 | 0.859756
29 | 0.049200 | 0.567398 | 0.875610
30 | 0.025100 | 0.604317 | 0.864634

evamaxfield commented 1 year ago

Hey @DieguJota thanks for letting me know! Always exciting to hear about more usage of the project.

I agree that your use-case of "I have a directory full of speaker audios" isn't currently met by the library. It was very much developed for my own use-case and problem but I think there is room for a feature or two to make its way into the library. I am going to open a new issue with some thoughts.

Side note: hopefully the confusion matrix saved after eval_model looks good. Accuracy is a useful metric, but it can hide a lot of problems.
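To illustrate how accuracy can hide problems, here is a minimal, self-contained sketch that counts confusion-matrix cells and computes per-speaker recall. The speaker names and predictions are made up for the example, and the helper names are hypothetical; this is not how eval_model computes its output.

```python
from collections import Counter


def confusion_counts(y_true, y_pred):
    """Count (true speaker, predicted speaker) pairs: the confusion-matrix cells."""
    return Counter(zip(y_true, y_pred))


def per_speaker_recall(y_true, y_pred):
    """Fraction of each speaker's clips that were attributed to that speaker."""
    counts = confusion_counts(y_true, y_pred)
    totals = Counter(y_true)
    return {speaker: counts[(speaker, speaker)] / totals[speaker]
            for speaker in sorted(totals)}


# Made-up, imbalanced example: the model predicts "ana" for every clip.
y_true = ["ana"] * 8 + ["bruno"] * 2
y_pred = ["ana"] * 10

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
recall = per_speaker_recall(y_true, y_pred)

print(accuracy)  # 0.8
print(recall)    # {'ana': 1.0, 'bruno': 0.0}
```

Overall accuracy is 80% even though one speaker is never recognized. In the confusion matrix, that shows up as an off-diagonal cell (true "bruno", predicted "ana") holding all of that speaker's clips.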

DieguJota commented 1 year ago

@evamaxfield Although the labels are illegible, you can see that the confusion matrix turned out pretty well. 93% accuracy.

[image: confusion matrix]

evamaxfield commented 1 year ago

@DieguJota it looks like something went wrong here. Normally I would expect mostly purple / dark colors everywhere except along the diagonal.

Without knowing your problem, use case, and what data you have available, you might want to run a few more tests of the model.

DieguJota commented 1 year ago

@evamaxfield I had not noticed before, but it seems the colors were inverted. I manually tested the validation audio files and the results were excellent: almost 100% accuracy, with a high confidence that each clip came from the correct person. However, I have a question and would appreciate your help.

I would like to know whether the model you present would be suitable for a biometric authentication project. I tested the model with my own voice, which is not in the training set, and it reported 99% confidence that I was a different person. Recognizing previously trained people is a great feature, but I am concerned about how confident the model is when it misidentifies untrained people. This could be a problem if the goal is to verify that a person really is who they claim to be.
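The behavior described above is what one would expect from a closed-set classifier, assuming the reported confidence is a softmax score over the trained speakers (an assumption; the thread does not say how the score is produced). Softmax spreads all probability across the known labels, so there is no "none of the above" outcome. The sketch below uses made-up logits and hypothetical names to show the effect; it is not speakerbox code.

```python
import math


def softmax(logits):
    """Convert raw scores to probabilities that always sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def identify(logits, labels, threshold=0.9):
    """Return (label, confidence), or (None, confidence) if below threshold."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    if probs[best] < threshold:
        return None, probs[best]
    return labels[best], probs[best]


labels = ["speaker_a", "speaker_b", "speaker_c"]  # hypothetical trained speakers

# Made-up logits for a voice that is NOT in the training set but happens to
# sound more like speaker_a than like the other trained speakers.
unknown_voice_logits = [9.0, 2.0, 1.0]

label, confidence = identify(unknown_voice_logits, labels)
print(label, confidence > 0.99)  # speaker_a True
```

Because the score only measures which trained speaker is the closest match, a simple threshold on it does not reliably reject unknown voices.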

evamaxfield commented 1 year ago

I am sorry, I can't and won't help with your problem. I just don't know enough about what you are working on or trying to achieve.

Continue at your own risk.