Spoken Language Classification

This repository contains models pretrained on the VoxLingua107 dataset for spoken (audio-based) language classification. The dataset, and therefore the models, can distinguish between 107 languages. Four models are provided (see below).

Usage

git clone https://github.com/RicherMans/SpokenLanguageClassifiers
cd SpokenLanguageClassifiers
pip install -r requirements.txt
python3 predict.py AUDIOFILE

The model used can be changed. Currently four models have been pretrained, all of which are selected with the --model MODELNAME parameter.

By default the script prints the top N results (N=5; it can be changed with --N NUMBER).
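
For example, to classify a recording with the CNN10 model and show only the top three languages (sample.wav is a hypothetical input file; the --model and --N flags are the ones described above):

python3 predict.py sample.wav --model CNN10 --N 3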

Models

Four models were pretrained and can be chosen as the back-end:

  1. CNN6 (default): A six-layer CNN that uses attention as its temporal aggregation (see the sketch after this list).
  2. CNN10: A ten-layer CNN that uses mean and max pooling as its temporal aggregation.
  3. MobileNetV2: A MobileNet implementation for audio classification.
  4. CNNVAD: A model that performs VAD and language classification simultaneously. The VAD model is taken from GPV and Data-driven GPVAD; training was done by fine-tuning both the VAD and the language classification models. The back-end model here is the default CNN6.
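
For intuition, attention-based temporal aggregation (as used in CNN6) can be sketched in PyTorch as follows. This is a minimal illustration, not the repository's actual implementation; the module name AttentionPool and the single linear scoring layer are assumptions:

import torch
import torch.nn as nn

class AttentionPool(nn.Module):
    """Collapse a (batch, time, features) sequence into one
    (batch, features) vector using learned per-frame weights."""
    def __init__(self, feature_dim: int):
        super().__init__()
        # Hypothetical scorer: one scalar attention score per time frame.
        self.score = nn.Linear(feature_dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, features)
        weights = torch.softmax(self.score(x), dim=1)  # (batch, time, 1)
        return (weights * x).sum(dim=1)                # weighted average over time

pool = AttentionPool(512)
pooled = pool(torch.randn(2, 100, 512))  # -> shape (2, 512)

Unlike the mean and max pooling used by CNN10, the weights here are learned, so informative frames can contribute more to the clip-level embedding.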

Since I don't have access to other datasets for cross-dataset evaluation, I report performance on my held-out cross-validation set:

Model        Precision (%)  Recall (%)  Accuracy (%)
CNN6         81.7           84.4        83.6
CNN10        89.9           90.9        90.8
MobileNetV2  80.0           80.1        79.3
CNNVAD       81.0           82.4        82.9
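
For anyone reproducing numbers of this kind, the three metrics can be computed with scikit-learn. This is a sketch only: the averaging mode (macro here) is an assumption, as it is not stated above, and the label arrays are placeholders:

from sklearn.metrics import accuracy_score, precision_score, recall_score

# Hypothetical language-index labels for a held-out set; replace with real data.
y_true = [0, 1, 2, 2, 1]
y_pred = [0, 1, 2, 1, 1]

print("Precision:", precision_score(y_true, y_pred, average="macro"))
print("Recall:   ", recall_score(y_true, y_pred, average="macro"))
print("Accuracy: ", accuracy_score(y_true, y_pred))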