axinc-ai / ailia-models

The collection of pre-trained, state-of-the-art AI models for ailia SDK

ADD wav2vec2 #421

Open kyakuno opened 3 years ago

kyakuno commented 3 years ago

https://huggingface.co/transformers/model_doc/wav2vec2.html Speech recognition using a BERT-style model.

kyakuno commented 3 years ago

An example of converting a transformers model to ONNX is linked below. However, it is not yet clear whether wav2vec2 can be converted. https://github.com/axinc-ai/bert-japanese-onnx

kyakuno commented 3 years ago

It looks like people are having quite a hard time with this. https://github.com/pytorch/fairseq/issues/3010

kyakuno commented 3 years ago

This model appears to include Japanese as well. xlsr_53_56k.pt alone is 3.5 GB. https://github.com/pytorch/fairseq/blob/master/examples/wav2vec/README.md

| Model | Architecture | Hours | Languages | Datasets | Model |
|---|---|---|---|---|---|
| XLSR-53 | Large | 56k | 53 | MLS, CommonVoice, BABEL | download |

The XLSR model uses the following datasets for multilingual pretraining:

MLS: Multilingual LibriSpeech (8 languages, 50.7k hours): Dutch, English, French, German, Italian, Polish, Portuguese, Spanish

CommonVoice (36 languages, 3.6k hours): Arabic, Basque, Breton, Chinese (CN), Chinese (HK), Chinese (TW), Chuvash, Dhivehi, Dutch, English, Esperanto, Estonian, French, German, Hakh-Chin, Indonesian, Interlingua, Irish, Italian, Japanese, Kabyle, Kinyarwanda, Kyrgyz, Latvian, Mongolian, Persian, Portuguese, Russian, Sakha, Slovenian, Spanish, Swedish, Tamil, Tatar, Turkish, Welsh (see also finetuning splits from this paper).

Babel (17 languages, 1.7k hours): Assamese, Bengali, Cantonese, Cebuano, Georgian, Haitian, Kazakh, Kurmanji, Lao, Pashto, Swahili, Tagalog, Tamil, Tok Pisin, Turkish, Vietnamese, Zulu
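As a quick sanity check on the figures quoted above, the per-dataset hours can be summed and compared with the 56k in the Hours column. This is only a minimal sketch: the names `DATASET_KHOURS`, `total_khours`, and `share_percent` are hypothetical helpers made up for illustration, and the numbers are copied from the README excerpt, not measured.

```python
from decimal import Decimal

# Pretraining hours per dataset, in thousands of hours, as quoted in the
# README excerpt above. Decimal avoids float rounding noise in the sum.
DATASET_KHOURS = {
    "MLS": Decimal("50.7"),
    "CommonVoice": Decimal("3.6"),
    "Babel": Decimal("1.7"),
}


def total_khours(datasets):
    """Return the total pretraining hours (in thousands) across datasets."""
    return sum(datasets.values(), Decimal("0"))


def share_percent(datasets):
    """Return each dataset's share of the total, rounded to 0.1 percent."""
    total = total_khours(datasets)
    return {
        name: (hours / total * 100).quantize(Decimal("0.1"))
        for name, hours in datasets.items()
    }


if __name__ == "__main__":
    print("total (k hours):", total_khours(DATASET_KHOURS))
    for name, pct in share_percent(DATASET_KHOURS).items():
        print(f"{name}: {pct}%")
```

The total comes out to 56.0k hours, which matches the table. MLS accounts for roughly 90% of the audio, while CommonVoice, the only one of the three sets that lists Japanese, is roughly 6% spread over 36 languages.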

kyakuno commented 3 years ago

There seems to be a dataset called CommonVoice. Japanese has 306 speakers, while English has 69,610 speakers. https://commonvoice.mozilla.org/en/languages

kyakuno commented 3 years ago

This issue says that conversion to ONNX should work, although it has not been tested. https://github.com/pytorch/fairseq/issues/2972

mthrok commented 3 years ago

https://github.com/pytorch/fairseq/issues/3010#issuecomment-856053528