TensorSpeech / TensorFlowASR

:zap: TensorFlowASR: Almost State-of-the-art Automatic Speech Recognition in Tensorflow 2. Supported languages that can use characters or subwords
https://huylenguyen.com/asr
Apache License 2.0

Question: Dataset #139

Open n0tOdd opened 3 years ago

n0tOdd commented 3 years ago

Hey, I just wanted to ask if this idea is garbage or something that could actually work. I'm trying to create a model for the Norwegian language, but I can't find any good datasets for it; the only one I can find produces a poor model that doesn't understand half of what I'm saying. So I was thinking about writing a crawler for sites with Norwegian videos that have subtitles, using youtube-dl to download the videos and subtitles, then converting each video into short wav files with ffmpeg and saving each clip together with its transcription (audioname1.wav and audioname1.txt, and so on) to create a fresh dataset for my language. If this works, it should be able to create datasets for most languages and feed a fully automated, self-teaching model for whatever language is needed, so I'm crossing my fingers that this is a good idea.
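
Roughly, the download step I have in mind looks something like this (just a sketch; the playlist URL and output paths are made up, the youtube-dl flags are the standard ones for grabbing audio plus a subtitle track):

```python
import subprocess

# Hypothetical playlist/channel URL - replace with a site you have permission to crawl.
SOURCE_URL = "https://example.com/playlist"

# Download the audio as wav plus the Norwegian subtitle file for each video.
subprocess.run([
    "youtube-dl",
    "--write-sub",            # download the uploaded subtitle file
    "--sub-lang", "no",       # Norwegian subtitle track (language code may vary per site)
    "-x",                     # extract audio only
    "--audio-format", "wav",
    "-o", "raw/%(id)s.%(ext)s",
    SOURCE_URL,
], check=True)
```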

tund commented 3 years ago

Hi @n0tMaster, after downloading the videos and subtitles, you would need to segment them into smaller chunks, for example sentences. I don't think that's easy to do properly. You also might want to look at the wav2vec 2.0 model. It's a semi-supervised framework, so you don't need a lot of labeled data (transcribed audio).
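
For reference, a pretrained wav2vec 2.0 checkpoint is easy to try through the Hugging Face transformers port; a minimal inference sketch (the English checkpoint here is only to show the API, for Norwegian you would fine-tune a multilingual checkpoint such as XLSR on your own clips):

```python
import torch
import librosa
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

# English checkpoint used only to illustrate the API, not a Norwegian model.
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

speech, _ = librosa.load("clip.wav", sr=16000)  # the model expects 16 kHz mono audio
inputs = processor(speech, sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    logits = model(inputs.input_values).logits
prediction = processor.batch_decode(torch.argmax(logits, dim=-1))
print(prediction)
```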

n0tOdd commented 3 years ago

Yeah, that's why I take the subtitle file and use every entry in it to create smaller audio files with ffmpeg. The files I get from this are 2-9 seconds long, and all the transcriptions I've checked manually contain only one sentence. Are these audio files too long, or can they be used?
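
The cutting step is roughly this (a sketch assuming .srt subtitles; the file names are just placeholders):

```python
import re
import subprocess
from pathlib import Path

# Matches an SRT timing line like "00:01:02,500 --> 00:01:05,120"
TIME = re.compile(r"(\d+):(\d+):(\d+)[,.](\d+) --> (\d+):(\d+):(\d+)[,.](\d+)")

def to_seconds(h, m, s, ms):
    return int(h) * 3600 + int(m) * 60 + int(s) + int(ms) / 1000.0

def cut_clips(wav_path, srt_path, out_dir):
    """Cut one clip per subtitle entry and write the matching transcript file."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    blocks = Path(srt_path).read_text(encoding="utf-8").strip().split("\n\n")
    for i, block in enumerate(blocks):
        lines = block.splitlines()
        match = TIME.search(lines[1]) if len(lines) > 2 else None
        if not match:
            continue
        start = to_seconds(*match.groups()[:4])
        end = to_seconds(*match.groups()[4:])
        text = " ".join(lines[2:]).strip()
        clip = out_dir / f"audioname{i}.wav"
        subprocess.run([
            "ffmpeg", "-y", "-i", str(wav_path),
            "-ss", str(start), "-to", str(end),
            "-ar", "16000", "-ac", "1",  # resample to 16 kHz mono for training
            str(clip),
        ], check=True)
        (out_dir / f"audioname{i}.txt").write_text(text + "\n", encoding="utf-8")
```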

I just finished a test run where it downloaded 500 GB of TV shows with subtitles and created audio files plus text files matching the spoken words of each audio file. It's been running for an hour training DeepSpeech2 and the evaluation run has a CTC loss of 87.80209. I have no idea what a good CTC loss after an hour of training looks like, but it's better than what I got on the 20 GB corpus I found for training Norwegian, so I'm keeping my fingers crossed that this works.

mjurkus commented 3 years ago

You might want to contribute to https://commonvoice.mozilla.org/en/languages to enable crowdsourcing for your language. That's what I did, at least. And then promote the hell out of it to get people to contribute.

Using videos or audiobooks or something like that might be an issue - licensing, GDPR, and so on.

n0tOdd commented 3 years ago

Yeah, I know licensing can be a problem, so I sent an email to the provider I'm grabbing the videos from and got permission to use their data for training, but I am not allowed to share the dataset, only the finished product.

But I don't know if this is any good. I'm halfway through epoch 2 now and the CTC loss with DeepSpeech2 is still at 86.69503, so the loss is only about 1 lower after 15 hours of training. Does anybody have experience with training DeepSpeech2 and can tell me if this is an OK result after 15 hours? It's the first time I've managed to get a speech-to-text framework up and running, so I have no idea what I'm doing; that's why this thread is probably going to end up with a bunch of questions from me to everybody that has answered and everybody that's going to answer in the future. Thank you very much for all your responses, it's going to save me ages of time compared to learning everything by googling around.

Yeah, I've looked at Mozilla's Common Voice pages a little bit, but none of my friends understand why I'm interested in this, so I'm having a hard time getting any contributions from them. And I'm a lazy guy, so I lose interest too fast when I'm just talking to a computer screen; that's why I started on the idea of using videos instead.

mjurkus commented 3 years ago

I haven't used this implementation of DeepSpeech2 and can't comment on it. How much data do you have, in hours? With different implementations (in PyTorch) I had about 150 hours of data (and I got feedback that it's really not enough) and trained for at least 5 days to get something reasonable, but it still sucked :D

The loss metric should not be your main indicator. It's useful while training, but in the end you want to check the accuracy of the model, in this case WER/CER. You can stop training, run a test, check the error rates, and decide if it's good enough for you.
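
If you don't have a metric script handy, WER/CER are just edit distances normalized by the reference length; a quick self-contained sketch:

```python
def edit_distance(ref, hyp):
    # Standard dynamic-programming Levenshtein distance between two sequences.
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i
    for j in range(len(hyp) + 1):
        dist[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,         # deletion
                             dist[i][j - 1] + 1,         # insertion
                             dist[i - 1][j - 1] + cost)  # substitution
    return dist[-1][-1]

def wer(reference, hypothesis):
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / max(len(ref_words), 1)

def cer(reference, hypothesis):
    return edit_distance(list(reference), list(hypothesis)) / max(len(reference), 1)

# e.g. wer("det er en fin dag", "det er fin dag") -> 0.2 (one word missed out of five)
```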

Try sharing on Reddit or other social media to get more attention and collect more voices.