I built my own dataset (~9.5 hours of audio) for the Bahraini dialect of Arabic. My validation loss is around 1.5. I think this is partly due to how I defined the Arabic symbols. Is my implementation correct? Could someone please help?

_pad = ''
_punctuation = '.!,؟*: '
_special = '-'

# Phonemes

_vowels = 'واي'
_non_pulmonic_consonants = ''
_pulmonic_consonants = 'لإإلأابتثجحخدذرزسشصضطظعغفقكلمنهويءؤآ'
_suprasegmentals = 'ˈˌːˑ'
_other_symbols = ''
_diacrilics = 'ّ'
_extra_phons = []  # some extra symbols that I found in Wiktionary IPA annotations

_extra_phons = ['g', 'ɝ', '̃', '̍', '̥', '̩', '̯', '͡']  # some extra symbols that I found in Wiktionary IPA annotations

phonemes = list(
    _pad + _punctuation + _special + _vowels + _non_pulmonic_consonants
    + _pulmonic_consonants + _suprasegmentals + _other_symbols + _diacrilics
) + _extra_phons

phonemes_set = set(phonemes)
silent_phonemes_indices = [i for i, p in enumerate(phonemes) if p in _pad + _punctuation]
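One way to sanity-check a symbol list like the one above is to look for characters that are listed more than once and for transcript characters that are not listed at all. If the list is later turned into a symbol-to-ID dictionary, a repeated character keeps only its last index, so the earlier slots are never used. The following is a minimal sketch, not part of tacotron2: `find_duplicates`, `dedupe`, `missing_symbols`, and the sample word are hypothetical names and data chosen for illustration, and the empty symbol groups from the post are left out because they contribute nothing.

```python
from collections import Counter

# Symbol groups copied from the post; an empty _pad contributes no symbol.
_pad = ''
_punctuation = '.!,؟*: '
_special = '-'
_vowels = 'واي'
_pulmonic_consonants = 'لإإلأابتثجحخدذرزسشصضطظعغفقكلمنهويءؤآ'
_suprasegmentals = 'ˈˌːˑ'
_diacrilics = 'ّ'
_extra_phons = ['g', 'ɝ', '̃', '̍', '̥', '̩', '̯', '͡']

phonemes = list(
    _pad + _punctuation + _special + _vowels
    + _pulmonic_consonants + _suprasegmentals + _diacrilics
) + _extra_phons


def find_duplicates(symbols):
    """Return {symbol: count} for every symbol listed more than once."""
    return {s: n for s, n in Counter(symbols).items() if n > 1}


def dedupe(symbols):
    """Drop repeated symbols while keeping first-seen order."""
    return list(dict.fromkeys(symbols))


def missing_symbols(text, symbols):
    """Return the sorted characters of `text` that have no entry in `symbols`."""
    known = set(symbols)
    return sorted({ch for ch in text if ch not in known})


duplicates = find_duplicates(phonemes)
unique_phonemes = dedupe(phonemes)
symbol_to_id = {s: i for i, s in enumerate(unique_phonemes)}

# Hypothetical transcript word; replace with lines from the real dataset.
sample = 'مدرسة'
print('duplicates:', duplicates)
print('missing from symbol list:', missing_symbols(sample, unique_phonemes))
```

Running `missing_symbols` over every transcript line in the dataset would show which characters the text cleaner silently drops or fails on. Whether those characters should be added to the list or normalized away depends on how the transcripts were written.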

NVIDIA / tacotron2

symbols.py for Arabic letters #603
