batumi / KartuliSpeechRecognition

Building a speech recognition system for Georgian Android users
Apache License 2.0

Convert .raw into .mp3 and extract timings for praat annotation #17

Closed cesine closed 10 years ago

cesine commented 10 years ago

If the user's audio results are saved (see https://github.com/batumi/KartuliSpeechRecognition/issues/15), then users who edit the recognition results to match what they actually said will be able to train their own acoustic models, which will yield the largest gain in accuracy.

Sample audio:

To permit time-aligned segmentation in Praat, this is the command that needs to be added to the AudioWebService:

ffmpeg -f s16le -ar 8k -ac 2 -i audio_utterance_1408645116692.raw audio_utterance_1408645116692.wav

[Screenshot, 2014-08-21 at 2:45:49 pm]

The quality is much better when the conversion runs within the audio web service, where our ffmpeg command is more complete. It also turned out that the Android audio is more likely wideband at 16k (which is wonderful):

 curl -k -F files[]=@testinstallpocketsphinx/android_8k.raw -F token=mytokengoeshere -F username=testingupload -F dbname=testingupload-firstcorpus $SERVER/upload/extract/utterances
ffmpeg -y  -f s16le -ar 16k  -i "AudioWebService/bycorpus/testingupload-firstcorpus/android_8k/android_8k.raw" -ac 1  "AudioWebService/rawdata/bf06325033e361980dbc41fbd5a368cdb5500671.mp3"
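A headerless .raw file carries no sample rate or channel count, so ffmpeg has to be told how to read it. The sketch below is one way to see why the first command (8k, 2 channels) and the audio web service's command (16k, assumed here to be read as mono since no input channel count is given) both produce a file of plausible length: the two readings imply the same byte rate, so only the sound quality gives the wrong one away. The function name and the file size are hypothetical, chosen only for illustration.

```python
def raw_pcm_duration(num_bytes, sample_rate, channels, bytes_per_sample=2):
    """Duration in seconds of headerless PCM audio.

    s16le means signed 16-bit little-endian, i.e. 2 bytes per sample.
    """
    bytes_per_second = sample_rate * channels * bytes_per_sample
    return num_bytes / bytes_per_second


# Hypothetical file size in bytes (not taken from the issue).
size = 180_000

# Reading used by the first command: -ar 8k -ac 2
as_8k_stereo = raw_pcm_duration(size, 8_000, 2)

# Reading assumed for the audio web service's command: -ar 16k, mono input
as_16k_mono = raw_pcm_duration(size, 16_000, 1)

# Both interpretations consume 32000 bytes per second,
# so the computed durations are identical.
print(as_8k_stereo, as_16k_mono)
```

Because duration alone cannot distinguish the two interpretations, listening to the output or inspecting it in Praat remains the practical check.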

[Screenshot, 2014-08-21 at 3:27:09 pm]

File type = "ooTextFile"
Object class = "TextGrid"

xmin = 0 
xmax = 5.625679012345679 
tiers? <exists> 
size = 1 
item []: 
    item [1]:
        class = "IntervalTier" 
        name = "silences" 
        xmin = 0 
        xmax = 5.625679012345679 
        intervals: size = 24 
        intervals [1]:
            xmin = 0 
            xmax = 0.23685914625640092 
            text = "" 
        intervals [2]:
            xmin = 0.23685914625640092 
            xmax = 0.32080147977709955 
            text = "ks" 
        intervals [3]:
            xmin = 0.32080147977709955 
            xmax = 0.7510059390706799 
            text = "" 
        intervals [4]:
            xmin = 0.7510059390706799 
            xmax = 0.8419434670514367 
            text = "s" 
        intervals [5]:
            xmin = 0.8419434670514367 
            xmax = 1.3176166903353954 
            text = "jitter" 
        intervals [6]:
            xmin = 1.3176166903353954 
            xmax = 1.754816344089034 
            text = "" 
        intervals [7]:
            xmin = 1.754816344089034 
            xmax = 1.9401889972805768 
            text = "s f" 
        intervals [8]:
            xmin = 1.9401889972805768 
            xmax = 2.331919887043837 
            text = "" 
        intervals [9]:
            xmin = 2.331919887043837 
            xmax = 2.744636360187272 
            text = "sizis" 
        intervals [10]:
            xmin = 2.744636360187272 
            xmax = 3.003458555209426 
            text = "" 
        intervals [11]:
            xmin = 3.003458555209426 
            xmax = 3.0769080970400373 
            text = "f" 
        intervals [12]:
            xmin = 3.0769080970400373 
            xmax = 3.4371606117330353 
            text = "" 
        intervals [13]:
            xmin = 3.4371606117330353 
            xmax = 3.5665717092441125 
            text = "s" 
        intervals [14]:
            xmin = 3.5665717092441125 
            xmax = 4.16083950617284 
            text = "" 
        intervals [15]:
            xmin = 4.16083950617284 
            xmax = 4.280839506172839 
            text = "silent" 
        intervals [16]:
            xmin = 4.280839506172839 
            xmax = 4.416487836141186 
            text = "" 
        intervals [17]:
            xmin = 4.416487836141186 
            xmax = 4.500430169661884 
            text = "t" 
        intervals [18]:
            xmin = 4.500430169661884 
            xmax = 4.584372503182583 
            text = "\d" 
        intervals [19]:
            xmin = 4.584372503182583 
            xmax = 4.5983628921027 
            text = "" 
        intervals [20]:
            xmin = 4.5983628921027 
            xmax = 4.710286003463631 
            text = "s" 
        intervals [21]:
            xmin = 4.710286003463631 
            xmax = 5.11283950617284 
            text = "" 
        intervals [22]:
            xmin = 5.11283950617284 
            xmax = 5.33683950617284 
            text = "silent" 
        intervals [23]:
            xmin = 5.33683950617284 
            xmax = 5.456839506172839 
            text = "sounding" 
        intervals [24]:
            xmin = 5.456839506172839 
            xmax = 5.625679012345679 
            text = "silent" 
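
The timings in a TextGrid like the one above can be pulled out with a few lines of code. This is a minimal sketch, not the AudioWebService implementation: it assumes the long ("ooTextFile") format shown here, with `xmin`, `xmax`, and `text` in that order for every interval, and it does not handle labels containing Praat's doubled-quote escape. The names `INTERVAL_RE` and `parse_intervals` are hypothetical.

```python
import re

# Matches one interval entry of a long-format TextGrid.
# Assumes the field order xmin, xmax, text, as in the dump above.
INTERVAL_RE = re.compile(
    r'intervals \[(\d+)\]:\s*'
    r'xmin = ([\d.]+)\s*'
    r'xmax = ([\d.]+)\s*'
    r'text = "([^"]*)"'
)


def parse_intervals(textgrid_text):
    """Return a list of (xmin, xmax, text) tuples, one per interval."""
    return [
        (float(xmin), float(xmax), text)
        for _index, xmin, xmax, text in INTERVAL_RE.findall(textgrid_text)
    ]


# Three intervals copied from the TextGrid above.
sample = '''
        intervals [4]:
            xmin = 0.7510059390706799
            xmax = 0.8419434670514367
            text = "s"
        intervals [5]:
            xmin = 0.8419434670514367
            xmax = 1.3176166903353954
            text = "jitter"
        intervals [6]:
            xmin = 1.3176166903353954
            xmax = 1.754816344089034
            text = ""
'''

# Keep only intervals that carry a label; empty text marks an unlabelled stretch.
labelled = [iv for iv in parse_intervals(sample) if iv[2]]
for xmin, xmax, text in labelled:
    print(f"{xmin:.3f}-{xmax:.3f}s  {text}")
```

The header line `intervals: size = 24` is skipped automatically because the pattern requires a bracketed index.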
cesine commented 10 years ago

Deployed to the audio service in https://github.com/OpenSourceFieldlinguistics/AudioWebService/pull/1