flashlight / wav2letter

Facebook AI Research's Automatic Speech Recognition Toolkit
https://github.com/facebookresearch/wav2letter/wiki

loss:inf and GPU with low utilization rate #206

Closed: phecda-xu closed this issue 5 years ago

phecda-xu commented 5 years ago

Thanks for your excellent work!

I trained on a new dataset, but the loss is 'inf'.

train.cfg details (on Linux Ubuntu 16.04):

--datadir=/data/CNSpeech/
--tokensdir=/data/CNSpeech/
--rundir=/data/CNSpeech/
--archdir=./wav2letter/tutorials/AIshell/
--train=data/train
--valid=data/dev
--input=wav
--arch=network.arch
--tokens=data/tokens.txt
--criterion=ctc
--lr=0.1
--maxgradnorm=1.0
--replabel=2
--surround=|
--onorm=target
--sqnorm=true
--mfsc=true
--filterbanks=40
--nthread=8
--batchsize=4
--runname=AIshell_trainlogs
--iter=50
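
For reference, a run with this flags file is typically launched through the wav2letter++ Train binary, along these lines (the build path here is taken from the process name in the nvidia-smi output below):

./wav2letter/build/Train train --flagsfile=train.cfg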

Training logs:

I0215 01:21:18.356604   671 Train.cpp:279] epoch:        2 | lr: 0.100000 | lrcriterion: 0.000000 | runtime: 00:00:07 | bch(ms): 70.58 | smp(ms): 46.81 | fwd(ms): 12.42 | crit-fwd(ms): 9.88 | bwd(ms): 5.22 | optim(ms): 1.17 | loss:        inf | train-TER: 99.96 | data/dev-TER: 99.89 | avg-isz: 524 | avg-tsz: 030 | max-tsz: 069 | hrs:    0.58 | thrpt(sec/sec): 297.15
I0215 01:22:41.344269   671 Train.cpp:279] epoch:        2 | lr: 0.100000 | lrcriterion: 0.000000 | runtime: 00:00:07 | bch(ms): 73.59 | smp(ms): 50.86 | fwd(ms): 11.64 | crit-fwd(ms): 9.20 | bwd(ms): 4.89 | optim(ms): 1.19 | loss:        inf | train-TER: 99.98 | data/dev-TER: 100.00 | avg-isz: 461 | avg-tsz: 027 | max-tsz: 055 | hrs:    0.51 | thrpt(sec/sec): 250.96
I0215 01:24:02.712700   671 Train.cpp:279] epoch:        2 | lr: 0.100000 | lrcriterion: 0.000000 | runtime: 00:00:07 | bch(ms): 79.25 | smp(ms): 55.30 | fwd(ms): 11.64 | crit-fwd(ms): 9.16 | bwd(ms): 5.10 | optim(ms): 1.29 | loss:        inf | train-TER: 100.00 | data/dev-TER: 100.00 | avg-isz: 485 | avg-tsz: 029 | max-tsz: 058 | hrs:    0.54 | thrpt(sec/sec): 245.08
I0215 01:25:25.598361   671 Train.cpp:279] epoch:        2 | lr: 0.100000 | lrcriterion: 0.000000 | runtime: 00:00:07 | bch(ms): 74.09 | smp(ms): 50.92 | fwd(ms): 11.65 | crit-fwd(ms): 9.19 | bwd(ms): 5.11 | optim(ms): 1.22 | loss:        inf | train-TER: 100.00 | data/dev-TER: 100.00 | avg-isz: 489 | avg-tsz: 028 | max-tsz: 054 | hrs:    0.54 | thrpt(sec/sec): 264.38
I0215 01:26:45.659701   671 Train.cpp:279] epoch:        2 | lr: 0.100000 | lrcriterion: 0.000000 | runtime: 00:00:07 | bch(ms): 75.23 | smp(ms): 52.76 | fwd(ms): 11.75 | crit-fwd(ms): 9.32 | bwd(ms): 5.01 | optim(ms): 1.16 | loss:        inf | train-TER: 100.00 | data/dev-TER: 100.00 | avg-isz: 484 | avg-tsz: 027 | max-tsz: 059 | hrs:    0.54 | thrpt(sec/sec): 257.56

Any ideas to solve this problem? Thanks!

Training also seems slow. I use a single TITAN V GPU with 12 GB of memory, but its utilization rate is only about 38%.

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.79       Driver Version: 410.79       CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  TITAN V             Off  | 00000000:01:00.0 Off |                  N/A |
| 37%   53C    P2    46W / 250W |   4028MiB / 12035MiB |     14%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      3970      C   ./wav2letter/build/Train                    4017MiB |
+-----------------------------------------------------------------------------+

Did I do something wrong, and how can I improve the utilization rate? Looking forward to your help! Thanks!

jacobkahn commented 5 years ago

@phecda-xu — a snapshot of nvidia-smi may not be the best proxy for how much the GPU is actually being used considering there are some CPU-bound parts of training. Is the lower usage you see consistent over time? How many cores is your CPU/how much memory is available?
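
To see whether the low usage is sustained rather than momentary, you can sample utilization over time, for example:

nvidia-smi --query-gpu=timestamp,utilization.gpu,memory.used --format=csv -l 1

This prints one reading per second, which makes consistent dips easy to spot.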

It's possible your model is simply diverging from the start. I'd start by trying a few suggestions in https://github.com/facebookresearch/wav2letter/issues/168. Certainly decrease your learning rate to see if you can get a valid loss.
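
One convenient way to sweep the learning rate (assuming command-line flags passed after --flagsfile override the file's values, as with standard gflags parsing) is:

./wav2letter/build/Train train --flagsfile=train.cfg --lr=0.01

and keep lowering --lr until the loss becomes finite.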

phecda-xu commented 5 years ago

> @phecda-xu — a snapshot of nvidia-smi may not be the best proxy for how much the GPU is actually being used considering there are some CPU-bound parts of training. Is the lower usage you see consistent over time? How many cores is your CPU/how much memory is available?
>
> It's possible your model is simply diverging from the start. I'd start by trying a few suggestions in #168. Certainly decrease your learning rate to see if you can get a valid loss.

@jacobkahn Thanks for your reply! I checked again and you are right: the GPU usage is volatile over time. I changed the lr from 0.1 to 0.0001 to get a valid loss, and it works! Thank you very much!

keshawnhsieh commented 5 years ago

@phecda-xu I am working on Chinese speech recognition. I am wondering if you could kindly share your source code for preprocessing AIShell before feeding it into wav2letter++ training. Thanks.

phecda-xu commented 5 years ago

> @phecda-xu I am working on Chinese speech recognition. I am wondering if you could kindly share your source code for preprocessing AIShell before feeding it into wav2letter++ training. Thanks.

@keshawnhsieh I processed AIShell with pandas, since I had collected the transcript text into a CSV file like this:

|Content|FileName|URL|
|---|---|---|
|Chinese transcript text|original audio file name|audio file storage path|

The code is below; it may have some errors, so be careful!


# coding: utf-8
# by: phecda xu
# Preprocess AIShell data for wav2letter++.
import os
import shutil

import jieba
import pandas as pd

base_path = 'data/'

data_aishell = pd.read_csv('data/AISHELL/AISHELL-1.csv')[['Content', 'FileName', 'URL']]

def seg_dataSet(dataFrame):
    # Split into 5% test, 5% dev, and 90% train.
    data_test = dataFrame[:int(dataFrame.shape[0] * 0.05)]
    data_dev = dataFrame[int(dataFrame.shape[0] * 0.05):int(dataFrame.shape[0] * 0.1)]
    data_train = dataFrame[int(dataFrame.shape[0] * 0.1):]
    print(data_test.shape, data_dev.shape, data_train.shape, dataFrame.shape)
    return data_test, data_dev, data_train

data_aishell_test, data_aishell_dev, data_aishell_train = seg_dataSet(data_aishell)

train_data = data_aishell_train.reset_index(drop=True)
test_data = data_aishell_test.reset_index(drop=True)
dev_data = data_aishell_dev.reset_index(drop=True)

def pre_processing_content(dataFrame, folder):
    # Derive local paths, sequential ids, and tokenized transcripts.
    dataFrame['path'] = dataFrame.URL.apply(lambda x: base_path + '/'.join(x.split('/')[6:]))
    dataFrame['FileName'] = dataFrame.FileName.apply(lambda x: x.split('.')[0])
    dataFrame['id'] = dataFrame.FileName.index.tolist()
    dataFrame['new_FileName'] = dataFrame.id.apply(lambda x: "%09d" % x)
    # Word-segment the Chinese transcript with jieba.
    dataFrame['cut'] = dataFrame.Content.apply(lambda x: ' '.join(jieba.lcut(x)))
    # Character-level tokens; '|' marks word boundaries.
    dataFrame['token'] = dataFrame.cut.apply(lambda x: ' '.join(list(x.replace(' ', '|'))))
    dataFrame['newfile'] = dataFrame.new_FileName.apply(
        lambda x: 'wav2letter++/AISHELL/data/' + folder + '/' + x + '.wav')
    return dataFrame

train = pre_processing_content(train_data, 'train')
test = pre_processing_content(test_data, 'test')
dev = pre_processing_content(dev_data, 'dev')

train = train[['cut', 'token', 'path', 'id', 'new_FileName', 'newfile']]
test = test[['cut', 'token', 'path', 'id', 'new_FileName', 'newfile']]
dev = dev[['cut', 'token', 'path', 'id', 'new_FileName', 'newfile']]

def write_id(folder_basepath, basepath):
    # One .id file per sample, holding its numeric id.
    with open(folder_basepath + '/' + basepath + ".id", "w") as f:
        f.write("file_id\t{fid}".format(fid=int(basepath)))

def write_tkn(folder_basepath, basepath, token):
    # One .tkn file per sample, holding the token sequence.
    with open(folder_basepath + '/' + basepath + ".tkn", "w") as f:
        f.write(token)

def write_wrd(folder_basepath, basepath, cut):
    # One .wrd file per sample, holding the word-level transcript.
    with open(folder_basepath + '/' + basepath + ".wrd", "w") as f:
        f.write(cut)

def copy(file_path, newfile):
    shutil.copyfile(file_path, newfile)

def export_data(dataFrame, folder_basepath):
    # Write the .wrd/.tkn/.id files and copy the renamed audio.
    for _, row in dataFrame.iterrows():
        write_wrd(folder_basepath, row['new_FileName'], row['cut'])
        write_tkn(folder_basepath, row['new_FileName'], row['token'])
        write_id(folder_basepath, row['new_FileName'])
        copy(row['path'], row['newfile'])

def main(path, mode, dataFrame):
    folder_basepath = path + mode
    if os.path.exists(folder_basepath):
        print(folder_basepath)
    else:
        os.makedirs(folder_basepath)
        print('mkdir:', folder_basepath)
    export_data(dataFrame, folder_basepath)

if __name__ == "__main__":
    main('wav2letter++/AISHELL/data/', 'test', test)
    main('wav2letter++/AISHELL/data/', 'dev', dev)
    main('wav2letter++/AISHELL/data/', 'train', train)
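
For illustration, for the first training sample (id 0) the script emits one file set per utterance, roughly like this; the transcript 今天天气很好 is a hypothetical example, not from the dataset:

wav2letter++/AISHELL/data/train/000000000.wav   (audio copied from the original path)
wav2letter++/AISHELL/data/train/000000000.wrd   contains: 今天 天气 很 好
wav2letter++/AISHELL/data/train/000000000.tkn   contains: 今 天 | 天 气 | 很 | 好
wav2letter++/AISHELL/data/train/000000000.id    contains: file_id<TAB>0
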
GabrielLin commented 5 years ago

@phecda-xu Could you please share your training log and test results on AIShell? Do you use a Chinese language model? Thanks.