flashlight / wav2letter

Facebook AI Research's Automatic Speech Recognition Toolkit
https://github.com/facebookresearch/wav2letter/wiki

OOM when training stream convnet on custom data #937

Closed tranmanhdat closed 3 years ago

tranmanhdat commented 3 years ago

Bug Description

Memory is not released while training the model: the buff/cache portion of memory keeps growing after training starts and eventually leads to an OOM.
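As a hedged sketch (not part of the original report), one way to confirm this growth is to poll /proc/meminfo while the training run is active; the field names are standard Linux keys, and the polling interval is an arbitrary choice.

```python
# Sketch: watch buff/cache vs. available memory during training.
# Field names are standard /proc/meminfo keys; interval is arbitrary.
import time

FIELDS = ("MemTotal", "MemAvailable", "Buffers", "Cached")

def meminfo_mb():
    values = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, rest = line.split(":", 1)
            if key in FIELDS:
                values[key] = int(rest.split()[0]) // 1024  # kB -> MB
    return values

if __name__ == "__main__":
    while True:
        v = meminfo_mb()
        print(f"avail={v['MemAvailable']}MB buffers={v['Buffers']}MB "
              f"cached={v['Cached']}MB total={v['MemTotal']}MB")
        time.sleep(10)
```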

Reproduction Steps

Follow the tutorials and train on custom data (approximately 500 hours, ~150 GB).
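For illustration only, a small script like the one below can report the total duration and the longest clip of such a dataset (relevant to the question later in this thread); the directory layout and file extensions are assumptions, not taken from this issue.

```python
# Sketch: report total hours and the longest clip in a dataset.
# Assumes WAV/FLAC files under one directory tree; path and extensions
# are illustrative placeholders.
import pathlib
import soundfile as sf  # pip install soundfile

def scan(root="data/train"):
    total_s, longest_s, longest_path = 0.0, 0.0, None
    for path in pathlib.Path(root).rglob("*"):
        if path.suffix.lower() not in (".wav", ".flac"):
            continue
        dur = sf.info(str(path)).duration  # duration in seconds
        total_s += dur
        if dur > longest_s:
            longest_s, longest_path = dur, path
    print(f"total: {total_s / 3600:.1f} h, "
          f"longest: {longest_s:.1f} s ({longest_path})")

if __name__ == "__main__":
    scan()
```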

Platform and Hardware

Ubuntu 18.04.5 LTS, Intel(R) Xeon(R) CPU E5-2690 v4 @ 2.60GHz, 2x Tesla V100 GPUs, 64 GB RAM

Additional Context

Running with the latest CUDA docker image; the architecture is the same as in the tutorial. Attached screenshots: my flagsfile, memory usage while training, my GPUs, and the training process.

tlikhomanenko commented 3 years ago

What is your longest audio?

tranmanhdat commented 3 years ago

What is your longest audio?

Approximately 24s. I found that after I released the cache and restarted Docker, the bug was gone; maybe the Docker container didn't release all of its cache.
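For reference, a minimal sketch of dropping the kernel page cache (what "releasing the cache" usually means) is below; it assumes root on the host, or a privileged container, which is an assumption rather than something stated in this thread.

```python
# Sketch: drop the Linux page cache. Needs root; inside Docker it
# generally also needs a privileged container, since /proc/sys is
# mounted read-only otherwise.
import os

def drop_caches():
    os.sync()  # flush dirty pages first so clean caches can be freed
    with open("/proc/sys/vm/drop_caches", "w") as f:
        f.write("3\n")  # 3 = free page cache + dentries + inodes

if __name__ == "__main__":
    drop_caches()
```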