Closed: mpierre0220 closed this issue 3 years ago
@mpierre0220: I'm sorry you're having trouble. We're quite responsive to GitHub issues here, so if you're having a problem, you should open an issue. We're far more responsive here than on Facebook; the wav2letter group there is a place for the community to interact, but if you have issues that are blocking you from basic usage, you should post here.
I'd especially recommend against spending days fighting through issues when we might be able to immediately answer your question or help you around an obstacle we've seen before. It can turn a 5-day snag into a 5-minute one.
A fix for #375 is already in https://github.com/facebookresearch/wav2letter/pull/435, but I will prioritize getting this out today as a standalone fix since it seems to be reproing in other places. It only occurs with some versions of gcc in certain settings, and we haven't been able to repro it at all after trying across a ton of different environments, which is why we haven't been able to test a fix.
"problem is in the ArrayFire driver"
If you're having an issue with ArrayFire, you should take that issue to https://github.com/arrayfire/arrayfire. They are also very responsive and helpful.
"this code is doomed to fail"
Again, your question is about ArrayFire and you should open an issue there if you think there's a problem, but your analysis of this code is incorrect on a few different levels:
deviceMemInfo doesn't interact with the CUDA driver at all; it only reads internal state from the memory manager. Calls to deviceMemInfo from the C API get an af_err value back, which is handled accordingly. I'd recommend rerunning your code with CUDA_LAUNCH_BLOCKING=1, as it's possible that what you're seeing is an exception from an asynchronous CUDA kernel launch.
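A minimal sketch of such a rerun (the Train path and flags file here are illustrative; adjust them to your layout):

# CUDA_LAUNCH_BLOCKING=1 makes every kernel launch synchronous, so a failing
# kernel reports its error at the launch site instead of surfacing later in
# an unrelated call such as deviceMemInfo.
CUDA_LAUNCH_BLOCKING=1 /root/wav2letter/build/Train train --flagsfile train.cfg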
As for the exception you're seeing, I have a few questions:
@jacobkahn Thanks for your quick reply. Let me follow some of the trails you outlined; I will report back shortly.
@jacobkahn I am using the docker file that comes with wav2letter. I just issued the commands below:
git clone --recursive https://github.com/facebookresearch/wav2letter.git
cd wav2letter
sudo docker build --no-cache -f ./Dockerfile-CUDA -t wav2letter .
I then followed the instructions to add the LibriSpeech files, built the lists as described in https://github.com/facebookresearch/wav2letter/tree/master/tutorials/1-librispeech_clean, and initiated the training.
Could you think of any reasons this setup should fail as it is?
The train.cfg file that I am passing as a flags file is the vanilla one that came with the wav2letter tutorial and is below:
--datadir=/libris/w2l
--rundir=/libris/w2l
--archdir=/libris/wav2letter/tutorials/1-librispeech_clean/
--train=lists/train-clean-100.lst
--valid=lists/dev-clean.lst
--input=flac
--arch=network.arch
--tokens=/libris/w2l/am/tokens.txt
--lexicon=/libris/w2l/am/lexicon.txt
--criterion=ctc
--lr=0.1
--maxgradnorm=1.0
--replabel=1
--surround=|
--onorm=target
--sqnorm=true
--mfsc=true
--filterbanks=40
--nthread=4
--batchsize=4
--runname=librispeech_clean_trainlogs
--iter=5
So the step parameter you referred to is not in there. Finally, I set the environment variable CUDA_LAUNCH_BLOCKING=1 and did not observe any difference.
Could you think of any reason why this vanilla setup would be failing like this? Would I need to do anything in addition to building the docker image?
Hi @mpierre0220,
Could you first try the pre-built image and test whether training works for you inside it? You can use the following command to run a container from the pre-built image:
sudo docker run --runtime=nvidia --rm -itd --ipc=host --name w2l wav2letter/wav2letter:cuda-latest
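Once the container is up, a quick sanity check that the GPU is visible inside it (a sketch; w2l is the container name from the command above):

# Attach to the running container and query the GPU through the NVIDIA runtime
sudo docker exec -it w2l nvidia-smi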
I need some time to recheck whether the current image build works fine; I will send updates on a new image build shortly.
@tlikhomanenko Thanks for your input. After the build failed, running the pre-built image was the first thing I tried. Seeing that it failed too, I thought that building the image myself would update anything that was downlevel in the pre-built image. I got the same result.
I would be glad to try out any image that you build.
Best
@mpierre0220
When you build a new image with the same Dockerfile, the generated image can differ, because we don't specify exact versions for all libraries; that is why there can be errors connected to the latest system updates. (For that reason we build images and give them to the community, so people have no problems with the environment, and these images are tested so that all tests pass.) I had tested the wav2letter/wav2letter:cuda-latest image before, so please give running it a try.
@tlikhomanenko I did not sit idle. I wanted to get as much data as possible.
With the fix for #375 (templates added to Sound.cpp) I did not get relief; I still had a link error.
I could not get the docker image to train with either the CPU backend or the CUDA backend.
My machine is an older Dell with a 2nd-generation Core i7 and, apparently, a GPU that is not CUDA-capable, so I thought that maybe I was running into a hardware problem.
I got myself a new box with a 9th-generation Core i7 and a GeForce GTX 1660 Ti. I went through all the CUDA installation instructions (Docker, the NVIDIA drivers, nvidia-docker, libcudnn), but I was never able to get the NVIDIA runtime to work with Docker. I would get a CUDA initialization error when I tried to launch Docker with the NVIDIA runtime. The CUDA utilities are failing as well, which tells me that CUDA is in a sad shape on my new box. So I gave up on using CUDA for the time being.
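For reference, the kind of minimal smoke test I was running against the NVIDIA runtime (a sketch; the base image tag is illustrative) fails with the same initialization error:

# With a healthy NVIDIA runtime this prints the host's nvidia-smi table
sudo docker run --runtime=nvidia --rm nvidia/cuda:10.2-base nvidia-smi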
I tried running with the cpu backend on the new box and it is training now.
The bottom line is that the new hardware at least runs with the CPU backend, but I am having trouble taking advantage of CUDA on it, and the old hardware fails everywhere. I have yet to build locally and build the docker image. I don't know how much time the training will take.
Failure with the CPU backend on old hardware:

root@00ad4578658d:/libris# /root/wav2letter/build/Train train --flagsfile train.cfg
*** Aborted at 1575581603 (unix time) try "date -d @1575581603" if you are using GNU date ***
PC: @ 0x7fe9e35106f9 mkldnn::impl::get_msec()
*** SIGILL (@0x7fe9e35106f9) received by PID 1496 (TID 0x7fe9e7228bc0) from PID 18446744073228322553; stack trace: ***
    @ 0x7fe9dccd7390 (unknown)
    @ 0x7fe9e35106f9 mkldnn::impl::get_msec()
    @ 0x7fe9e358e94f mkldnn::impl::cpu::gemm_convolution_fwd_t::pd_t::create_primitive()
    @ 0x6b833d fl::conv2d()
    @ 0x695556 fl::Conv2D::forward()
    @ 0x6a30bf fl::UnaryModule::forward()
    @ 0x694522 fl::Sequential::forward()
    @ 0x48202f _ZZ4mainENKUlSt10shared_ptrIN2fl6ModuleEES_IN3w2l17SequenceCriterionEES_INS3_10W2lDatasetEES_INS0_19FirstOrderOptimizerEES9_ddbiE3_clES2_S5_S7_S9_S9_ddbi.constprop.11419
    @ 0x41ae88 main
    @ 0x7fe9dbe4c830 __libc_start_main
    @ 0x47d329 _start
    @ 0x0 (unknown)
Illegal instruction (core dumped)
Here is some info from my new box.
nvidia-smi -a output on the host for the new hardware (key fields; full output attached as nvidia.txt):

==============NVSMI LOG==============
Timestamp      : Thu Dec 5 16:54:09 2019
Driver Version : 440.33.01
CUDA Version   : 10.2

Attached GPUs  : 1
GPU 00000000:01:00.0
    Product Name     : GeForce GTX 1660 Ti
    Persistence Mode : Enabled
    VBIOS Version    : 90.16.20.40.DE
    FB Memory Usage
        Total : 5944 MiB
        Used  : 592 MiB
        Free  : 5352 MiB
    Compute Mode : Default
    Utilization
        Gpu    : 4 %
        Memory : 1 %
    Temperature
        GPU Current Temp : 47 C
    Processes : Xorg, gnome-shell, chrome (all type G, using GPU memory)
    [remaining fields omitted; see nvidia.txt]
CUDA sample mnistCUDNN output on the host for the new h/w (attached as mnistCUDNN.txt):

cudnnGetVersion() : 7605 , CUDNN_VERSION from cudnn.h : 7605 (7.6.5)
Host compiler version : GCC 7.4.0
Cuda failure
Error: unknown error
error_util.h:93
Aborting...

docker version output (attached as docker.txt):

Client: Docker Engine - Community
 Version:           19.03.5
 API version:       1.40
 Go version:        go1.12.12
 Git commit:        633a0ea838
 Built:             Wed Nov 13 07:29:52 2019
 OS/Arch:           linux/amd64
 Experimental:      false

Server: Docker Engine - Community
 Engine:
  Version:          19.03.5
  API version:      1.40 (minimum version 1.12)
  Go version:       go1.12.12
  Git commit:       633a0ea838
  Built:            Wed Nov 13 07:28:22 2019
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.2.10
  GitCommit:        b34a5c8af56e510852c35414db4c1f4fa6172339
 runc:
  Version:          1.0.0-rc8+dev
  GitCommit:        3e425f80a8c931f88e6d94a8c831b9d5aa481657
 docker-init:
  Version:          0.18.0
  GitCommit:        fec3683
Docker output when attempting to run the image with the nvidia runtime on the new h/w:

marc-arthur@marcarthur-G5-5590:~$ sudo docker run --runtime=nvidia -it --name w2lcuda wav2letter/wav2letter:cuda-latest
[sudo] password for marc-arthur:
docker: Error response from daemon: OCI runtime create failed: container_linux.go:346: starting container process caused "process_linux.go:449: container init caused \"process_linux.go:432: running prestart hook 1 caused \\"error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: initialization error: cuda error: unknown error\\n\\"\"": unknown.
ERRO[0014] error waiting for container: context canceled
@mpierre0220 I've experienced the crash you described on your old hardware with the CPU backend. The problem is that when mkl-dnn is built on newer CPUs (at least with default flags) it might crash on older CPUs simply because they do not have support for some of the instruction sets (possibly AVX2 and AVX512). You can compare the CPU differences by running cat /proc/cpuinfo on the respective machines.
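A quick way to compare them (a sketch; run this on each machine) is to list just the SIMD-related flags, since a SIGILL inside mkldnn usually means the binary uses an instruction set the CPU lacks:

# Print the AVX/FMA-family flags this CPU advertises; a 2nd-gen i7 (Sandy
# Bridge) will show avx but not avx2 or avx512f, while newer build machines do.
grep -m1 '^flags' /proc/cpuinfo | tr ' ' '\n' | grep -E '^(sse4_2|avx|avx2|avx512f|fma)$'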
The solution is to recompile mkl-dnn on your target machine and then make sure you run make on both flashlight and wav2letter.
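A sketch of that rebuild, assuming the sources are checked out under the home directory (the paths are hypothetical):

# Rebuild mkl-dnn natively on the old machine, then rebuild the dependents
cd ~/mkl-dnn/build && cmake .. && make -j"$(nproc)" && sudo make install
cd ~/flashlight/build && make -j"$(nproc)" && sudo make install
cd ~/wav2letter/build && make -j"$(nproc)"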
@maltium Thanks for the insight. I will follow your suggestions and report back. I am puzzled by the system's inability to use the CUDA backend, since I have the NVIDIA GTX 1660 Ti, which provides CUDA functionality. But at least I can launch training on the CPU backend. Makes for interesting research and permutations. ;-)
I have also been installing wav2letter for one week and have not succeeded so far. Can you help me through this process? I am installing it on Ubuntu 20.04 (CPU). Do I need a GPU for wav2letter?
The last 2 errors are:
cmake .. -DCMAKE_BUILD_TYPE=Release -DW2L_CRITERION_BACKEND=CUDA -DCMAKE_PREFIX_PATH=home/nccs/flashlight/cmake -DArrayFire_DIR=home/nccs/Downloads/arrayfire/share/ArrayFire/cmake
-- OpenMP found
-- ArrayFire found (include: /opt/arrayfire/include, library: ArrayFire::afcuda)
CMake Error at CMakeLists.txt:31 (find_package):
  By not providing "Findflashlight.cmake" in CMAKE_MODULE_PATH this project
  has asked CMake to find a package configuration file provided by
  "flashlight", but CMake did not find one.
Could not find a package configuration file provided by "flashlight" with any of the following names:
flashlightConfig.cmake
flashlight-config.cmake
Add the installation prefix of "flashlight" to CMAKE_PREFIX_PATH or set "flashlight_DIR" to a directory containing one of the above files. If "flashlight" provides a separate development package or SDK, be sure it has been installed.
-- Configuring incomplete, errors occurred!
See also "/home/nccs/wav2letter/build/CMakeFiles/CMakeOutput.log".
cmake .. -DCMAKE_BUILD_TYPE=Release -DFLASHLIGHT_BACKEND=CUDA -DArrayFire_DIR=/home/nccs/Downloads/arrayfire/share/ArrayFire/cmake
-- gtest found: (include: /usr/include, lib: /usr/lib/x86_64-linux-gnu/libgtest.a;/usr/lib/x86_64-linux-gnu/libgtest_main.a
[... repetitive MKL link-line probing elided ...]
-- MKL library found
-- CBLAS found (include: /home/nccs/intel/mkl/include, library: /home/nccs/intel/mkl/lib/intel64/libmkl_gf_lp64.so;/home/nccs/intel/mkl/lib/intel64/libmkl_sequential.so;/home/nccs/intel/mkl/lib/intel64/libmkl_core.so;/usr/lib/x86_64-linux-gnu/libm.so)
-- FFTW found
-- Looking for KenLM
-- Using kenlm library found in /home/nccs/kenlm/build/lib/libkenlm.a
-- Using kenlm utils library found in /home/nccs/kenlm/build/lib/libkenlm.a
-- kenlm lm/model.hh found in /home/nccs/kenlm/lm/model.hh
-- Found kenlm (include: /home/nccs/kenlm, library: /home/nccs/kenlm/build/lib/libkenlm.a;/home/nccs/kenlm/build/lib/libkenlm_util.a)
CMake Error at cmake/CUDAUtils.cmake:12 (message):
  CUDA required to build CUDA backend
Call Stack (most recent call first):
  lib/CMakeLists.txt:56 (include)
  CMakeLists.txt:75 (include)
-- Configuring incomplete, errors occurred!
See also "/home/nccs/flashlight/build/CMakeFiles/CMakeOutput.log".
See also "/home/nccs/flashlight/build/CMakeFiles/CMakeError.log".
nccs@nccs-OptiPlex-3070:~/flashlight/build$ ^C
So kindly help.
Before building wav2letter you need to successfully install flashlight first. Then, for both flashlight and wav2letter, you need to build either the CUDA backend (which you specified, but for it you need a GPU; the cmake error in flashlight is telling you that CUDA is not there while it is required) or the CPU backend. To build for CPU you need to provide -DFLASHLIGHT_BACKEND=CPU for flashlight and then -DW2L_CRITERION_BACKEND=CPU for wav2letter, as sketched below.
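A sketch of the two configure steps for the CPU backend (the directories are hypothetical; -DCMAKE_PREFIX_PATH points wav2letter at the flashlight install prefix):

# flashlight, CPU backend
cd ~/flashlight/build
cmake .. -DCMAKE_BUILD_TYPE=Release -DFLASHLIGHT_BACKEND=CPU
make -j"$(nproc)" && sudo make install

# wav2letter, CPU criterion backend, locating the installed flashlight
cd ~/wav2letter/build
cmake .. -DCMAKE_BUILD_TYPE=Release -DW2L_CRITERION_BACKEND=CPU -DCMAKE_PREFIX_PATH=/usr/local
make -j"$(nproc)"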
Closing due to inactivity; the problems above were solved earlier. The w2l codebase now lives in flashlight.
I have spent the last 5 days trying to get wav2letter to work. It has been a frustrating experience.
I tried to build it on my Ubuntu 18.04 LTS machine. After painstakingly installing all the dependencies (I spent 1 day doing that), I ran into the problem filed under #375 (a problem in Sound.cpp and W2LBlobsDataset.cpp) and got stuck there for another day trying to play with the templates and using the workaround outlined in the issue. Nothing worked.
I fell back on the docker image. I ran the image and downloaded the LibriSpeech files to try to do a training. I got the problem in this screenshot:
After some research it appears that the problem is in the ArrayFire driver, and people suggest installing a new driver. This is odd because that ArrayFire code is provided with the image. I looked at the code at the failing line in file arrayfire/src/api/cpp/device.cpp, and the function is below:

void deviceMemInfo(size_t *alloc_bytes, size_t *alloc_buffers,
                   size_t *lock_bytes, size_t *lock_buffers) {
    AF_THROW(af_device_mem_info(alloc_bytes, alloc_buffers,
                                lock_bytes, lock_buffers));
}
So the function is written to throw an exception by default and unless someone is catching that exception and taking action, this code is doomed to fail.
I went back and built the docker image, got the LibriSpeech files, and launched the training session again, and I got the same result.
So the bottom line is that I cannot use wav2letter (I cannot build it, I cannot use the docker image, and I cannot use the docker image I built). I realize that this is open source and we're at the mercy of people who have a little more knowledge of the stuff than people who are getting acquainted with the code, but this is not acceptable. I went to the Facebook page and commented on the posts of people who are experiencing the problem outlined under #375, and all I got was a like.
If Facebook does not have a strategy to provide better help to those who are trying this stuff out, I am afraid people will get discouraged and just leave it alone until Facebook has a version that is ready for prime time, or at least one that works decently. It is not right that I spent 5 days trying to get something working and I still cannot get it working.
What I think is happening is that the thing works on the developers' machines, and they forgot to outline all the hoops they had to jump through to get it working, leaving newcomers to tear their hair out trying to get it to work. It should not be like that.