Open ekdnam opened 3 years ago
cc @tdomhan
@blchu Maybe you can help to take a look too.
Hey there! I have the same problem. Strangely, it worked three months ago on the same Docker image, and now I'm getting this error.
I have mxnet-cu101==1.7.0 installed and am running Sockeye from inside an Ubuntu Docker container.
I tried reinstalling the dependencies, rebuilding the image, and rebooting the PC, but nothing seems to work. Is there a solution for this issue?
Description
(A clear and concise description of what the bug is.)
(Note: Original issue on Sockeye)
I am currently following this tutorial on Zero-Shot Translation; the notebook (on Google Colab) can be viewed here.
In the training step, for some reason, Sockeye is not able to acquire a GPU.
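One way to narrow down a "cannot acquire a GPU" symptom like this is to check separately whether the NVIDIA driver and MXNet itself can see a device. The following is only a minimal sketch, not something taken from the tutorial or from Sockeye: `gpu_report` is a hypothetical helper name, it assumes `nvidia-smi` is on the PATH whenever a driver is installed, and it relies on `mx.context.num_gpus()` from MXNet 1.x.

```python
import shutil
import subprocess


def gpu_report():
    """Collect basic facts about GPU visibility (hypothetical helper)."""
    smi = shutil.which("nvidia-smi")
    report = {"nvidia_smi_found": smi is not None}

    if smi is not None:
        try:
            # `nvidia-smi -L` prints one "GPU <n>: ..." line per visible device.
            proc = subprocess.run(
                [smi, "-L"], capture_output=True, text=True, timeout=30
            )
            report["driver_gpus"] = [
                line for line in proc.stdout.splitlines()
                if line.startswith("GPU ")
            ]
        except (OSError, subprocess.SubprocessError) as exc:
            report["nvidia_smi_error"] = repr(exc)

    try:
        import mxnet as mx
    except Exception as exc:  # missing or broken MXNet install
        report["mxnet_error"] = repr(exc)
        return report

    report["mxnet_version"] = mx.__version__
    try:
        # Number of GPUs that MXNet (not just the driver) can use.
        report["mxnet_num_gpus"] = mx.context.num_gpus()
    except Exception as exc:
        report["mxnet_error"] = repr(exc)
    return report


if __name__ == "__main__":
    for key, value in gpu_report().items():
        print(f"{key}: {value}")
```

If `driver_gpus` comes back empty, that would suggest the runtime has no GPU attached at all (on Colab, the hardware accelerator setting is worth checking). If the driver lists a GPU but `mxnet_num_gpus` is 0 or an error is reported, a mismatch between the MXNet build and the installed CUDA version becomes a more likely suspect.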
Error Message
(Paste the complete error message. Please also include the stack trace by setting the environment variable DMLC_LOG_STACK_TRACE_DEPTH=100 before running your script.)
The following output is seen repeatedly.
The entire output is this (I have to interrupt the execution of the kernel).
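For the stack-trace setting requested in the template above, the variable has to be in the environment of the process that runs the training. A minimal sketch of one way to do that from Python follows; `env_with_stack_trace` and `run_with_stack_trace` are hypothetical helper names, and the command in the `__main__` block is only a placeholder that echoes the variable back, not the actual training command from the notebook.

```python
import os
import subprocess
import sys


def env_with_stack_trace(depth=100, base_env=None):
    """Return a copy of the environment with DMLC_LOG_STACK_TRACE_DEPTH set."""
    env = dict(os.environ if base_env is None else base_env)
    env["DMLC_LOG_STACK_TRACE_DEPTH"] = str(depth)
    return env


def run_with_stack_trace(command, depth=100):
    """Run `command` (a list of arguments) with the variable set for the child."""
    return subprocess.run(
        command,
        env=env_with_stack_trace(depth),
        capture_output=True,
        text=True,
    )


if __name__ == "__main__":
    # Placeholder command: a child Python process that prints the variable.
    # Replace it with the real training command to capture its output.
    result = run_with_stack_trace(
        [sys.executable, "-c",
         "import os; print(os.environ['DMLC_LOG_STACK_TRACE_DEPTH'])"]
    )
    print(result.stdout.strip())
    print(result.stderr.strip())
```

Capturing both `stdout` and `stderr` this way also makes it easier to paste the complete output into the issue.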
To Reproduce
(If you developed your own code, please provide a short script that reproduces the error. For existing examples, please provide link.)
Execute the Colab notebook, which can be viewed here
Environment
We recommend using our script for collecting the diagnostic information with the following command
curl --retry 10 -s https://raw.githubusercontent.com/apache/incubator-mxnet/master/tools/diagnose.py | python3
Environment Information
# Paste the diagnose.py command output here
----------Python Info----------
Version : 3.7.11
Compiler : GCC 7.5.0
Build : ('default', 'Jul 3 2021 18:01:19')
Arch : ('64bit', '')
------------Pip Info-----------
Version : 21.1.3
Directory : /usr/local/lib/python3.7/dist-packages/pip
----------MXNet Info-----------
Version : 1.8.0
Directory : /usr/local/lib/python3.7/dist-packages/mxnet
Commit hash file "/usr/local/lib/python3.7/dist-packages/mxnet/COMMIT_HASH" not found. Not installed from pre-built package or built from source.
Library : ['/usr/local/lib/python3.7/dist-packages/mxnet/libmxnet.so']
Build features:
✔ CUDA ✔ CUDNN ✔ NCCL ✔ CUDA_RTC ✖ TENSORRT ✔ CPU_SSE ✔ CPU_SSE2 ✔ CPU_SSE3 ✖ CPU_SSE4_1 ✖ CPU_SSE4_2
✖ CPU_SSE4A ✖ CPU_AVX ✖ CPU_AVX2 ✔ OPENMP ✖ SSE ✖ F16C ✖ JEMALLOC ✔ BLAS_OPEN ✖ BLAS_ATLAS ✖ BLAS_MKL
✖ BLAS_APPLE ✔ LAPACK ✔ MKLDNN ✔ OPENCV ✖ CAFFE ✖ PROFILER ✔ DIST_KVSTORE ✖ CXX14 ✖ INT64_TENSOR_SIZE ✔ SIGNAL_HANDLER
✖ DEBUG ✖ TVM_OP
----------System Info----------
Platform : Linux-5.4.104+-x86_64-with-Ubuntu-18.04-bionic
system : Linux
node : aa12ac7e34fe
release : 5.4.104+
version : #1 SMP Sat Jun 5 09:50:34 PDT 2021
----------Hardware Info----------
machine : x86_64
processor : x86_64
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 2
On-line CPU(s) list: 0,1
Thread(s) per core: 2
Core(s) per socket: 1
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 63
Model name: Intel(R) Xeon(R) CPU @ 2.30GHz
Stepping: 0
CPU MHz: 2299.998
BogoMIPS: 4599.99
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 46080K
NUMA node0 CPU(s): 0,1
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm invpcid_single ssbd ibrs ibpb stibp fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid xsaveopt arat md_clear arch_capabilities
----------Network Test----------
Setting timeout: 10
Timing for MXNet: https://github.com/apache/incubator-mxnet, DNS: 0.0022 sec, LOAD: 0.4217 sec.
Timing for Gluon Tutorial(en): http://gluon.mxnet.io, DNS: 0.0237 sec, LOAD: 0.0693 sec.
Error open Gluon Tutorial(cn): https://zh.gluon.ai,

The files mentioned in the notebook (taken from the aforementioned tutorial) can be viewed here.
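The build features in the diagnose.py output above mark CUDA, CUDNN and NCCL with ✔, which indicates that the installed MXNet binary was compiled with GPU support, so a CPU-only package is unlikely to be the cause here. As a small sketch for checking such output programmatically, the flags can be turned into a dictionary; `parse_build_features` is a hypothetical helper, not part of diagnose.py, and it assumes the flags appear as alternating mark/name tokens as in the report.

```python
def parse_build_features(text):
    """Map feature names to True (✔) or False (✖) from a diagnose.py flag list."""
    tokens = text.split()
    if len(tokens) % 2 != 0:
        raise ValueError("expected alternating mark/name tokens")
    features = {}
    # Tokens alternate: mark, name, mark, name, ...
    for mark, name in zip(tokens[::2], tokens[1::2]):
        if mark not in ("✔", "✖"):
            raise ValueError(f"unexpected mark {mark!r} before {name!r}")
        features[name] = mark == "✔"
    return features


if __name__ == "__main__":
    # A short excerpt of the flags reported in this issue.
    sample = "✔ CUDA ✔ CUDNN ✔ NCCL ✔ CUDA_RTC ✖ TENSORRT"
    flags = parse_build_features(sample)
    print("GPU build:", flags["CUDA"] and flags["CUDNN"])
```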