apache / mxnet

Lightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware Dataflow Dep Scheduler; for Python, R, Julia, Scala, Go, Javascript and more
https://mxnet.apache.org
Apache License 2.0

After the interruption, there are many zombie processes that cannot be killed. #14660

Open Hughen opened 5 years ago

Hughen commented 5 years ago

Description

After interrupting train.py, many zombie processes remain that cannot be killed. It seems that the GPU tasks are not being cleaned up properly.

Environment info (Required)

----------Python Info----------                                                                                                                                                                                   
('Version      :', '2.7.12')                                                                                                                                                                                      
('Compiler     :', 'GCC 5.4.0 20160609')                                                                                                                                                                          
('Build        :', ('default', 'Nov 12 2018 14:36:49'))                                                                                                                                                           
('Arch         :', ('64bit', 'ELF'))                                                                                                                                                                              
------------Pip Info-----------                                                                                                                                                                                   
('Version      :', '18.1')                                                                                                                                                                                        
('Directory    :', '/usr/local/lib/python2.7/dist-packages/pip')                                                                                                                                                  
----------MXNet Info-----------                                                                                                                                                                                   
('Version      :', '1.3.1')                                                                                                                                                                                       
('Directory    :', '/usr/local/lib/python2.7/dist-packages/mxnet')                                                                                                                                                
('Commit Hash   :', '19c501680183237d52a862e6ae1dc4ddc296305b')                                                                                                                                                   
----------System Info----------                                                                                                                                                                                   
('Platform     :', 'Linux-4.14.74-coreos-x86_64-with-Ubuntu-16.04-xenial')                                                                                                                                        
('system       :', 'Linux')                                                                                                                                                                                       
('node         :', 'kindle-zhiyuan-mxnet')                                                                                                                                                                        
('release      :', '4.14.74-coreos')                                                                                                                                                                              
('version      :', '#1 SMP Mon Oct 22 22:12:42 UTC 2018')                                                                                                                                                         
----------Hardware Info----------                                                                                                                                                                                 
('machine      :', 'x86_64')                                                                                                                                                                                      
('processor    :', 'x86_64')                                                                                                                                                                                      
Architecture:          x86_64                                                                                                                                                                                     
CPU op-mode(s):        32-bit, 64-bit                                                                                                                                                                             
Byte Order:            Little Endian                                                                                                                                                                              
CPU(s):                64                                                                                                                                                                                         
On-line CPU(s) list:   0-63                                                                                                                                                                                       
Thread(s) per core:    2                                                                                                                                                                                          
Core(s) per socket:    16                                                                                                                                                                                         
Socket(s):             2                                                                                                                                                                                          
NUMA node(s):          2                                                                                                                                                                                          
Vendor ID:             GenuineIntel                                                                                                                                                                               
CPU family:            6                                                                                                                                                                                          
Model:                 85                                                                                                                                                                                         
Model name:            Intel(R) Xeon(R) Gold 6130 CPU @ 2.10GHz                                                                                                                                                   
Stepping:              4                                                                                                                                                                                          
CPU MHz:               3146.501                                                                                                                                                                                   
CPU max MHz:           2101.0000                                                                                                                                                                                  
CPU min MHz:           1000.0000                                                                                                                                                                                  
BogoMIPS:              4206.43                                                                                                                                                                                    
Virtualization:        VT-x                                                                                                                                                                                       
Hypervisor vendor:     vertical                                                                                                                                                                                   
Virtualization type:   full                                                                                                                                                                                       
L1d cache:             32K                                                                                                                                                                                        
L1i cache:             32K                                                                                                                                                                                        
L2 cache:              1024K                                                                                                                                                                                      
L3 cache:              22528K                                                                                                                                                                                     
NUMA node0 CPU(s):     0-15,32-47                                                                                                                                                                                 
NUMA node1 CPU(s):     16-31,48-63
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single pti intel_ppin ssbd mba ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb intel_pt avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts pku ospke flush_l1d
----------Network Test----------
Setting timeout: 10
Error open MXNet: https://github.com/apache/incubator-mxnet, <urlopen error timed out>, DNS finished in 0.0109460353851 sec.                                                                                      
Error open PYPI: https://pypi.python.org/pypi/pip, <urlopen error [Errno 99] Cannot assign requested address>, DNS finished in 2.49472808838 sec.                                                                 
Error open FashionMNIST: https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/dataset/fashion-mnist/train-labels-idx1-ubyte.gz, <urlopen error [Errno 99] Cannot assign requested address>, DNS finished in 0.435553073883 sec.
Error open Conda: https://repo.continuum.io/pkgs/free/, <urlopen error [Errno 99] Cannot assign requested address>, DNS finished in 1.36981892586 sec.                                                            
Error open Gluon Tutorial(en): http://gluon.mxnet.io, <urlopen error timed out>, DNS finished in 6.36236405373 sec.                                                                                               
Error open Gluon Tutorial(cn): https://zh.gluon.ai, <urlopen error timed out>, DNS finished in 3.11570119858 sec.

Error Message:

The dmesg log also shows a hung-task stack trace like this:

[11975464.417683] INFO: task train_stereo.py:25298 blocked for more than 120 seconds.
[11975464.425977]       Tainted: P           OE   4.14.74-coreos #1
[11975464.432405] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[11975464.441216] train_stereo.py D    0 25298  24705 0x00000084
[11975464.447368] Call Trace:
[11975464.450484]  ? __schedule+0x28e/0x890
[11975464.454798]  schedule+0x28/0x80
[11975464.458611]  schedule_preempt_disabled+0xa/0x10
[11975464.463794]  __mutex_lock.isra.2+0x18c/0x4d0
[11975464.468969]  ? _nv009689rm+0xb0/0xf0 [nvidia]
[11975464.474000]  ? uvm_gpu_retain_by_uuid+0x19/0x40 [nvidia_uvm]
[11975464.480345]  uvm_gpu_retain_by_uuid+0x19/0x40 [nvidia_uvm]
[11975464.486513]  uvm_va_space_register_gpu+0x29/0x370 [nvidia_uvm]
[11975464.493020]  uvm_unlocked_ioctl+0x8da/0xdb0 [nvidia_uvm]
[11975464.498991]  ? filemap_map_pages+0x31f/0x340
[11975464.503910]  ? __handle_mm_fault+0xe2b/0x1290
[11975464.508947]  ? do_vfs_ioctl+0xa4/0x630
[11975464.513346]  do_vfs_ioctl+0xa4/0x630
[11975464.517589]  ? security_file_ioctl+0x44/0x60
[11975464.522525]  SyS_ioctl+0x74/0x80
[11975464.526412]  do_syscall_64+0x67/0x120
[11975464.530724]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[11975464.536446] RIP: 0033:0x7f2cec75df47
[11975464.540672] RSP: 002b:00007ffc9b3d8338 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[11975464.549211] RAX: ffffffffffffffda RBX: 00007f2bd7f7ec48 RCX: 00007f2cec75df47
[11975464.557372] RDX: 00007ffc9b3d8360 RSI: 0000000000000025 RDI: 000000000000001d
[11975464.565479] RBP: 0000000002fa4685 R08: 0000000000000081 R09: 00007f2bd7f7ec48
[11975464.573575] R10: 0000000000000000 R11: 0000000000000246 R12: 00007f2bd7f7ebc0
[11975464.581666] R13: 00007f2bd7fedce0 R14: 0000000000000001 R15: 00007f2bd7fefb70
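
A note on reading the trace above: the `D` in the `train_stereo.py D    0 25298  24705` line is the kernel's task state, which means uninterruptible sleep (here, waiting on a mutex inside the `nvidia_uvm` ioctl). A task in that state does not act on `SIGKILL` until it leaves the state, which may be why these processes look unkillable; a true zombie would show state `Z` instead. The following is a minimal sketch, not part of the original report, for listing which processes are in `D` or `Z` state by reading `/proc/<pid>/stat` on Linux. The function names `parse_stat` and `find_stuck` are hypothetical helpers, and the code assumes Python 3 rather than the Python 2.7 shown in the environment info.

```python
import os


def parse_stat(text):
    """Parse the contents of /proc/<pid>/stat into (pid, name, state)."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or ')', so split on the first '(' and the last ')'.
    left = text.index("(")
    right = text.rindex(")")
    pid = int(text[:left])
    name = text[left + 1:right]
    # The first field after the closing parenthesis is the state letter.
    state = text[right + 1:].split()[0]
    return pid, name, state


def find_stuck(proc_root="/proc", states=("D", "Z")):
    """Return (pid, name, state) for every process whose state is in `states`."""
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as /proc/meminfo
        try:
            with open(os.path.join(proc_root, entry, "stat")) as fh:
                pid, name, state = parse_stat(fh.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited during the scan, or stat was unreadable
        if state in states:
            found.append((pid, name, state))
    return found


if __name__ == "__main__":
    for pid, name, state in find_stuck():
        print(pid, name, state)
```

Processes reported with state `D` are blocked in the kernel and typically only go away once the blocking call returns; processes reported with state `Z` have already exited and are waiting for their parent to reap them.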

mxnet-label-bot commented 5 years ago

Hey, this is the MXNet Label Bot. Thank you for submitting the issue! I will try and suggest some labels so that the appropriate MXNet community members can help resolve it. Here are my recommended labels: Bug

vrakesh commented 5 years ago

@Hughen Thank you for sharing the issue. Could you provide more details and train.py so that we can look into reproducing it? @mxnet-label-bot add [Pending requester info]