ROCm / pytorch

Tensors and Dynamic neural networks in Python with strong GPU acceleration
http://pytorch.org
Other
219 stars 51 forks source link

Pytorch import is extremely slow #1232

Open Delaunay opened 1 year ago

Delaunay commented 1 year ago

🐛 Describe the bug

$ time python -c "import torch.cuda; print(torch.cuda.is_available())"
True

real    3m36.487s
user    0m4.613s
sys     0m34.452s

Versions

Collecting environment information...

PyTorch version: 2.0.0+rocm5.4.2
Is debug build: False
CUDA used to build PyTorch: N/A
ROCM used to build PyTorch: 5.4.22803-474e8620

OS: Ubuntu 20.04 LTS (x86_64)
GCC version: (GCC) 7.4.0
Clang version: Could not collect
CMake version: version 3.26.1
Libc version: glibc-2.31

Python version: 3.9.16 (main, Jan 11 2023, 16:05:54)  [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-5.15.0-71-generic-x86_64-with-glibc2.31
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: AMD Instinct MI250X/MI250
Nvidia driver version: Could not collect
cuDNN version: Could not collect
HIP runtime version: 5.4.22803
MIOpen runtime version: 2.19.0
Is XNNPACK available: True

CPU:
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   48 bits physical, 48 bits virtual
CPU(s):                          256
On-line CPU(s) list:             0-255
Thread(s) per core:              2
Core(s) per socket:              64
Socket(s):                       2
NUMA node(s):                    2
Vendor ID:                       AuthenticAMD
CPU family:                      25
Model:                           1
Model name:                      AMD EPYC 7713 64-Core Processor
Stepping:                        1
Frequency boost:                 enabled
CPU MHz:                         2553.233
CPU max MHz:                     3720.7029
CPU min MHz:                     1500.0000
BogoMIPS:                        3992.55
Virtualization:                  AMD-V
L1d cache:                       4 MiB
L1i cache:                       4 MiB
L2 cache:                        64 MiB
L3 cache:                        512 MiB
NUMA node0 CPU(s):               0-63,128-191
NUMA node1 CPU(s):               64-127,192-255
Vulnerability Itlb multihit:     Not affected
Vulnerability L1tf:              Not affected
Vulnerability Mds:               Not affected
Vulnerability Meltdown:          Not affected
Vulnerability Mmio stale data:   Not affected
Vulnerability Retbleed:          Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:        Mitigation; Retpolines, IBPB conditional, IBRS_FW, STIBP always-on, RSB filling, PBRSB-eIBRS Not affected
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Not affected
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 invpcid_single hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr rdpru wbnoinvd amd_ppin arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold v_vmsave_vmload vgif v_spec_ctrl umip pku ospke vaes vpclmulqdq rdpid overflow_recov succor smca fsrm

Versions of relevant libraries:
[pip3] mypy-extensions==1.0.0
[pip3] numpy==1.24.2
[pip3] pytorch-triton-rocm==2.0.1
[pip3] torch==2.0.0+rocm5.4.2
[pip3] torchaudio==2.0.1+rocm5.4.2
[pip3] torchmetrics==0.11.4
[pip3] torchrec==0.4.0
[pip3] torchvision==0.15.1+rocm5.4.2
[pip3] torchviz==0.0.2
[pip3] torchx==0.5.0
[conda] numpy                     1.24.2                   pypi_0    pypi
[conda] torch                     2.0.0+cu118              pypi_0    pypi
[conda] torchaudio                2.0.1+cu118              pypi_0    pypi
[conda] torchmetrics              0.11.4                   pypi_0    pypi
[conda] torchrec                  0.4.0                    pypi_0    pypi
[conda] torchvision               0.15.1+cu118             pypi_0    pypi
[conda] torchviz                  0.0.2                    pypi_0    pypi
[conda] torchx                    0.5.0                    pypi_0    pypi
[conda] triton                    2.0.0                    pypi_0    pypi
Delaunay commented 1 year ago
python
Python 3.9.16 (main, Jan 11 2023, 16:05:54) 
[GCC 11.2.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import cProfile
>>> import torch.cuda
c>>> cProfile.run('torch.cuda.is_available()')
         14 function calls in 236.054 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000  236.054  236.054 <string>:1(<module>)
        1    0.000    0.000    0.000    0.000 __init__.py:87(_is_compiled)
        1    0.000    0.000    0.000    0.000 __init__.py:91(_nvml_based_avail)
        1    0.000    0.000  236.054  236.054 __init__.py:94(is_available)
        1    0.000    0.000    0.000    0.000 _collections_abc.py:760(get)
        1    0.000    0.000    0.000    0.000 os.py:674(__getitem__)
        1    0.000    0.000    0.000    0.000 os.py:754(encode)
        1    0.000    0.000    0.000    0.000 os.py:771(getenv)
        1    0.000    0.000  236.054  236.054 {built-in method builtins.exec}
        1    0.000    0.000    0.000    0.000 {built-in method builtins.hasattr}
        1    0.000    0.000    0.000    0.000 {built-in method builtins.isinstance}
        1  236.054  236.054  236.054  236.054 {built-in method torch._C._cuda_getDeviceCount}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
        1    0.000    0.000    0.000    0.000 {method 'encode' of 'str' objects}
Delaunay commented 1 year ago

Noticed the code was traversing a bunch a directory, put a breakpoint on __NR_newfstatat 262

gdb --args  python -c "import torch.cuda; print(torch.cuda.is_available())"

Thread 1 "python" hit Catchpoint 1 (call to syscall newfstatat), __GI___fxstatat (vers=vers@entry=1, fd=15, 
    file=file@entry=0x6bbb083 "LICENSE.txt", st=st@entry=0x7fffffff89b0, flag=256)
    at ../sysdeps/unix/sysv/linux/wordsize-64/fxstatat.c:43
43      ../sysdeps/unix/sysv/linux/wordsize-64/fxstatat.c: No such file or directory.
(gdb) bt
#0  __GI___fxstatat (vers=vers@entry=1, fd=15, file=file@entry=0x6bbb083 "LICENSE.txt", 
    st=st@entry=0x7fffffff89b0, flag=256) at ../sysdeps/unix/sysv/linux/wordsize-64/fxstatat.c:43
#1  0x00007ffff7d58ee4 in process_entry (data=data@entry=0x7fffffff8b50, dir=dir@entry=0x7fffffff8ab0, 
    name=name@entry=0x6bbb083 "LICENSE.txt", namlen=<optimized out>, d_type=<optimized out>)
    at ../sysdeps/wordsize-64/../../io/ftw.c:408
#2  0x00007ffff7d592e3 in ftw_dir (data=data@entry=0x7fffffff8b50, st=0x7fffffff8bb0, old_dir=old_dir@entry=0x0)
    at ../sysdeps/wordsize-64/../../io/ftw.c:545
#3  0x00007ffff7d59951 in ftw_startup (
    dir=dir@entry=0x7fffffff8cd0 "/network/scratch/d/delaunap/milabench/conda", is_nftw=is_nftw@entry=1, 
    func=func@entry=0x7fff3b37b510 <check_for_location_of_amdgpuids>, descriptors=descriptors@entry=5, 
    flags=flags@entry=1) at ../sysdeps/wordsize-64/../../io/ftw.c:770
#4  0x00007ffff7d59d9e in __new_nftw (
    path=path@entry=0x7fffffff8cd0 "/network/scratch/d/delaunap/milabench/conda", 
    func=func@entry=0x7fff3b37b510 <check_for_location_of_amdgpuids>, descriptors=descriptors@entry=5, 
    flags=flags@entry=1) at ../sysdeps/wordsize-64/../../io/ftw.c:843
#5  0x00007fff3b37b5ee in amdgpu_parse_asic_ids (dev=dev@entry=0x6baf710) at ../amdgpu/amdgpu_asic_id.c:159
#6  0x00007fff3b37e3c7 in amdgpu_device_initialize (fd=<optimized out>, major_version=0x7fffffff9da8, 
    minor_version=0x7fffffff9dac, device_handle=0x7fffffff9db8) at ../amdgpu/amdgpu_device.c:263
#7  0x00007fff631e008b in topology_sysfs_get_node_props ()
   from /tmp/milabench/milabench_dev/result2/venv/torch/lib/python3.9/site-packages/torch/lib/libhsa-runtime64.so
#8  0x00007fff631e1c34 in topology_take_snapshot ()
   from /tmp/milabench/milabench_dev/result2/venv/torch/lib/python3.9/site-packages/torch/lib/libhsa-runtime64.so
#9  0x00007fff631e296d in hsaKmtAcquireSystemProperties ()
   from /tmp/milabench/milabench_dev/result2/venv/torch/lib/python3.9/site-packages/torch/lib/libhsa-runtime64.so
#10 0x00007fff6313c836 in rocr::AMD::BuildTopology() ()
   from /tmp/milabench/milabench_dev/result2/venv/torch/lib/python3.9/site-packages/torch/lib/libhsa-runtime64.so
#11 0x00007fff6313cdc5 in rocr::AMD::Load() ()
--Type <RET> for more, q to quit, c to continue without paging--
  ite-packages/torch/lib/libhsa-runtime64.so
#12 0x00007fff6315e6a9 in rocr::core::Runtime::Load() ()
   from /tmp/milabench/milabench_dev/result2/venv/torch/lib/python3.9/site-packages/torch/lib/libhsa-runtime64.so
#13 0x00007fff6315e875 in rocr::core::Runtime::Acquire() ()
   from /tmp/milabench/milabench_dev/result2/venv/torch/lib/python3.9/site-packages/torch/lib/libhsa-runtime64.so
#14 0x00007fff6313dd0a in rocr::HSA::hsa_init() ()
   from /tmp/milabench/milabench_dev/result2/venv/torch/lib/python3.9/site-packages/torch/lib/libhsa-runtime64.so
#15 0x00007fffaee89365 in roc::Device::init() () from /tmp/milabench/milabench_dev/result2/venv/torch/lib/python3.9/site-packages/torch/lib/libamdhip64.so
#16 0x00007fffaee47a5e in amd::Device::init() () from /tmp/milabench/milabench_dev/result2/venv/torch/lib/python3.9/site-packages/torch/lib/libamdhip64.so
#17 0x00007fffaee7ccf6 in amd::Runtime::init() () from /tmp/milabench/milabench_dev/result2/venv/torch/lib/python3.9/site-packages/torch/lib/libamdhip64.so
#18 0x00007fffaec413af in hip::init() () from /tmp/milabench/milabench_dev/result2/venv/torch/lib/python3.9/site-packages/torch/lib/libamdhip64.so
#19 0x00007fffaec5f985 in hipGetDeviceCount () from /tmp/milabench/milabench_dev/result2/venv/torch/lib/python3.9/site-packages/torch/lib/libamdhip64.so
#20 0x00007fffb0457c02 in c10::hip::device_count() () from /tmp/milabench/milabench_dev/result2/venv/torch/lib/python3.9/site-packages/torch/lib/libc10_hip.so
#21 0x00007ffff66e4953 in THCPModule_getDeviceCount_wrap(_object*, _object*) ()
   from /tmp/milabench/milabench_dev/result2/venv/torch/lib/python3.9/site-packages/torch/lib/libtorch_python.so
#22 0x00000000004f5c89 in cfunction_vectorcall_NOARGS (func=0x7ffff71b95e0, args=<optimized out>, nargsf=<optimized out>, kwnames=<optimized out>)
    at /usr/local/src/conda/python-3.9.16/Objects/methodobject.c:489
#23 0x00000000004ec6f4 in _PyObject_VectorcallTstate (kwnames=0x0, nargsf=<optimized out>, args=0x7ffff74e4938, callable=0x7ffff71b95e0, tstate=0x76f9b0)
    at /usr/local/src/conda/python-3.9.16/Include/cpython/abstract.h:118
#24 PyObject_Vectorcall (kwnames=0x0, nargsf=<optimized out>, args=0x7ffff74e4938, callable=0x7ffff71b95e0)
    at /usr/local/src/conda/python-3.9.16/Include/cpython/abstract.h:127
#25 call_function (kwnames=0x0, oparg=<error reading variable: dwarf2_find_location_expression: Corrupted DWARF expression.>, pp_stack=<synthetic pointer>, 
    tstate=0x76f9b0) at /usr/local/src/conda/python-3.9.16/Python/ceval.c:5077
#26 _PyEval_EvalFrameDefault (tstate=<optimized out>, f=0x7ffff74e47c0, throwflag=<optimized out>) at /usr/local/src/conda/python-3.9.16/Python/ceval.c:3489
#27 0x00000000004f80b3 in _PyEval_EvalFrame (throwflag=<error reading variable: dwarf2_find_location_expression: Corrupted DWARF expression.>, 
    f=<error reading variable: dwarf2_find_location_expression: Corrupted DWARF expression.>, 
    tstate=<error reading variable: dwarf2_find_location_expression: Corrupted DWARF expression.>)
    at /usr/local/src/conda/python-3.9.16/Include/internal/pycore_ceval.h:40
#28 function_code_fastcall (tstate=0x76f9b0, co=<optimized out>, args=<optimized out>, nargs=<optimized out>, globals=0x7ffda54a1a40)
    at /usr/local/src/conda/python-3.9.16/Objects/call.c:330
#29 0x00000000004ec6f4 in _PyObject_VectorcallTstate (kwnames=0x0, nargsf=<optimized out>, args=0x7ffff755b760, callable=0x7ffda5472040, tstate=0x76f9b0)
    at /usr/local/src/conda/python-3.9.16/Include/cpython/abstract.h:118
#30 PyObject_Vectorcall (kwnames=0x0, nargsf=<optimized out>, args=0x7ffff755b760, callable=0x7ffda5472040)
    at /usr/local/src/conda/python-3.9.16/Include/cpython/abstract.h:127
#31 call_function (kwnames=0x0, oparg=<error reading variable: dwarf2_find_location_expression: Corrupted DWARF expression.>, pp_stack=<synthetic pointer>, 
    tstate=0x76f9b0) at /usr/local/src/conda/python-3.9.16/Python/ceval.c:5077
#32 _PyEval_EvalFrameDefault (tstate=<optimized out>, f=0x7ffff755b5e0, throwflag=<optimized out>) at /usr/local/src/conda/python-3.9.16/Python/ceval.c:3489
#33 0x00000000004e6aba in _PyEval_EvalFrame (throwflag=<error reading variable: dwarf2_find_location_expression: Corrupted DWARF expression.>, 
    f=<error reading variable: dwarf2_find_location_expression: Corrupted DWARF expression.>, 
    tstate=<error reading variable: dwarf2_find_location_expression: Corrupted DWARF expression.>)
    at /usr/local/src/conda/python-3.9.16/Include/internal/pycore_ceval.h:40
#34 _PyEval_EvalCode (tstate=<optimized out>, _co=<optimized out>, globals=<optimized out>, locals=<optimized out>, args=<optimized out>, 
    argcount=<optimized out>, kwnames=0x0, kwargs=0x0, kwcount=<optimized out>, kwstep=2, defs=0x0, defcount=<optimized out>, kwdefs=0x0, closure=0x0, 
    name=0x0, qualname=0x0) at /usr/local/src/conda/python-3.9.16/Python/ceval.c:4329
#35 0x00000000004e6747 in _PyEval_EvalCodeWithName (_co=<optimized out>, globals=<optimized out>, locals=<optimized out>, args=<optimized out>, 
    argcount=<optimized out>, kwnames=<optimized out>, kwargs=0x0, kwcount=0, kwstep=2, defs=0x0, defcount=0, kwdefs=0x0, closure=0x0, name=0x0, qualname=0x0)
    at /usr/local/src/conda/python-3.9.16/Python/ceval.c:4361
#36 0x00000000004e66f9 in PyEval_EvalCodeEx (_co=<optimized out>, globals=<optimized out>, locals=<optimized out>, args=<optimized out>, 
    argcount=<optimized out>, kws=<optimized out>, kwcount=0, defs=0x0, defcount=0, kwdefs=0x0, closure=0x0)
    at /usr/local/src/conda/python-3.9.16/Python/ceval.c:4377
--Type <RET> for more, q to quit, c to continue without paging--
#37 0x00000000005932fb in PyEval_EvalCode (co=co@entry=0x7ffff7594b30, globals=globals@entry=0x7ffff758b780, locals=locals@entry=0x7ffff758b780)
    at /usr/local/src/conda/python-3.9.16/Python/ceval.c:828
#38 0x00000000005c0b77 in run_eval_code_obj (tstate=0x76f9b0, co=0x7ffff7594b30, globals=0x7ffff758b780, locals=0x7ffff758b780)
    at /usr/local/src/conda/python-3.9.16/Python/pythonrun.c:1221
#39 0x00000000005bcb90 in run_mod (mod=<optimized out>, filename=<optimized out>, globals=0x7ffff758b780, locals=0x7ffff758b780, flags=<optimized out>, 
    arena=<optimized out>) at /usr/local/src/conda/python-3.9.16/Python/pythonrun.c:1242
#40 0x00000000005b47ab in PyRun_StringFlags (str=<optimized out>, start=257, globals=0x7ffff758b780, locals=0x7ffff758b780, flags=0x7fffffffb368)
    at /usr/local/src/conda/python-3.9.16/Python/pythonrun.c:1108
#41 0x00000000005b448b in PyRun_SimpleStringFlags (command=0x7ffff7672fb0 "import torch.cuda; print(torch.cuda.is_available())\n", flags=0x7fffffffb368)
    at /usr/local/src/conda/python-3.9.16/Python/pythonrun.c:497
#42 0x00000000005b3ccc in pymain_run_command (cf=<error reading variable: dwarf2_find_location_expression: Corrupted DWARF expression.>, 
    command=<error reading variable: dwarf2_find_location_expression: Corrupted DWARF expression.>) at /usr/local/src/conda/python-3.9.16/Modules/main.c:252
#43 pymain_run_python (exitcode=0x7fffffffb360, exitcode@entry=<error reading variable: dwarf2_find_location_expression: Corrupted DWARF expression.>)
    at /usr/local/src/conda/python-3.9.16/Modules/main.c:595
#44 Py_RunMain () at /usr/local/src/conda/python-3.9.16/Modules/main.c:683
#45 0x00000000005873a9 in Py_BytesMain (argc=<optimized out>, argv=<optimized out>) at /usr/local/src/conda/python-3.9.16/Modules/main.c:1129
#46 0x00007ffff7c6d083 in __libc_start_main (main=0x587360 <main>, argc=3, argv=0x7fffffffb578, init=<optimized out>, fini=<optimized out>, 
    rtld_fini=<optimized out>, stack_end=0x7fffffffb568) at ../csu/libc-start.c:308
#47 0x000000000058725e in _start () at /usr/local/src/conda/python-3.9.16/Python/ast.c:5313
Delaunay commented 1 year ago

So that code is causing the slowness https://github.com/pytorch/builder/blob/main/common/install_rocm_drm.sh#L100 For ref the amgpu.ids is located in cat /opt/amdgpu/share/libdrm/amdgpu.ids and this code is traversing the entire python installation to find the file.

Delaunay commented 1 year ago

Additionally that file is just a mapping between the product ID and the marketing ID. The feature should fail fast

Delaunay commented 1 year ago

creating the file right were the function starts to look does not prevent the directory walk.

danny305 commented 1 year ago

Just started using PyTorch-rocm. And import PyTorch is indeed very slow.