Error importing DGL with TF backend

Rohanjames1997 commented 4 years ago

🐛 Bug

FileNotFoundError: dgl.dll, even though it exists in the said directory.

To Reproduce

Steps to reproduce the behavior:

Installed tensorflow 2.2 via pip inside a conda environment
Installed dgl via pip inside the same environment
Changed the backend of dgl to tensorflow (I don't think this has any bearing)

>>> import dgl
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\Rohan\anaconda3\envs\tf_dgl\lib\site-packages\dgl\__init__.py", line 8, in <module>
    from .backend import load_backend, backend_name
  File "C:\Users\Rohan\anaconda3\envs\tf_dgl\lib\site-packages\dgl\backend\__init__.py", line 74, in <module>
    load_backend(get_preferred_backend())
  File "C:\Users\Rohan\anaconda3\envs\tf_dgl\lib\site-packages\dgl\backend\__init__.py", line 23, in load_backend
    mod = importlib.import_module('.%s' % mod_name, __name__)
  File "C:\Users\Rohan\anaconda3\envs\tf_dgl\lib\importlib\__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "C:\Users\Rohan\anaconda3\envs\tf_dgl\lib\site-packages\dgl\backend\tensorflow\__init__.py", line 4, in <module>
    from .tensor import *
  File "C:\Users\Rohan\anaconda3\envs\tf_dgl\lib\site-packages\dgl\backend\tensorflow\tensor.py", line 12, in <module>
    from ... import ndarray as nd
  File "C:\Users\Rohan\anaconda3\envs\tf_dgl\lib\site-packages\dgl\ndarray.py", line 14, in <module>
    from ._ffi.object import register_object, ObjectBase
  File "C:\Users\Rohan\anaconda3\envs\tf_dgl\lib\site-packages\dgl\_ffi\object.py", line 8, in <module>
    from .object_generic import ObjectGeneric, convert_to_object
  File "C:\Users\Rohan\anaconda3\envs\tf_dgl\lib\site-packages\dgl\_ffi\object_generic.py", line 7, in <module>
    from .base import string_types
  File "C:\Users\Rohan\anaconda3\envs\tf_dgl\lib\site-packages\dgl\_ffi\base.py", line 42, in <module>
    _LIB, _LIB_NAME = _load_lib()
  File "C:\Users\Rohan\anaconda3\envs\tf_dgl\lib\site-packages\dgl\_ffi\base.py", line 34, in _load_lib
    lib = ctypes.CDLL(lib_path[0], ctypes.RTLD_GLOBAL)
  File "C:\Users\Rohan\anaconda3\envs\tf_dgl\lib\ctypes\__init__.py", line 373, in __init__
    self._handle = _dlopen(self._name, mode)
FileNotFoundError: Could not find module 'C:\Users\Rohan\anaconda3\envs\tf_dgl\lib\site-packages\dgl\dgl.dll' (or one of its dependencies). Try using the full path with constructor syntax.

Environment

DGL Version: 0.4.3 (latest)
Backend Library & Version: TensorFlow 2.2
OS: Windows 10
How you installed DGL: pip, inside a conda env
Python version: 3.8
CUDA: 10.1

Additional context

After checking the directory for the missing file, it was indeed there! But the error persisted. Conda and lower versions of Python do not support TF 2.2.

VoVAllen commented 4 years ago

Hi,

DGL support on Windows is currently a bit tricky. Please try the following steps:

Install tf-nightly instead of any other TensorFlow version, because the function we need is only available in the latest nightly build. (It should also be available in the official TensorFlow 2.2 release.)

Set the environment variable USE_OFFICIAL_TFDLPACK to true.

import os
# os.environ (not os.env) is the mapping of environment variables
os.environ['USE_OFFICIAL_TFDLPACK'] = "true"
# set the variable first, then import dgl or other code
import dgl

Rohanjames1997 commented 4 years ago

Hello, Thanks for your reply @VoVAllen .

Unfortunately, the error persists after installing tf-nightly too. Setting the environment variable USE_OFFICIAL_TFDLPACK to true did not make any difference. The error is still raised from line 12 in tensor.py: from ... import ndarray as nd. Would waiting for the official TF 2.2 release be helpful in this case?

Thank you.

VoVAllen commented 4 years ago

Could you post the detailed error? I tested it and it works on my side.

bhavaygg commented 3 years ago

@VoVAllen I am facing a similar error, and my TF version is 2.3.1. It was working earlier, but I then installed the CUDA version of dgl using conda install -c dglteam dgl-cuda11.0 and now I'm getting the following error.

from dgllife.model.model_zoo import GCNPredictor
  File "D:\Anaconda\envs\myenv\lib\site-packages\dgllife\__init__.py", line 9, in <module>
    from . import model
  File "D:\Anaconda\envs\myenv\lib\site-packages\dgllife\model\__init__.py", line 6, in <module>
    from .gnn import *
  File "D:\Anaconda\envs\myenv\lib\site-packages\dgllife\model\gnn\__init__.py", line 8, in <module>
    from .attentivefp import *
  File "D:\Anaconda\envs\myenv\lib\site-packages\dgllife\model\gnn\attentivefp.py", line 9, in <module>
    import dgl.function as fn
  File "D:\Anaconda\envs\myenv\lib\site-packages\dgl\__init__.py", line 14, in <module>
    from .backend import load_backend, backend_name
  File "D:\Anaconda\envs\myenv\lib\site-packages\dgl\backend\__init__.py", line 73, in <module>
    load_backend(get_preferred_backend())
  File "D:\Anaconda\envs\myenv\lib\site-packages\dgl\backend\__init__.py", line 23, in load_backend
    mod = importlib.import_module('.%s' % mod_name, __name__)
  File "D:\Anaconda\envs\myenv\lib\importlib\__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "D:\Anaconda\envs\myenv\lib\site-packages\dgl\backend\pytorch\__init__.py", line 1, in <module>
    from .tensor import *
  File "D:\Anaconda\envs\myenv\lib\site-packages\dgl\backend\pytorch\tensor.py", line 11, in <module>
    from ... import ndarray as nd
  File "D:\Anaconda\envs\myenv\lib\site-packages\dgl\ndarray.py", line 14, in <module>
    from ._ffi.object import register_object, ObjectBase
  File "D:\Anaconda\envs\myenv\lib\site-packages\dgl\_ffi\object.py", line 8, in <module>
    from .object_generic import ObjectGeneric, convert_to_object
  File "D:\Anaconda\envs\myenv\lib\site-packages\dgl\_ffi\object_generic.py", line 7, in <module>
    from .base import string_types
  File "D:\Anaconda\envs\myenv\lib\site-packages\dgl\_ffi\base.py", line 42, in <module>
    _LIB, _LIB_NAME = _load_lib()
  File "D:\Anaconda\envs\myenv\lib\site-packages\dgl\_ffi\base.py", line 34, in _load_lib
    lib = ctypes.CDLL(lib_path[0], ctypes.RTLD_GLOBAL)
  File "D:\Anaconda\envs\myenv\lib\ctypes\__init__.py", line 381, in __init__
    self._handle = _dlopen(self._name, mode)
FileNotFoundError: Could not find module 'D:\Anaconda\envs\myenv\lib\site-packages\dgl\dgl.dll' (or one of its dependencies). Try using the full path with constructor syntax.

BarclayII commented 3 years ago

Hi,

The reason may be that the file itself or one of the dependencies (e.g. CUDA 11.0 library) is missing. A likely case is that you did not install CUDA 11.0 through NVIDIA installer so the system cannot find it.

Could you check if the file D:\Anaconda\envs\myenv\lib\site-packages\dgl\dgl.dll exists? If so, could you check if the dependencies are fulfilled? You can drag the DLL file into Dependencies.exe and see if there is any question mark.
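The same check can be roughly scripted from Python. This is only a sketch: probe_dlls is a hypothetical helper name (not part of DGL), and it simply reports whether each path exists and whether ctypes.CDLL can load it inside the current interpreter.

```python
import ctypes
import os


def probe_dlls(paths):
    """Return {path: status}; status is 'missing', 'ok', or a load error message.

    Hypothetical diagnostic helper, not part of DGL.
    """
    report = {}
    for path in paths:
        # A bare name such as "cublas64_10.dll" is left to the OS loader's
        # search rules, so existence is only checked when a directory is given.
        if os.path.dirname(path) and not os.path.exists(path):
            report[path] = "missing"
            continue
        try:
            ctypes.CDLL(path)
            report[path] = "ok"
        except OSError as exc:
            # On Windows, FileNotFoundError (a subclass of OSError) is raised
            # when the DLL or one of its dependencies cannot be found.
            report[path] = "load failed: %s" % exc
    return report
```

Calling probe_dlls with the full path to dgl.dll plus the full paths of the CUDA DLLs it depends on shows which of them fail to load from Python itself, which may differ from what an external tool reports.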

bhavaygg commented 3 years ago

Hi, I had previously installed CUDA 11.2, and after installing 11.0 the issue persists. The D:\Anaconda\envs\myenv\lib\site-packages\dgl\dgl.dll file exists. Using Dependencies.exe I get: "We could not find api-ms-win-core-wow64-l1-1-0.dll file on the disk anymore."

bhavaygg commented 3 years ago

@BarclayII

BarclayII commented 3 years ago

Could you try installing Visual C++ 2017 redistributable? Also were you running Windows 10? @Chokerino

bhavaygg commented 3 years ago

@BarclayII Yes, I'm on Windows 10. I switched to CUDA 10.2 to run version 0.4.3 of dgl, and Dependencies now shows 3 files missing: cublas64_10.dll, cusparse64_10.dll, and api-ms-win-core-wow64-l1-1-0.dll as before. This is after installing the Visual C++ 2017 redistributable.

BarclayII commented 3 years ago

For CUDA 11.0 you might need Visual C++ 2019 redistributable.

bhavaygg commented 3 years ago

@BarclayII I have installed both; the redistributable files are the same. I have also tried to add the missing files manually. cublas64_10.dll and cusparse64_10.dll are now found, but api-ms-win-core-wow64-l1-1-0.dll still shows an error in Dependencies even after I put it in the System32 folder.

yutaoming commented 3 years ago

If someone encounters this problem, it may be that the version of cuda and the version of dgl do not match.

marijnvk commented 3 years ago

Unfortunately, I have run into the same problem. Perhaps these system settings can help pin down the problem:

Windows 10 CUDA 11.1 cuDNN 8

Python 3.9.5 TensorFlow 2.5.0 PyTorch 1.8.1 DGL 0.6.1 (cu111)

I have installed everything into a venv environment with pip (I'm not using conda). Importing and running PyTorch and TensorFlow by themselves works without a hitch, including GPU capability (my PATH is set up with both CUDA and cuDNN directories as required). Running DGL with DGLBACKEND=pytorch also appears to run smoothly (which I assume is using the GPU libraries that come bundled with PyTorch). However, when I set DGLBACKEND=tensorflow, the same error as in the original issue occurs. I checked the dependencies of dgl.dll with the Dependencies application, and verified that each of them could be loaded manually in Python (using ctypes.CDLL(...)). I've pasted the error I get below for completeness. As you can see, TensorFlow is able to open at least cudart64_110.dll, which is part of the CUDA 11.1 distribution.


(.venv) C:\Users\s092292\Desktop\rcpsp>set DGLBACKEND=tensorflow
(.venv) C:\Users\s092292\Desktop\rcpsp>python
Python 3.9.5 (tags/v3.9.5:0a7dcbd, May  3 2021, 17:27:52) [MSC v.1928 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import dgl
2021-06-03 20:13:17.729320: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library cudart64_110.dll
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\s092292\Desktop\rcpsp\.venv\lib\site-packages\dgl\__init__.py", line 13, in <module>
    from .backend import load_backend, backend_name
  File "C:\Users\s092292\Desktop\rcpsp\.venv\lib\site-packages\dgl\backend\__init__.py", line 95, in <module>
    load_backend(get_preferred_backend())
  File "C:\Users\s092292\Desktop\rcpsp\.venv\lib\site-packages\dgl\backend\__init__.py", line 41, in load_backend
    from .._ffi.base import load_tensor_adapter # imports DGL C library
  File "C:\Users\s092292\Desktop\rcpsp\.venv\lib\site-packages\dgl\_ffi\base.py", line 44, in <module>
    _LIB, _LIB_NAME, _DIR_NAME = _load_lib()
  File "C:\Users\s092292\Desktop\rcpsp\.venv\lib\site-packages\dgl\_ffi\base.py", line 34, in _load_lib
    lib = ctypes.CDLL(lib_path[0])
  File "C:\Users\s092292\AppData\Local\Programs\Python\Python39\lib\ctypes\__init__.py", line 374, in __init__
    self._handle = _dlopen(self._name, mode)
FileNotFoundError: Could not find module 'C:\Users\s092292\Desktop\rcpsp\.venv\lib\site-packages\dgl\dgl.dll' (or one of its dependencies). Try using the full path with constructor syntax.

VoVAllen commented 3 years ago

@marijnvk We are not sure about the root cause. This error usually occurs when a dependent library cannot be found. Some possible causes:

TF version incompatibility. Could you try an older TensorFlow such as 2.3 or 2.2?
CUDA version incompatibility, i.e. TF depends on a different CUDA version than PyTorch does.
Could you try to verify those?

marijnvk commented 3 years ago

PyTorch and TensorFlow work fine when run by themselves, so I doubt it is a CUDA version incompatibility issue in those libraries. It's a bit of a hassle, but I'll try downgrading TensorFlow to 2.3 (which will require downgrading CUDA to 10 and cuDNN to 7) and see if that works.

Do you perhaps have the list of DLLs that dgl-cu111 is supposed to load directly (i.e. not through PyTorch/TensorFlow)? I have no idea if the Dependencies application picks up everything.

VoVAllen commented 3 years ago

@marijnvk TensorFlow has its own dynamic library loading system, which may prevent DGL from finding the related libraries on Windows. However, I'm not sure about this. Could you try importing tensorflow before importing dgl?

marijnvk commented 3 years ago

That's much quicker to check. The result is unfortunately the same: importing TensorFlow first and then DGL still produces the same error.

marijnvk commented 3 years ago

Okay, so I've figured out what is going wrong here. It actually doesn't have anything to do with TensorFlow, CUDA, cuDNN, or version mismatches at all. This is caused by changes introduced in Python 3.8 (I'm on 3.9.5). The core of the issue is a change in the directories that Python searches by default when looking for DLLs. Notably, starting from this version of Python, the PATH environment variable is no longer searched by default (the same goes for the current working directory, by the way). A new function, os.add_dll_directory, is provided to securely add directories to the list that is searched for DLLs. Hacking the following into the start of the _load_lib function of _ffi\base.py fixed it for me:

os.add_dll_directory("C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v11.1\\bin")
os.add_dll_directory("C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v11.1\\libnvvp")
os.add_dll_directory("C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v11.1\\extras\\CUPTI\\lib64")
os.add_dll_directory("C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v11.1\\include")
os.add_dll_directory("C:\\tools\\cuda\\bin") # cuDNN

It probably doesn't need all of these, but I just slapped in all related directories that were in my PATH. I'm assuming that TensorFlow accounts for this change in Python functionality, hence the message before the error saying that cudart64_110.dll was opened. But when it's DGL's turn, it doesn't consider the directories in PATH, meaning it cannot find CUDA and cuDNN. Note that this is also why this was not picked up by the Dependencies application, since that one does appear to check the PATH. This also explains the originally reported issue, which was on Python 3.8.

Probably the clean way to do this is to loop over the directories in PATH when DGL is first loaded, and add relevant directories with that new function one by one. Or introduce a new environment variable to set these directories (or one that informs DGL whether or not to use PATH).
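A minimal sketch of the PATH-looping idea, under some assumptions: the two function names are hypothetical (nothing here is DGL's actual code), and "relevant" is approximated by a keyword filter on the directory name, which is my guess at a reasonable heuristic rather than an established rule.

```python
import os


def dll_directories_from_path(path_value, keywords=("cuda", "cudnn")):
    """Pick existing directories from a PATH-style string that match a keyword.

    The keyword filter is an assumed heuristic for "relevant" directories.
    """
    selected = []
    for entry in path_value.split(os.pathsep):
        entry = entry.strip().strip('"')
        if not entry or not os.path.isdir(entry):
            continue  # skip empty entries and directories that do not exist
        if any(k in entry.lower() for k in keywords):
            absolute = os.path.abspath(entry)  # add_dll_directory needs an absolute path
            if absolute not in selected:
                selected.append(absolute)
    return selected


def register_dll_directories(directories):
    """Register directories via os.add_dll_directory where it exists."""
    if not hasattr(os, "add_dll_directory"):
        return []  # not Windows, or Python older than 3.8: nothing to do
    # Each call returns a handle object; keep them if you want to undo later.
    return [os.add_dll_directory(d) for d in directories]


# Intended use, before the first `import dgl`:
# handles = register_dll_directories(
#     dll_directories_from_path(os.environ.get("PATH", "")))
```

Filtering keeps the added search path small, which is in the spirit of the Python 3.8 change; passing different keywords, or reading the directory list from a dedicated environment variable instead of PATH, would be straightforward variations.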

VoVAllen commented 3 years ago

@marijnvk Thanks for your detailed investigation! We'll check how other frameworks handle this to find a better solution

github-actions[bot] commented 2 years ago

This issue has been automatically marked as stale due to lack of activity. It will be closed if no further activity occurs. Thank you

github-actions[bot] commented 2 years ago

This issue is closed due to lack of activity. Feel free to reopen it if you still have questions.
