Open zhixiaochuan12 opened 4 years ago
This line is part of the cuRAND initialization. You should switch to CUDA 10.0.
Sorry, this happens because the FAISS library that GraphVite requires doesn't compile well on CUDA 10.1. It may be fixed at some point later.
You may try conda install -c milagraph -c conda-forge graphvite cudatoolkit=10.0
Thanks for your reply! @KiddoZhu
I switched to the ubuntu18-cudnn7-cuda10.0-python3.7 Docker image by running
docker pull nvidia/cuda:10.0-cudnn7-devel-ubuntu18.04
and installing conda 4.8.3, but it did not work either.
conda list graphvite
# packages in environment at /usr/local/anaconda3:
#
# Name Version Build Channel
graphvite 0.2.2 py37cuda100hd3e7edd milagraph
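The build string in the listing above appears to encode the Python and CUDA versions the package was built for. A minimal sketch for pulling them out, assuming the pyXYcudaABC layout seen here; the helper name parse_build is hypothetical, and the layout is inferred from this one build string rather than from a documented convention:

```python
import re


def parse_build(build):
    """Return ((py_major, py_minor), (cuda_major, cuda_minor)), or None.

    Assumption: the build string looks like 'py37cuda100h...', with Python
    written as one major digit followed by the minor version, and CUDA
    written as the major digits followed by a single minor digit.
    """
    match = re.search(r"py(\d)(\d+)cuda(\d+)(\d)", build)
    if match is None:
        return None
    py_major, py_minor, cuda_major, cuda_minor = map(int, match.groups())
    return (py_major, py_minor), (cuda_major, cuda_minor)


# Build string copied from the conda list output above.
print(parse_build("py37cuda100hd3e7edd"))  # ((3, 7), (10, 0))
```

If that reading is right, the installed package targets Python 3.7 and CUDA 10.0, which is what the suggested conda command asks for.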
The error is nearly the same after running graphvite baseline quick start
running baseline: demo/quick_start.yaml
downloading https://www.dropbox.com/s/cf21ouuzd563cqx/BlogCatalog-dataset.zip?dl=1 to BlogCatalog-dataset.zip
extracting BlogCatalog-dataset/data/edges.csv from BlogCatalog-dataset.zip to edges.csv
converting edges.csv to blogcatalog_graph.txt
splitting graph blogcatalog_graph.txt into blogcatalog_train.txt, blogcatalog_valid.txt, blogcatalog_test.txt
extracting BlogCatalog-dataset/data/group-edges.csv from BlogCatalog-dataset.zip to group-edges.csv
converting group-edges.csv to blogcatalog_label.txt
loading graph from /root/.graphvite/dataset/blogcatalog/blogcatalog_train.txt
0.00018755%
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
Graph<uint32>
------------------ Graph -------------------
#vertex: 10308, #edge: 327429
as undirected: yes, normalization: no
[time] GraphApplication.load: 0.104428 s
CPU threads is beyond the hardware concurrency
Check failed: error == cudaSuccess CUDA error CUDA driver version is insufficient for CUDA runtime version at /network/home/zhuzhaoc/.local/envs/build/conda-bld/graphvite_1584598935508/work/include/core/solver.h:203
Check failure stack trace:
@ 0x7f9f72dbb4dd google::LogMessage::Fail()
@ 0x7f9f72dc3071 google::LogMessage::SendToLog()
@ 0x7f9f72dbaecd google::LogMessage::Flush()
@ 0x7f9f72dbc76a google::LogMessageFatal::~LogMessageFatal()
@ 0x7f9f76f91fdf graphvite::CudaCheck()
@ 0x7f9f770f4d64 graphvite::SolverMixin<>::SolverMixin()
@ 0x7f9f7713af1d _ZZN8pybind1112cpp_function10initializeIZNS_6detail8initimpl11constructorIJSt6vectorIiSaIiEEimEE7executeINS_6class_IN9graphvite11GraphSolverILm128EfjEEJEEEJNS_10call_guardIJNS_18gil_scoped_releaseEEEENS_5arg_vESI_SI_ELi0EEEvRT_DpRKT0_EUlRNS2_16value_and_holderES7_imE_vJSQ_S7_imEJNS_4nameENS_9is_methodENS_7siblingENS2_24is_new_style_constructorESH_SI_SI_SI_EEEvOSJ_PFT0_DpT1_EDpRKT2_ENUlRNS2_13function_callEE1_4FUNES17
@ 0x7f9f7704e529 pybind11::cpp_function::dispatcher()
@ 0x55c62ccb9a55 _PyMethodDef_RawFastCallDict
@ 0x55c62ccb9bd1 _PyCFunction_FastCallDict
@ 0x55c62ccb7943 _PyObject_Call_Prepend
@ 0x55c62ccaab9e PyObject_Call
@ 0x55c62cc37726 slot_tp_init.cold.2021
@ 0x55c62ccf6bc7 type_call
@ 0x55c62ccaab9e PyObject_Call
@ 0x55c62cd439f5 _PyEval_EvalFrameDefault
@ 0x55c62cc9b6f9 _PyEval_EvalCodeWithName
@ 0x55c62cc9c805 _PyFunction_FastCallDict
@ 0x55c62ccb79be _PyObject_Call_Prepend
@ 0x55c62ccf938e slot_tp_new
@ 0x55c62ccf6dc9 _PyObject_FastCallKeywords
@ 0x55c62cd46e8f _PyEval_EvalFrameDefault
@ 0x55c62cc9b6f9 _PyEval_EvalCodeWithName
@ 0x55c62cc9ca30 _PyFunction_FastCallDict
@ 0x55c62ccb7943 _PyObject_Call_Prepend
@ 0x55c62ccaab9e PyObject_Call
@ 0x55c62cd439f5 _PyEval_EvalFrameDefault
@ 0x55c62cc9b6f9 _PyEval_EvalCodeWithName
@ 0x55c62cc9ca30 _PyFunction_FastCallDict
@ 0x55c62cd439f5 _PyEval_EvalFrameDefault
@ 0x55c62cc9c030 _PyEval_EvalCodeWithName
@ 0x55c62cc9ca30 _PyFunction_FastCallDict
Aborted
Could you check both of these?
nvcc -V: the CUDA version in the docker
ldd libgraphvite.so | grep cuda: the CUDA libraries GraphVite is linked to
libgraphvite.so should be somewhere under /lib/python3.7/site-packages/graphvite/
Sure.
nvcc -V
-->
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2018 NVIDIA Corporation
Built on Sat_Aug_25_21:08:01_CDT_2018
Cuda compilation tools, release 10.0, V10.0.130
libgraphvite.so is in the directory /usr/local/anaconda3/lib/python3.7/site-packages/graphvite/lib.
ldd libgraphvite.so | grep cuda
gave the output:
libcudart.so.10.0 => /usr/local/anaconda3/lib/python3.7/site-packages/graphvite/lib/./../../../../libcudart.so.10.0 (0x00007fcc65a66000)
After going into the directory /usr/local/anaconda3/lib/python3.7/site-packages/graphvite/lib/./../../../../, I ran ls | grep cuda
-->
cudatoolkit_config.yaml
libcudart.so
libcudart.so.10.0
libcudart.so.10.0.130
libicudata.a
libicudata.so
libicudata.so.58
libicudata.so.58.2
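The two checks requested above can be compared mechanically. A minimal sketch that extracts the toolkit release from the nvcc -V text and the runtime version from the ldd text posted in this thread; the function names nvcc_release and linked_cudart are hypothetical, and the sample strings are copied from the outputs above:

```python
import re


def nvcc_release(nvcc_output):
    """Return the (major, minor) toolkit release found in `nvcc -V` text, or None."""
    match = re.search(r"release (\d+)\.(\d+)", nvcc_output)
    return (int(match.group(1)), int(match.group(2))) if match else None


def linked_cudart(ldd_output):
    """Return the (major, minor) of libcudart.so.X.Y found in `ldd` text, or None."""
    match = re.search(r"libcudart\.so\.(\d+)\.(\d+)", ldd_output)
    return (int(match.group(1)), int(match.group(2))) if match else None


# Sample text copied from the outputs posted above.
NVCC_TEXT = "Cuda compilation tools, release 10.0, V10.0.130"
LDD_TEXT = (
    "libcudart.so.10.0 => /usr/local/anaconda3/lib/python3.7/site-packages/"
    "graphvite/lib/./../../../../libcudart.so.10.0 (0x00007fcc65a66000)"
)

toolkit = nvcc_release(NVCC_TEXT)
runtime = linked_cudart(LDD_TEXT)
print("toolkit:", toolkit, "runtime:", runtime, "match:", toolkit == runtime)
```

Both come out as 10.0 here, so the toolkit in the image and the runtime GraphVite links against agree. That would leave the host NVIDIA driver as the next thing to check, which is an inference from the "driver version is insufficient" message rather than something confirmed in this thread.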
@KiddoZhu hi, do you have any other suggestions?
I also get an abort, but only after the training has finished.
$ graphvite baseline quick start
running baseline: demo/quick_start.yaml
loading graph from /home/ben/.graphvite/dataset/blogcatalog/blogcatalog_train.txt
0.00018755%
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
Graph<uint32>
------------------ Graph -------------------
#vertex: 10308, #edge: 327429
as undirected: yes, normalization: no
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
[time] GraphApplication.load: 0.0789478 s
[time] GraphApplication.build: 0.665795 s
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
GraphSolver<128, float32, uint32>
----------------- Resource -----------------
#worker: 1, #sampler: 7, #partition: 1
tied weights: no, episode size: 500
gpu memory limit: 23.3 GiB
gpu memory cost: 51.5 MiB
----------------- Sampling -----------------
augmentation step: 2, shuffle base: 2
random walk length: 40
random walk batch size: 100
#negative: 1, negative sample exponent: 0.75
----------------- Training -----------------
model: LINE
optimizer: SGD
learning rate: 0.025, lr schedule: linear
weight decay: 0.005
#epoch: 2000, batch size: 100000
resume: no
positive reuse: 1, negative weight: 5
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
Batch id: 0 / 6548
loss = 0
Batch id: 1000 / 6548
loss = 0.387679
Batch id: 2000 / 6548
loss = 0.38329
Batch id: 3000 / 6548
loss = 0.379579
Batch id: 4000 / 6548
loss = 0.37587
Batch id: 5000 / 6548
loss = 0.372847
Batch id: 6000 / 6548
loss = 0.371051
[time] GraphApplication.train: 7.58616 s
------------- link prediction --------------
terminate called after throwing an instance of 'std::runtime_error'
what(): generic_type: cannot initialize type "WorkerId": an object with that name is already defined
Aborted (core dumped)
I think this is a fantastic library, but I burned through some time already trying to set up graphvite with different settings. A minimal example of how to make this run would really go a long way. How do I make this work under docker? A poster above suggested something along the lines of
docker pull nvidia/cuda:10.0-cudnn7-devel-ubuntu18.04
...but that didn't work. Could you suggest a different image instead?
I tried again after removing pytorch and reinstalling (instead of just -U upgrading). Now I am getting something more promising, but still not there.
$ graphvite baseline quick start
running baseline: demo/quick_start.yaml
loading graph from /home/ben/.graphvite/dataset/blogcatalog/blogcatalog_train.txt
0.00018755%
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
Graph<uint32>
------------------ Graph -------------------
#vertex: 10308, #edge: 327429
as undirected: yes, normalization: no
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
[time] GraphApplication.load: 0.0757065 s
[time] GraphApplication.build: 0.659924 s
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
GraphSolver<128, float32, uint32>
----------------- Resource -----------------
#worker: 1, #sampler: 7, #partition: 1
tied weights: no, episode size: 500
gpu memory limit: 23.3 GiB
gpu memory cost: 51.5 MiB
----------------- Sampling -----------------
augmentation step: 2, shuffle base: 2
random walk length: 40
random walk batch size: 100
#negative: 1, negative sample exponent: 0.75
----------------- Training -----------------
model: LINE
optimizer: SGD
learning rate: 0.025, lr schedule: linear
weight decay: 0.005
#epoch: 2000, batch size: 100000
resume: no
positive reuse: 1, negative weight: 5
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
Batch id: 0 / 6548
loss = 0
Batch id: 1000 / 6548
loss = 0.38787
Batch id: 2000 / 6548
loss = 0.383354
Batch id: 3000 / 6548
loss = 0.379194
Batch id: 4000 / 6548
loss = 0.375888
Batch id: 5000 / 6548
loss = 0.372947
Batch id: 6000 / 6548
loss = 0.370918
[time] GraphApplication.train: 7.64435 s
------------- link prediction --------------
effective edges: 6645 / 6650
effective filter edges: 327429 / 327429
remaining edges: 6645 / 6645
Traceback (most recent call last):
File "/usr/local/bin/graphvite", line 11, in <module>
load_entry_point('graphvite==0.2.2', 'console_scripts', 'graphvite')()
File "/usr/local/lib/python3.7/site-packages/graphvite/cmd.py", line 273, in main
command[args.command](args)
File "/usr/local/lib/python3.7/site-packages/graphvite/cmd.py", line 235, in baseline_main
app.evaluate(**evaluation)
File "/usr/local/lib/python3.7/site-packages/graphvite/util.py", line 162, in wrapper
result = function(*args, **kwargs)
File "/usr/local/lib/python3.7/site-packages/graphvite/application/application.py", line 124, in evaluate
result = getattr(self, func_name)(**kwargs)
File "/usr/local/lib/python3.7/site-packages/graphvite/application/application.py", line 436, in link_prediction
model = model.cuda()
File "/usr/local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 307, in cuda
return self._apply(lambda t: t.cuda(device))
File "/usr/local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 203, in _apply
module._apply(fn)
File "/usr/local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 203, in _apply
module._apply(fn)
File "/usr/local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 225, in _apply
param_applied = fn(param)
File "/usr/local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 307, in <lambda>
return self._apply(lambda t: t.cuda(device))
File "/usr/local/lib/python3.7/site-packages/torch/cuda/__init__.py", line 149, in _lazy_init
_check_driver()
File "/usr/local/lib/python3.7/site-packages/torch/cuda/__init__.py", line 63, in _check_driver
of the CUDA driver.""".format(str(torch._C._cuda_getDriverVersion())))
AssertionError:
The NVIDIA driver on your system is too old (found version 10010).
Please update your GPU driver by downloading and installing a new
version from the URL: http://www.nvidia.com/Download/index.aspx
Alternatively, go to: https://pytorch.org to install
a PyTorch version that has been compiled with your version
of the CUDA driver.
Is there a way to conda (or pip) install a graphvite that works with my version, so I don't have to go through the painful CUDA re-installation? I noticed you have many files on conda-forge that correspond to different versions of cuda.
@benman1
We haven't tried installing GraphVite in Docker. For a conda installation, I think it should be fine as long as you explicitly specify the Python and CUDA versions, like conda install -c conda-forge graphvite python=3.7 cudatoolkit=10.0
According to your last log, GraphVite seems to be working correctly and the error is in the PyTorch evaluation step. Maybe the default PyTorch installed by conda is too new for the driver. Could you try pinning a specific PyTorch version in conda install? GraphVite should work with any version >= 1.0
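For reading the "found version 10010" in the assertion above: CUDA reports versions as a single integer, 1000 * major + 10 * minor, so 10010 means the driver supports up to CUDA 10.1. A minimal sketch of the decoding and of the driver-versus-runtime comparison; the function names are hypothetical, and the 10020 value is only an illustrative stand-in for a newer PyTorch build, since the thread does not say which CUDA version that PyTorch was compiled with:

```python
def decode_cuda_version(encoded):
    """Split CUDA's integer version (1000 * major + 10 * minor) into (major, minor)."""
    return encoded // 1000, (encoded % 1000) // 10


def driver_supports(driver_encoded, runtime_encoded):
    """A runtime needs a driver that supports at least the same CUDA version."""
    return driver_encoded >= runtime_encoded


print(decode_cuda_version(10010))     # (10, 1)
print(driver_supports(10010, 10000))  # True: a CUDA 10.0 runtime fits this driver
print(driver_supports(10010, 10020))  # False: a CUDA 10.2 build would be too new
```

Under that reading, a PyTorch build compiled for CUDA 10.1 or older should pass the driver check on this machine.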
I installed graphvite in an ubuntu18-cudnn7-cuda10.1-python3.7 Docker image.
The conda version is
conda 4.8.3
conda list graphvite
gave the following result:
I ran
graphvite baseline quick start
and got the error:
I did not modify anything after cloning the code. What are the possible problems?