DeepGraphLearning / graphvite

GraphVite: A General and High-performance Graph Embedding System
https://graphvite.io
Apache License 2.0

Aborted (core dumped) #57

Open zhixiaochuan12 opened 4 years ago

zhixiaochuan12 commented 4 years ago

I installed GraphVite in a Docker container (Ubuntu 18, cuDNN 7, CUDA 10.1, Python 3.7).

conda version is conda 4.8.3

conda list graphvite got the following result:

# packages in environment at /usr/local/anaonda3:
#
# Name Version Build Channel
graphvite 0.2.2 py37cuda101hd3e7edd milagraph

I ran graphvite baseline quick start and got the error:

running baseline: demo/quick_start.yaml
downloading https://www.dropbox.com/s/cf21ouuzd563cqx/BlogCatalog-dataset.zip?dl=1 to BlogCatalog-dataset.zip
extracting BlogCatalog-dataset/data/edges.csv from BlogCatalog-dataset.zip to edges.csv
converting edges.csv to blogcatalog_graph.txt
splitting graph blogcatalog_graph.txt into blogcatalog_train.txt, blogcatalog_valid.txt, blogcatalog_test.txt
extracting BlogCatalog-dataset/data/group-edges.csv from BlogCatalog-dataset.zip to group-edges.csv
converting group-edges.csv to blogcatalog_label.txt
loading graph from /home/hadoop-aipnlp/.graphvite/dataset/blogcatalog/blogcatalog_train.txt
0.00018755%
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
Graph<uint32>
------------------ Graph -------------------
#vertex: 10308, #edge: 327429
as undirected: yes, normalization: no
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
[time] GraphApplication.load: 0.109521 s
Check failed: error == CURAND_STATUS_SUCCESS CURAND error 203 at /network/home/zhuzhaoc/.local/envs/build/conda-bld/graphvite_1584598935508/work/include/core/solver.h:950
Check failure stack trace:
@ 0x7f58fe9994dd google::LogMessage::Fail()
@ 0x7f58fe9a1071 google::LogMessage::SendToLog()
@ 0x7f58fe998ecd google::LogMessage::Flush()
@ 0x7f58fe99a76a google::LogMessageFatal::~LogMessageFatal()
@ 0x7f5902a68417 graphvite::CurandCheck()
@ 0x7f5902bcb8d4 graphvite::SolverMixin<>::SolverMixin()
@ 0x7f5902c11f1d _ZZN8pybind1112cpp_function10initializeIZNS_6detail8initimpl11constructorIJSt6vectorIiSaIiEEimEE7executeINS_6class_IN9graphvite11GraphSolverILm128EfjEEJEEEJNS_10call_guardIJNS_18gil_scoped_releaseEEEENS_5arg_vESI_SI_ELi0EEEvRT_DpRKT0_EUlRNS2_16value_and_holderES7_imE_vJSQ_S7_imEJNS_4nameENS_9is_methodENS_7siblingENS2_24is_new_style_constructorESH_SI_SI_SI_EEEvOSJ_PFT0_DpT1_EDpRKT2_ENUlRNS2_13function_callEE1_4FUNES17
@ 0x7f5902b25529 pybind11::cpp_function::dispatcher()
@ 0x55bcef415a55 _PyMethodDef_RawFastCallDict
@ 0x55bcef415bd1 _PyCFunction_FastCallDict
@ 0x55bcef413943 _PyObject_Call_Prepend
@ 0x55bcef406b9e PyObject_Call
@ 0x55bcef393726 slot_tp_init.cold.2021
@ 0x55bcef452bc7 type_call
@ 0x55bcef406b9e PyObject_Call
@ 0x55bcef49f9f5 _PyEval_EvalFrameDefault
@ 0x55bcef3f76f9 _PyEval_EvalCodeWithName
@ 0x55bcef3f8805 _PyFunction_FastCallDict
@ 0x55bcef4139be _PyObject_Call_Prepend
@ 0x55bcef45538e slot_tp_new
@ 0x55bcef452dc9 _PyObject_FastCallKeywords
@ 0x55bcef4a2e8f _PyEval_EvalFrameDefault
@ 0x55bcef3f76f9 _PyEval_EvalCodeWithName
@ 0x55bcef3f8a30 _PyFunction_FastCallDict
@ 0x55bcef413943 _PyObject_Call_Prepend
@ 0x55bcef406b9e PyObject_Call
@ 0x55bcef49f9f5 _PyEval_EvalFrameDefault
@ 0x55bcef3f76f9 _PyEval_EvalCodeWithName
@ 0x55bcef3f8a30 _PyFunction_FastCallDict
@ 0x55bcef49f9f5 _PyEval_EvalFrameDefault
@ 0x55bcef3f8030 _PyEval_EvalCodeWithName
@ 0x55bcef3f8a30 _PyFunction_FastCallDict
Aborted (core dumped)

I did not modify anything after cloning the code. What could be the problem?

KiddoZhu commented 4 years ago

The failing line is the cuRAND initialization. You'd better switch to CUDA 10.0.

I'm sorry but this is because the FAISS library required by GraphVite doesn't compile well on CUDA 10.1. Maybe this could be fixed some time later.

KiddoZhu commented 4 years ago

You may try

conda install -c milagraph -c conda-forge graphvite cudatoolkit=10.0
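To confirm which CUDA toolkit an installed package targets, the Build column of conda list can be inspected. The sketch below assumes the py<XY>cuda<XYZ> naming seen in the build strings quoted in this thread (for example py37cuda101hd3e7edd); the function name parse_build_string is hypothetical, and other packages may encode their builds differently.

```python
import re


def parse_build_string(build):
    """Extract Python and CUDA versions from a conda build string.

    Assumes the 'py<XY>cuda<XYZ>' convention seen in this thread,
    e.g. 'py37cuda101hd3e7edd'. Returns None if the pattern is absent.
    """
    # The last digit after 'cuda' is taken as the minor version.
    match = re.match(r"py(\d)(\d+)cuda(\d+)(\d)", build)
    if match is None:
        return None
    py_major, py_minor, cuda_major, cuda_minor = match.groups()
    return {
        "python": f"{py_major}.{py_minor}",
        "cuda": f"{cuda_major}.{cuda_minor}",
    }


if __name__ == "__main__":
    print(parse_build_string("py37cuda101hd3e7edd"))
    print(parse_build_string("py37cuda100hd3e7edd"))
```

If the reported CUDA version is not 10.0, the install command above did not take effect in the active environment.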

zhixiaochuan12 commented 4 years ago

Thanks for your reply! @KiddoZhu

I changed to the ubuntu18-cudnn7-cuda10.0-python3.7 version docker by running docker pull nvidia/cuda:10.0-cudnn7-devel-ubuntu18.04 and installing conda 4.8.3, but it did not work either.

conda list graphvite

# packages in environment at /usr/local/anaconda3:
#
# Name Version Build Channel
graphvite 0.2.2 py37cuda100hd3e7edd milagraph

The error is nearly the same after running graphvite baseline quick start

running baseline: demo/quick_start.yaml
downloading https://www.dropbox.com/s/cf21ouuzd563cqx/BlogCatalog-dataset.zip?dl=1 to BlogCatalog-dataset.zip
extracting BlogCatalog-dataset/data/edges.csv from BlogCatalog-dataset.zip to edges.csv
converting edges.csv to blogcatalog_graph.txt
splitting graph blogcatalog_graph.txt into blogcatalog_train.txt, blogcatalog_valid.txt, blogcatalog_test.txt
extracting BlogCatalog-dataset/data/group-edges.csv from BlogCatalog-dataset.zip to group-edges.csv
converting group-edges.csv to blogcatalog_label.txt
loading graph from /root/.graphvite/dataset/blogcatalog/blogcatalog_train.txt
0.00018755%
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
Graph<uint32>
------------------ Graph -------------------
#vertex: 10308, #edge: 327429
as undirected: yes, normalization: no
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
[time] GraphApplication.load: 0.104428 s
CPU threads is beyond the hardware concurrency
Check failed: error == cudaSuccess CUDA error CUDA driver version is insufficient for CUDA runtime version at /network/home/zhuzhaoc/.local/envs/build/conda-bld/graphvite_1584598935508/work/include/core/solver.h:203
Check failure stack trace:
@ 0x7f9f72dbb4dd google::LogMessage::Fail()
@ 0x7f9f72dc3071 google::LogMessage::SendToLog()
@ 0x7f9f72dbaecd google::LogMessage::Flush()
@ 0x7f9f72dbc76a google::LogMessageFatal::~LogMessageFatal()
@ 0x7f9f76f91fdf graphvite::CudaCheck()
@ 0x7f9f770f4d64 graphvite::SolverMixin<>::SolverMixin()
@ 0x7f9f7713af1d _ZZN8pybind1112cpp_function10initializeIZNS_6detail8initimpl11constructorIJSt6vectorIiSaIiEEimEE7executeINS_6class_IN9graphvite11GraphSolverILm128EfjEEJEEEJNS_10call_guardIJNS_18gil_scoped_releaseEEEENS_5arg_vESI_SI_ELi0EEEvRT_DpRKT0_EUlRNS2_16value_and_holderES7_imE_vJSQ_S7_imEJNS_4nameENS_9is_methodENS_7siblingENS2_24is_new_style_constructorESH_SI_SI_SI_EEEvOSJ_PFT0_DpT1_EDpRKT2_ENUlRNS2_13function_callEE1_4FUNES17
@ 0x7f9f7704e529 pybind11::cpp_function::dispatcher()
@ 0x55c62ccb9a55 _PyMethodDef_RawFastCallDict
@ 0x55c62ccb9bd1 _PyCFunction_FastCallDict
@ 0x55c62ccb7943 _PyObject_Call_Prepend
@ 0x55c62ccaab9e PyObject_Call
@ 0x55c62cc37726 slot_tp_init.cold.2021
@ 0x55c62ccf6bc7 type_call
@ 0x55c62ccaab9e PyObject_Call
@ 0x55c62cd439f5 _PyEval_EvalFrameDefault
@ 0x55c62cc9b6f9 _PyEval_EvalCodeWithName
@ 0x55c62cc9c805 _PyFunction_FastCallDict
@ 0x55c62ccb79be _PyObject_Call_Prepend
@ 0x55c62ccf938e slot_tp_new
@ 0x55c62ccf6dc9 _PyObject_FastCallKeywords
@ 0x55c62cd46e8f _PyEval_EvalFrameDefault
@ 0x55c62cc9b6f9 _PyEval_EvalCodeWithName
@ 0x55c62cc9ca30 _PyFunction_FastCallDict
@ 0x55c62ccb7943 _PyObject_Call_Prepend
@ 0x55c62ccaab9e PyObject_Call
@ 0x55c62cd439f5 _PyEval_EvalFrameDefault
@ 0x55c62cc9b6f9 _PyEval_EvalCodeWithName
@ 0x55c62cc9ca30 _PyFunction_FastCallDict
@ 0x55c62cd439f5 _PyEval_EvalFrameDefault
@ 0x55c62cc9c030 _PyEval_EvalCodeWithName
@ 0x55c62cc9ca30 _PyFunction_FastCallDict
Aborted

KiddoZhu commented 4 years ago

Could you check either of these two?

nvcc -V: the CUDA version in the Docker container
ldd libgraphvite.so | grep cuda: the CUDA libraries GraphVite is linked to

libgraphvite.so should be somewhere under /lib/python3.7/site-packages/graphvite/
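The two checks above can also be scripted. This is a minimal sketch, not part of GraphVite: the helper names nvcc_release, linked_cudart and run are hypothetical, the library path is a placeholder to adjust, and the regular expressions assume the usual nvcc -V and ldd output formats.

```python
import re
import subprocess


def nvcc_release(text):
    """Return the toolkit release (e.g. '10.0') from `nvcc -V` output, or None."""
    match = re.search(r"release (\d+\.\d+)", text)
    return match.group(1) if match else None


def linked_cudart(text):
    """Return the libcudart version (e.g. '10.0') from `ldd` output, or None."""
    match = re.search(r"libcudart\.so\.(\d+\.\d+)", text)
    return match.group(1) if match else None


def run(cmd):
    """Run a command and return its stdout, or '' if it cannot be started."""
    try:
        return subprocess.run(cmd, capture_output=True, text=True, check=False).stdout
    except OSError:
        return ""


if __name__ == "__main__":
    # Placeholder path: point this at the actual libgraphvite.so.
    lib = "libgraphvite.so"
    print("nvcc release:    ", nvcc_release(run(["nvcc", "-V"])))
    print("linked libcudart:", linked_cudart(run(["ldd", lib])))
```

If the two versions agree and the failure persists, the toolkit and the library are consistent with each other, so the mismatch likely lies elsewhere, for example in the host driver.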

zhixiaochuan12 commented 4 years ago

Sure.

nvcc -V -->

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2018 NVIDIA Corporation
Built on Sat_Aug_25_21:08:01_CDT_2018
Cuda compilation tools, release 10.0, V10.0.130

libgraphvite.so is in the directory /usr/local/anaconda3/lib/python3.7/site-packages/graphvite/lib.

ldd libgraphvite.so | grep cuda got the output:

libcudart.so.10.0 => /usr/local/anaconda3/lib/python3.7/site-packages/graphvite/lib/./../../../../libcudart.so.10.0 (0x00007fcc65a66000)

In the directory /usr/local/anaconda3/lib/python3.7/site-packages/graphvite/lib/./../../../../, running ls | grep cuda -->

cudatoolkit_config.yaml
libcudart.so
libcudart.so.10.0
libcudart.so.10.0.130
libicudata.a
libicudata.so
libicudata.so.58
libicudata.so.58.2

@KiddoZhu hi, do you have any other suggestions?

benman1 commented 4 years ago

I also get an abort, but only after training finishes.

$ graphvite baseline quick start
running baseline: demo/quick_start.yaml
loading graph from /home/ben/.graphvite/dataset/blogcatalog/blogcatalog_train.txt
0.00018755%
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
Graph<uint32>
------------------ Graph -------------------
#vertex: 10308, #edge: 327429
as undirected: yes, normalization: no
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
[time] GraphApplication.load: 0.0789478 s
[time] GraphApplication.build: 0.665795 s
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
GraphSolver<128, float32, uint32>
----------------- Resource -----------------
#worker: 1, #sampler: 7, #partition: 1
tied weights: no, episode size: 500
gpu memory limit: 23.3 GiB
gpu memory cost: 51.5 MiB
----------------- Sampling -----------------
augmentation step: 2, shuffle base: 2
random walk length: 40
random walk batch size: 100
#negative: 1, negative sample exponent: 0.75
----------------- Training -----------------
model: LINE
optimizer: SGD
learning rate: 0.025, lr schedule: linear
weight decay: 0.005
#epoch: 2000, batch size: 100000
resume: no
positive reuse: 1, negative weight: 5
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
Batch id: 0 / 6548
loss = 0
Batch id: 1000 / 6548
loss = 0.387679
Batch id: 2000 / 6548
loss = 0.38329
Batch id: 3000 / 6548
loss = 0.379579
Batch id: 4000 / 6548
loss = 0.37587
Batch id: 5000 / 6548
loss = 0.372847
Batch id: 6000 / 6548
loss = 0.371051
[time] GraphApplication.train: 7.58616 s
------------- link prediction --------------
terminate called after throwing an instance of 'std::runtime_error'
  what():  generic_type: cannot initialize type "WorkerId": an object with that name is already defined
Aborted (core dumped)

benman1 commented 4 years ago

I think this is a fantastic library, but I burned through some time already trying to set up graphvite with different settings. A minimal example of how to make this run would really go a long way. How do I make this work under docker? A poster above suggested something along the lines of

docker pull nvidia/cuda:10.0-cudnn7-devel-ubuntu18.04

...but that didn't work. Could you suggest a different image instead?

benman1 commented 4 years ago

I tried again after removing pytorch and reinstalling (instead of just -U upgrading). Now I am getting something more promising, but still not there.

$ graphvite baseline quick start
running baseline: demo/quick_start.yaml
loading graph from /home/ben/.graphvite/dataset/blogcatalog/blogcatalog_train.txt
0.00018755%
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
Graph<uint32>
------------------ Graph -------------------
#vertex: 10308, #edge: 327429
as undirected: yes, normalization: no
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
[time] GraphApplication.load: 0.0757065 s
[time] GraphApplication.build: 0.659924 s
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
GraphSolver<128, float32, uint32>
----------------- Resource -----------------
#worker: 1, #sampler: 7, #partition: 1
tied weights: no, episode size: 500
gpu memory limit: 23.3 GiB
gpu memory cost: 51.5 MiB
----------------- Sampling -----------------
augmentation step: 2, shuffle base: 2
random walk length: 40
random walk batch size: 100
#negative: 1, negative sample exponent: 0.75
----------------- Training -----------------
model: LINE
optimizer: SGD
learning rate: 0.025, lr schedule: linear
weight decay: 0.005
#epoch: 2000, batch size: 100000
resume: no
positive reuse: 1, negative weight: 5
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
Batch id: 0 / 6548
loss = 0
Batch id: 1000 / 6548
loss = 0.38787
Batch id: 2000 / 6548
loss = 0.383354
Batch id: 3000 / 6548
loss = 0.379194
Batch id: 4000 / 6548
loss = 0.375888
Batch id: 5000 / 6548
loss = 0.372947
Batch id: 6000 / 6548
loss = 0.370918
[time] GraphApplication.train: 7.64435 s
------------- link prediction --------------
effective edges: 6645 / 6650
effective filter edges: 327429 / 327429
remaining edges: 6645 / 6645
Traceback (most recent call last):
  File "/usr/local/bin/graphvite", line 11, in <module>
    load_entry_point('graphvite==0.2.2', 'console_scripts', 'graphvite')()
  File "/usr/local/lib/python3.7/site-packages/graphvite/cmd.py", line 273, in main
    command[args.command](args)
  File "/usr/local/lib/python3.7/site-packages/graphvite/cmd.py", line 235, in baseline_main
    app.evaluate(**evaluation)
  File "/usr/local/lib/python3.7/site-packages/graphvite/util.py", line 162, in wrapper
    result = function(*args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/graphvite/application/application.py", line 124, in evaluate
    result = getattr(self, func_name)(**kwargs)
  File "/usr/local/lib/python3.7/site-packages/graphvite/application/application.py", line 436, in link_prediction
    model = model.cuda()
  File "/usr/local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 307, in cuda
    return self._apply(lambda t: t.cuda(device))
  File "/usr/local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 203, in _apply
    module._apply(fn)
  File "/usr/local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 203, in _apply
    module._apply(fn)
  File "/usr/local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 225, in _apply
    param_applied = fn(param)
  File "/usr/local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 307, in <lambda>
    return self._apply(lambda t: t.cuda(device))
  File "/usr/local/lib/python3.7/site-packages/torch/cuda/__init__.py", line 149, in _lazy_init
    _check_driver()
  File "/usr/local/lib/python3.7/site-packages/torch/cuda/__init__.py", line 63, in _check_driver
    of the CUDA driver.""".format(str(torch._C._cuda_getDriverVersion())))
AssertionError:
The NVIDIA driver on your system is too old (found version 10010).
Please update your GPU driver by downloading and installing a new
version from the URL: http://www.nvidia.com/Download/index.aspx
Alternatively, go to: https://pytorch.org to install
a PyTorch version that has been compiled with your version
of the CUDA driver.

Is there a way to conda (or pip) install a graphvite that works with my version, so I don't have to go through the painful CUDA re-installation? I noticed you have many files on conda-forge that correspond to different versions of cuda.

KiddoZhu commented 4 years ago

@benman1 We haven't tried installing GraphVite in Docker. For a conda installation, I think it should be fine as long as you explicitly specify the Python and CUDA versions, like conda install -c conda-forge graphvite python=3.7 cudatoolkit=10.0

According to your last log, GraphVite seems to be working correctly and the error is in the PyTorch evaluation step. Maybe the default PyTorch installed by conda is too new for the driver. Could you try specifying a PyTorch version in conda install? GraphVite should work with any version >= 1.0
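For reference, the "found version 10010" in the PyTorch assertion is the CUDA version supported by the driver, encoded as 1000 * major + 10 * minor, so 10010 means CUDA 10.1. The sketch below decodes that number and compares it with the CUDA version a PyTorch build targets (PyTorch reports this as torch.version.cuda). The function names are hypothetical, and the "driver must be at least as new as the runtime" rule is a rule of thumb that ignores NVIDIA's compatibility packages.

```python
def decode_cuda_version(code):
    """Decode an integer CUDA version (1000 * major + 10 * minor) to (major, minor)."""
    return code // 1000, (code % 1000) // 10


def parse_version(text):
    """Turn a version string such as '10.2' into a comparable (major, minor) tuple."""
    major, minor = text.split(".")[:2]
    return int(major), int(minor)


def driver_supports(driver_code, runtime_version):
    """Rule of thumb: the driver's CUDA version must be >= the runtime's."""
    return decode_cuda_version(driver_code) >= parse_version(runtime_version)


if __name__ == "__main__":
    driver_code = 10010  # value reported in the traceback above
    print("driver supports CUDA up to", decode_cuda_version(driver_code))
    # "10.2" and "10.0" are example runtime versions, not values from this thread's logs.
    for runtime in ("10.2", "10.0"):
        print(runtime, "->", "ok" if driver_supports(driver_code, runtime) else "driver too old")
```

Under this rule, a PyTorch build whose CUDA version is 10.1 or lower should pass the driver check on this machine, which matches the suggestion to pin an older PyTorch rather than reinstall CUDA.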