DeepGraphLearning / graphvite

GraphVite: A General and High-performance Graph Embedding System
https://graphvite.io
Apache License 2.0

Illegal memory access #67

Open ajzenhamernikola opened 4 years ago

ajzenhamernikola commented 4 years ago

Hi, I'm currently encountering the following problem when trying to use node2vec for node embedding:

<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
GraphSolver<32, float32, uint32>
----------------- Resource -----------------
#worker: 1, #sampler: 11, #partition: 1
tied weights: no, episode size: 200
gpu memory limit: 3.53 GiB
gpu memory cost: 59.6 MiB
----------------- Sampling -----------------
augmentation step: 1, p: 1, q: 1
random walk length: 40
random walk batch size: 100
#negative: 1, negative sample exponent: 0.75
----------------- Training -----------------
model: node2vec
optimizer: SGD
learning rate: 0.025, lr schedule: linear
weight decay: 0.005
#epoch: 2000, batch size: 100000
resume: no
positive reuse: 1, negative weight: 5
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
Batch id: 0 / 7122
loss = -nan

Check failed: error == cudaSuccess CUDA error an illegal memory access was encountered at /network/home/zhuzhaoc/.local/envs/build/conda-bld/graphvite_1584598935508/work/include/core/solver.h:1539
*** Check failure stack trace: ***
    @     0x7f5f873d24dd  google::LogMessage::Fail()
    @     0x7f5f873da071  google::LogMessage::SendToLog()
    @     0x7f5f873d1ecd  google::LogMessage::Flush()
    @     0x7f5f873d376a  google::LogMessageFatal::~LogMessageFatal()
    @     0x7f5f87537b7b  graphvite::WorkerMixin<>::train()
    @     0x7f5ff1791163  execute_native_thread_routine
    @     0x7f5fff02b609  start_thread
    @     0x7f5ffef52103  clone
    @              (nil)  (unknown)

The code I'm using is below. It runs inside a loop that loads a different graph on each iteration, each from its own edgelist_filename (a str):

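# Imports (assumed: the aliases below are inferred from the names used in this snippet)
import numpy as np
import graphvite.graph as vite_graph
import graphvite.solver as vite_solver

# Prepare graph for Node2Vec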
v_graph = vite_graph.Graph()
v_graph.load(edgelist_filename, as_undirected=False)

# Train Node2Vec hidden data
embed = vite_solver.GraphSolver(dim=32)
embed.build(v_graph)
embed.train(model='node2vec', num_epoch=2000, resume=False, augmentation_step=1, random_walk_length=40,
            random_walk_batch_size=100, shuffle_base=1, p=1, q=1, positive_reuse=1,
            negative_sample_exponent=0.75, negative_weight=5, log_frequency=1000)

# Extract embedded feature data
features = np.array(np.copy(embed.vertex_embeddings), dtype=np.float32)

# Clear memory and data on CPU and GPU
embed.clear()

The weird thing is that this happens sporadically: the next time I run the same code on the same edgelist_filename, it works. So the only real problem is that I have to keep rerunning the code until all my data is processed.
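
Until the cause is found, the manual reruns could be automated. The following is a minimal sketch, not a fix for the crash and not part of GraphVite. It assumes the embedding of a single graph is moved into its own script, so that a fatal check failure like the one in the trace ends only a child process while the outer loop survives and retries. The function name run_with_retries and the script name embed_one.py are hypothetical.

```python
import subprocess
import sys


def run_with_retries(cmd, max_attempts=3):
    """Run cmd in a fresh child process, retrying while it exits abnormally.

    Returns the 1-based number of the attempt that succeeded,
    or None if every attempt failed.
    """
    for attempt in range(1, max_attempts + 1):
        # Each attempt starts a new process, so a crash in the child
        # does not take down this parent loop.
        result = subprocess.run(cmd)
        if result.returncode == 0:
            return attempt
        print(
            f"attempt {attempt} failed with return code {result.returncode}",
            file=sys.stderr,
        )
    return None


# Hypothetical usage: embed_one.py is an assumed script that embeds one
# edge list and saves the features to disk before exiting.
#
# for edgelist_filename in edgelist_filenames:
#     run_with_retries([sys.executable, "embed_one.py", edgelist_filename])
```

Starting a new process per graph also gives every run a fresh CUDA context, so one failed run cannot leave the GPU state of the next run corrupted.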

I'm using cudatoolkit=10.1 and graphvite version 0.2.2 build py37cuda101hd3e7edd from conda channel milagraph.

KiddoZhu commented 4 years ago

It's an illegal memory access on the GPU, which is really weird. Could you provide a graph dataset that reproduces this error?

ajzenhamernikola commented 4 years ago

Here is the smallest dataset I used to reproduce the error: deepgraphlearning-graphvite-issue-67.zip. I have also attached the terminal output and the filenames that were used. The program crashes while processing the last file (SAT_Competition2009/CRAFTED/rbsat/random/unforced/rbsat-v760c43649g4.cnf.edgelist).

Note that for this dataset I used an embedding dimension of 64 instead of the 32 shown above. Everything else remained the same.

KiddoZhu commented 4 years ago

Thanks! We will try to reproduce that.

porimol commented 3 years ago

@KiddoZhu I'm also encountering the same problem. Is there any update?

HanwGeek commented 3 years ago

@KiddoZhu Hi, I'm also hitting the same problem, with the following error:

Check failed: error == cudaSuccess CUDA error an illegal memory access was encountered at /network/home/zhuzhaoc/.local/envs/build/conda-bld/graphvite_1584598935508/work/include/core/solver.h:1539