awslabs / dgl-ke

High performance, easy-to-use, and scalable package for learning large-scale knowledge graph embeddings.
https://dglke.dgl.ai/doc/
Apache License 2.0

Bus error when training the whole Freebase #250

Closed by yhshu 2 years ago

yhshu commented 2 years ago

Hi there, I've encountered a similar problem: training fails with "Bus error" and no core dump. I'm using a Docker container with 500+ GB of memory to train on the whole Freebase dataset with ComplEx, so I don't think memory size is the cause. The bus error occurs whether or not GPUs are used, and it happens before initialization even finishes. Could anyone help? Thanks.

dglke_train --model_name ComplEx --dataset Freebase --log_interval 100 \
  --batch_size 1024 --neg_sample_size 256 --hidden_dim 400 --gamma 143.0 --lr 0.1 --max_step 50000 \
  --batch_size_eval 1000 --neg_sample_size_eval 1000 --test -adv --num_thread 1 --num_proc 48
Reading train triples....
Finished. Read 304727650 train triples.
Reading valid triples....
Finished. Read 16929318 valid triples.
Reading test triples....
Finished. Read 16929308 test triples.
|Train|: 304727650
random partition 304727650 edges into 48 parts
part 0 has 6348493 edges
part 1 has 6348493 edges
part 2 has 6348493 edges
part 3 has 6348493 edges
part 4 has 6348493 edges
part 5 has 6348493 edges
part 6 has 6348493 edges
part 7 has 6348493 edges
part 8 has 6348493 edges
part 9 has 6348493 edges
part 10 has 6348493 edges
part 11 has 6348493 edges
part 12 has 6348493 edges
part 13 has 6348493 edges
part 14 has 6348493 edges
part 15 has 6348493 edges
part 16 has 6348493 edges
part 17 has 6348493 edges
part 18 has 6348493 edges
part 19 has 6348493 edges
part 20 has 6348493 edges
part 21 has 6348493 edges
part 22 has 6348493 edges
part 23 has 6348493 edges
part 24 has 6348493 edges
part 25 has 6348493 edges
part 26 has 6348493 edges
part 27 has 6348493 edges
part 28 has 6348493 edges
part 29 has 6348493 edges
part 30 has 6348493 edges
part 31 has 6348493 edges
part 32 has 6348493 edges
part 33 has 6348493 edges
part 34 has 6348493 edges
part 35 has 6348493 edges
part 36 has 6348493 edges
part 37 has 6348493 edges
part 38 has 6348493 edges
part 39 has 6348493 edges
part 40 has 6348493 edges
part 41 has 6348493 edges
part 42 has 6348493 edges
part 43 has 6348493 edges
part 44 has 6348493 edges
part 45 has 6348493 edges
part 46 has 6348493 edges
part 47 has 6348479 edges
/home/aiscuser/.local/lib/python3.8/site-packages/dgl/base.py:25: UserWarning: multigraph will be deprecated.DGL will treat all graphs as multigraph in the future.
  warnings.warn(msg, warn_type)
|valid|: 16929318
|test|: 16929308
Bus error

yhshu commented 2 years ago

Also, training on FB15K works fine on this machine.

classicsong commented 2 years ago

How much CPU memory do you have? This is probably caused by running out of memory. You should also check the shared memory size.
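As a quick way to act on the shared-memory suggestion, the sketch below reports the capacity of the shared-memory filesystem from inside the container. It assumes a Linux system where shared memory is mounted at /dev/shm, and that multi-process training keeps its embeddings there; the thread confirms neither. The helper name shm_report is made up for illustration.

```python
import os
import shutil


def shm_report(path="/dev/shm"):
    """Return (total_gib, free_gib) for the filesystem mounted at `path`."""
    usage = shutil.disk_usage(path)
    gib = 1024 ** 3
    return usage.total / gib, usage.free / gib


# Assumption: shared memory is mounted at /dev/shm (typical on Linux).
if os.path.isdir("/dev/shm"):
    total, free = shm_report("/dev/shm")
    print(f"/dev/shm total: {total:.2f} GiB, free: {free:.2f} GiB")
```

Docker gives a container only 64 MB of /dev/shm by default, regardless of host RAM, and writing past that limit can surface as a bus error. If the reported total is that small, restarting the container with a larger docker run --shm-size value may be worth trying.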

yhshu commented 2 years ago

How much CPU memory do you have? This is probably caused by running out of memory. You should also check the shared memory size.

About 600 GB, but inside a Docker container. It may be a memory issue.
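To judge whether memory is plausible as the cause, a back-of-the-envelope estimate of the entity embedding table can help. This is only a rough sketch: it assumes dense float32 storage, it ignores optimizer state, relation embeddings, and the graph itself, and the entity count below is a hypothetical placeholder because the thread does not state it.

```python
def embedding_gib(num_rows, dim, bytes_per_value=4):
    """Size in GiB of a dense table holding num_rows x dim values."""
    return num_rows * dim * bytes_per_value / 1024 ** 3


# Hypothetical entity count for illustration; not stated in the thread.
NUM_ENTITIES = 80_000_000
HIDDEN_DIM = 400  # taken from --hidden_dim 400 in the command above

print(f"Entity embeddings: {embedding_gib(NUM_ENTITIES, HIDDEN_DIM):.1f} GiB")
```

If the result is well below total RAM but far above the container's shared-memory limit, the shared-memory size is the more likely bottleneck than the RAM itself.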

yhshu commented 2 years ago

I've tried a smaller subgraph, roughly 10% of Freebase, and it worked. Thanks.