awslabs / dgl-ke

High performance, easy-to-use, and scalable package for learning large-scale knowledge graph embeddings.
https://dglke.dgl.ai/doc/
Apache License 2.0

Advice on 256GB Memory to train a large graph #162

Closed: walton-wang929 closed this issue 4 years ago

walton-wang929 commented 4 years ago

hello guys, I have a server with 256 GB of RAM, 40 cores, and 4 V100 GPUs.

here is my KG intro:

The KG has 1,825,826 relations, 76,914,149 entities, 382,606,627 train triples, 58,387,397 test triples, and 58,387,435 valid triples.

I tried using the whole dataset as well as 1/2 and 1/4 of it; all of these configurations led to OUT OF MEMORY.

The data split is 95/5/5 for train/val/test.

here is my shell command:

DGLBACKEND=pytorch dglke_train --model_name TransE_l2 --data_path ./data/360KG/ --format udd_hrt \
  --dataset 360KG --data_files entities.dict relation.dict train_1_4.txt valid_1_4_new.txt test_1_4_new.txt --save_path ./run-exp/360KG \
  --max_step 30000 --batch_size 1000 --batch_size_eval 16 --neg_sample_size 200 --log_interval 100 \
  --hidden_dim 400 --gamma 19.9 --lr 0.25 --regularization_coef 1.00E-09 \
  --test -adv --mix_cpu_gpu --num_proc 40 --num_thread 20 --force_sync_interval 1000 \
  --gpu 0 1 2 3 --regularization_coef 1e-9 --neg_sample_size_eval 10000 --no_eval_filter

thx for the help!!!!

classicsong commented 4 years ago

Your KG is very large. You have 400M edges, so you can do a simple calculation of your memory usage: storing the COO graph for 400M edges together with embeddings of dim size 64 will take at least 200 GB. It is better to find a machine with more memory or to use distributed training.

walton-wang929 commented 4 years ago

@classicsong thx for your advice. I reduced the hidden size from 400 to 64, and now it can train. Can you tell me how to calculate memory usage from the KG size? I can't find much documentation about the calculation.

classicsong commented 4 years ago

entity_embed_mem_size = num_entity * 4 * hidden_size / 1024 / 1024 / 1024
relation_embed_mem_size = num_rel * 4 * hidden_size / 1024 / 1024 / 1024
graph_mem_size = num_edges * 2 * 8 / 1024 / 1024 / 1024

For storage we use COO, so we need to store num_edges * 2 node indices (one for the head and one for the tail of each edge). Each index takes 8 bytes, so the graph takes about 6 GB of memory. For the entity embeddings, each entity takes 4 * hidden_size bytes to store its embedding. The same holds for the relation embeddings.

With dim=400, it will take 114.6 GB for the entity embeddings, 2.7 GB for the relation embeddings, and 6 GB for the initial data storage. The system will take more memory than that during execution. Maybe you can also try dim=128 or 256.
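
For reference, here is a minimal Python sketch of the sizing formulas above, plugged in with the counts from this thread. It assumes float32 (4-byte) embeddings and two int64 (8-byte) indices per edge for the COO graph; the helper name and the list of dims swept are just illustrative.

```python
# Back-of-envelope static memory estimate for KG embedding training,
# following the formulas above: float32 embeddings + COO edge index.
GB = 1024 ** 3

def estimate_mem_gb(num_entity, num_rel, num_edges, hidden_size):
    entity_embed = num_entity * 4 * hidden_size / GB     # entity embedding table
    relation_embed = num_rel * 4 * hidden_size / GB       # relation embedding table
    graph = num_edges * 2 * 8 / GB                        # head/tail int64 indices
    return entity_embed, relation_embed, graph

# Counts from this thread (entities, relations, train triples).
num_entity, num_rel, num_edges = 76_914_149, 1_825_826, 382_606_627

for dim in (400, 256, 128, 64):
    ent, rel, graph = estimate_mem_gb(num_entity, num_rel, num_edges, dim)
    print(f"dim={dim}: entity={ent:.1f} GB, relation={rel:.1f} GB, "
          f"graph={graph:.1f} GB, total={ent + rel + graph:.1f} GB (before runtime overhead)")
```

At dim=400 this reproduces the ~114.6 GB + 2.7 GB + 6 GB figures above; at dim=64 the static footprint drops to roughly 24 GB, which is consistent with training fitting on the 256 GB machine after the hidden size was reduced, leaving the rest of the RAM for training-time buffers and worker processes.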

walton-wang929 commented 4 years ago

it is very clear, thank you very much @classicsong