awslabs / dgl-ke

High performance, easy-to-use, and scalable package for learning large-scale knowledge graph embeddings.
https://dglke.dgl.ai/doc/
Apache License 2.0

Advice on Limiting Memory #84

Closed AlexMRuch closed 4 years ago

AlexMRuch commented 4 years ago

It doesn't seem like there are a lot of options for limiting memory consumption in dgl-ke at the moment, so I was wondering if you have any suggestions for my problem. Presently, my model is running out of RAM at

[proc 0][Train](12000/12000) average regularization: 0.00017675260825490114
[proc 0][Train] 1000 steps take 12.623 seconds
[proc 0]sample: 2.133, forward: 5.806, backward: 2.516, update: 2.070
proc 0 takes 161.118 seconds
training takes 162.84567785263062 seconds
Process Process-3:
Traceback (most recent call last):
  File "/usr/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/usr/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.6/dist-packages/dglke/models/pytorch/tensor_models.py", line 77, in decorated_function
    raise exception.__class__(trace)
RuntimeError: Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/dglke/models/pytorch/tensor_models.py", line 65, in _queue_result
    res = func(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/dglke/train_pytorch.py", line 238, in test_mp
    test(args, model, test_samplers, rank, mode, queue)
  File "/usr/local/lib/python3.6/dist-packages/dglke/train_pytorch.py", line 214, in test
    model.forward_test(pos_g, neg_g, logs, gpu_id)
  File "/usr/local/lib/python3.6/dist-packages/dglke/models/general_models.py", line 321, in forward_test
    neg_deg_sample=self.args.neg_deg_sample_eval)
  File "/usr/local/lib/python3.6/dist-packages/dglke/models/general_models.py", line 243, in predict_neg_score
    neg_head = self.entity_emb(neg_head_ids, gpu_id, trace)
  File "/usr/local/lib/python3.6/dist-packages/dglke/models/pytorch/tensor_models.py", line 203, in __call__
    s = self.emb[idx]
RuntimeError: [enforce fail at CPUAllocator.cpp:64] . DefaultCPUAllocator: can't allocate memory: you tried to allocate 16751475200 bytes. Error code 12 (Cannot allocate memory)

The above was run with

DGLBACKEND=pytorch dglke_train \
--data_path results_SXSW_2018 \
--data_files entities.tsv relations.tsv train.tsv valid.tsv test.tsv \
--format udd_hrt \
--model_name ComplEx \
--max_step 50000 --batch_size 1000 --neg_sample_size 200 --batch_size_eval 16 \
--hidden_dim 400 --gamma 19.9 --lr 0.25 --regularization_coef=1e-9 -adv \
--gpu 0 1 --async_update --force_sync_interval 1000 --log_interval 1000 \
--test

and the dataset has the following numbers of canonical tuples:

Reading train triples....
Finished. Read 91802780 train triples.
Reading valid triples....
Finished. Read 10200309 valid triples.
Reading test triples....
Finished. Read 11333677 test triples.
|Train|: 91802780
random partition 91802780 edges into 2 parts
part 0 has 45901390 edges
part 1 has 45901390 edges

My machine has two 1080 Ti GPUs and 128GB of RAM. So this pretty much used up all the RAM right away, which is odd because the graphvite run on this knowledge graph finished fine (but took ~8 hours).

zheng-da commented 4 years ago

I see what is going on. The memory error happens in evaluation. I think you are using all nodes to generate negative edges, which will consume a lot of memory. My suggestion is to use --neg_sample_size_eval to limit the number of negative edges you want to use for evaluation.

Could you tell me how many nodes are in your knowledge graph?

AlexMRuch commented 4 years ago

Ah, yes, I just saw that in the --help output you shared and was going to put that in.

amruch@wit:~/graphika/kg/results_SXSW_2018$ wc -l entities.tsv
10469672 entities.tsv
amruch@wit:~/graphika/kg/results_SXSW_2018$ wc -l relations.tsv
16 relations.tsv

^^^ That's how many unique entities and relations I have (but I sample down entities that don't appear at least 10 times, which knocks that down by about a million).

Right now I have --batch_size_eval 16 and will add --neg_sample_size_eval 16 (but I haven't run it with --neg_sample_size_eval yet, and I probably have to do hyperparameter tuning for those numbers). If you have any suggestions, they're more than welcome.

Thanks again!

AlexMRuch commented 4 years ago

It sounds like --eval_percent 20 would be useful here too (unless that's too low). If that's the case, does that mean I can make my training set larger and cut the size of the set that's used for evaluation (is that the validation or test set)? I originally did 80% for training, 10% for validation, and 10% for testing.

Here's what I have now:

DGLBACKEND=pytorch dglke_train \
--data_path results_SXSW_2018 --model_name ComplEx \
--data_files entities.tsv relations.tsv train.tsv valid.tsv test.tsv --format udd_hrt \
--max_step 50000 --batch_size 1000 --batch_size_eval 16 --eval_percent 20 \
--neg_sample_size 200 --neg_sample_size_eval 16 --neg_deg_sample \
--hidden_dim 400 --gamma 19.9 --lr 0.25 --regularization_coef=1e-9 -adv \
--gpu 0 1 --async_update --force_sync_interval 1000 --log_interval 1000 \
--test

zheng-da commented 4 years ago

It depends on how many edges you have. I don't think you need that many edges for validation and testing, especially since you are thinking of using only 20% of them for evaluation.

If your goal is to train node embeddings and relation embeddings, I would suggest a split like 90% for training, 5% for validation, and 5% for testing for hyperparameter tuning, and then use all of the edges with the best hyperparameters to train the KG again from scratch to get the final embeddings.
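
For example, a rough sketch in Python (not part of dgl-ke) of such a shuffle-and-split; all_triples.tsv is just a placeholder name for a file holding all of your canonical triples in head-relation-tail order:

import random

# Read all canonical triples; the filename is a placeholder.
with open("all_triples.tsv") as f:
    triples = f.readlines()

random.seed(42)          # make the split reproducible
random.shuffle(triples)

n = len(triples)
n_train = int(n * 0.90)
n_valid = int(n * 0.05)

# Write 90% / 5% / 5% splits; rows are passed through unchanged.
for name, rows in [("train.tsv", triples[:n_train]),
                   ("valid.tsv", triples[n_train:n_train + n_valid]),
                   ("test.tsv",  triples[n_train + n_valid:])]:
    with open(name, "w") as out:
        out.writelines(rows)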

Another thing is that neg_sample_size_eval needs to be sufficiently large. I would suggest something like 10,000 or 100,000.

AlexMRuch commented 4 years ago

Thanks! I'll add in the change for neg_sample_size_eval.

I'll also redo my splits for 90%, 5%, 5% for training, validation, and testing for hyperparameter tuning and then will rerun the model in full to get the full embeddings once my hyperparameters are set.

One problem I noted after running the method with --eval_percent 20 last night is that my output still says the dataset is FB15k instead of results_SXSW_2018:

DGLBACKEND=pytorch dglke_train \
--data_path results_SXSW_2018 \
--data_files entities.tsv relations.tsv train.tsv valid.tsv test.tsv --format udd_hrt \
--model_name ComplEx \
--max_step 100000 --batch_size 1024 --neg_sample_size 256 --neg_deg_sample --log_interval 1000 \
--hidden_dim 400 --gamma 128 --lr 0.1 -adv --regularization_coef 2.00E-06 \
--mix_cpu_gpu --num_proc 6 --num_thread 5 --gpu 0 1 --rel_part --async_update --force_sync_interval 1000 \
--batch_size_eval 1024 --neg_sample_size_eval 10000 --eval_percent 20 \
--test --no_save_emb

For example, my config.json file shows

dataset:"FB15k"
model:"ComplEx"
emb_size:400
max_train_step:50000
batch_size:1000
neg_sample_size:200
lr:0.25
gamma:19.9
double_ent:false
double_rel:false
neg_adversarial_sampling:true
adversarial_temperature:1
regularization_coef:1e-9
regularization_norm:3

I know that the correct dataset is being loaded as I confirmed the number of canonical tuples in the printout matches the length of my files for SXSW data, but is there something else I need to do to make sure config.json reports dataset:"results_SXSW_2018"?

Thank you so much for your attention and patience. I'm very grateful! I hope these notes on here help others who may hit similar issues.

AlexMRuch commented 4 years ago

  --eval_percent EVAL_PERCENT
                        Randomly sample some percentage of edges for evaluation.

For --eval_percent above, does "evaluation" in this context mean validation and testing or just validation? If it's just validation, then 1) I presume adding --test will run through and test all edges in the whole test.tsv file, and 2) I presume including it is pointless if --valid isn't in the training command. I could definitely see "eval" here applying to both validation and testing, given the other options with eval in the name that refer to testing; however, if that's the case, how does one separately adjust validation and testing params (e.g., only evaluate a random 20% of edges during validation but test all edges during testing)?

zheng-da commented 4 years ago

eval_percent applies to both validation and test. Actually, every option with eval in its name applies to both validation and test. The only exception is eval_interval, because test only runs once.

zheng-da commented 4 years ago

Regarding your previous question about the dataset name saved in the config file: it's indeed a bug. We'll fix it as soon as possible.

AlexMRuch commented 4 years ago

Thank you so much for this feedback!

After running with the following configuration

DGLBACKEND=pytorch dglke_train \
--data_path results_SXSW_2018 \
--data_files entities.tsv relations.tsv train.tsv valid.tsv test.tsv --format udd_hrt \
--model_name ComplEx \
--max_step 100000 --batch_size 1024 --neg_sample_size 256 --neg_deg_sample --log_interval 1000 \
--hidden_dim 512 --gamma 128 --lr 0.085 -adv --regularization_coef 2.00E-06 \
--mix_cpu_gpu --num_proc 6 --num_thread 5 --gpu 0 1 --rel_part --async_update --force_sync_interval 1000 \
--batch_size_eval 1024 --neg_sample_size_eval 10000 --eval_percent 20 \
--test --no_save_emb

I am able to achieve much more reasonable results:

training takes 4096.045665502548 seconds
-------------- Test result --------------
Test average MRR : 0.8105859111552105
Test average MR : 105.64514485059483
Test average HITS@1 : 0.741847968505899
Test average HITS@3 : 0.8631606615257642
Test average HITS@10 : 0.9307773698882217
-----------------------------------------
testing takes 16359.690 seconds
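
(For anyone else following along: as I understand it, these are rank-based metrics. A tiny illustration with made-up ranks, not dgl-ke's actual code, of how MR, MRR, and HITS@k relate to the rank of each true triple among its scored candidates:)

ranks = [1, 3, 2, 15, 1, 120]   # hypothetical ranks of the true triples

mr = sum(ranks) / len(ranks)                                # Mean Rank (lower is better)
mrr = sum(1.0 / r for r in ranks) / len(ranks)              # Mean Reciprocal Rank (higher is better)
hits_at_10 = sum(1 for r in ranks if r <= 10) / len(ranks)  # HITS@10 (higher is better)

print(mr, mrr, hits_at_10)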

I am still hyperparameter tuning; however, I am wondering if I set my parameters incorrectly, as testing now takes quite some time (and I stopped including --valid for the same reason). I think this is likely because of increasing --neg_sample_size_eval, but I also noticed that while testing, my GPUs seem pretty inactive and are using much less memory:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.87.00    Driver Version: 418.87.00    CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  On   | 00000000:0A:00.0 Off |                  N/A |
|  0%   26C    P5    17W / 250W |    781MiB / 11178MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 108...  On   | 00000000:41:00.0  On |                  N/A |
|  0%   27C    P2    59W / 250W |   1064MiB / 11173MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

Is testing done on the CPUs? I have two CPU processes each running ~500% (I have 32 cores).

Any other recommendations are more than welcome! I'll note that increasing --max_step to 200000 didn't improve results; however, that could also be because my learning rate is too high. I may also change --regularization_coef from 2.00E-06 to 1.00E-09.

Thank you again for this feedback. If you would like me to write up anything to help with your documentation for testing on non-built-in datasets, I am more than happy to contribute to help the project and others who use the library!

zheng-da commented 4 years ago

I think the main reason is that when you run testing, it generates a set of filtered negative edges (removing the positive edges from the randomly generated negative edges). This operation is computationally expensive and runs on the CPU. Do you care about evaluating on a set of filtered negative edges or unfiltered negative edges? I see some people use unfiltered negative edges for evaluation. The performance numbers will be much lower.
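
Conceptually, the filtered setting does something like the following (a simplified Python sketch, not the actual dgl-ke code; the function names are illustrative). The rejection of known positives is the CPU-side work that dominates your test time:

import random

def sample_filtered_tails(h, r, all_entities, all_triples, k):
    # Corrupt the tail, but reject any candidate that forms a real edge.
    negatives = []
    while len(negatives) < k:
        cand = random.choice(all_entities)
        if (h, r, cand) not in all_triples:   # membership check against every positive triple
            negatives.append(cand)
    return negatives

def sample_unfiltered_tails(all_entities, k):
    # With --no_eval_filter the rejection step is skipped, so a few sampled
    # "negatives" can in fact be positive edges.
    return random.choices(all_entities, k=k)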

AlexMRuch commented 4 years ago

Ah, I see. That makes a lot of sense. Thank you for that! In that case, if I don't care about filtering negative edges, which part of my command configuration should I remove, the --neg_deg_sample option?

zheng-da commented 4 years ago

I forgot to mention that you can turn on --no_eval_filter to disable filtering for evaluation (both validation and testing).

AlexMRuch commented 4 years ago

Okay, great. Thanks! It sounds like I should use --no_eval_filter for hyperparameter tuning and then remove it when I want to do my final training on the full set of canonical edges to embed all nodes and edges.

To confirm, does --no_eval_filter (Disable filter positive edges from randomly constructed negative edges for evaluation) mean that during evaluation (validation and testing) the sampled negative edges may include positive edges as well? And, if so, if a positive edge is included in the negative sample and the model correctly predicts that the source entity and relational edge do connect to the destination entity, does that still contribute positively to the model's evaluation accuracy (i.e., accuracy will increase)? Turning this option on would just mean that if you have --neg_sample_size_eval 1000 in the runtime command, then the negative samples may include some positive samples, and the true number of negative samples may only be something like 983.

Thanks again! The fact that filtering happens on the CPUs makes a lot of sense. I'll add that in on my next run!

zheng-da commented 4 years ago

yes, your understanding is correct.

AlexMRuch commented 4 years ago

Perfect! Thanks so much for all this awesome help! I really appreciate it and hope that some of my questions were helpful for identifying where and how the documentation may be improved to help others!

zheng-da commented 4 years ago

yes, this is exactly what I'm thinking. Your questions and feedback are very valuable. They're a great help for us in improving our documentation.

AlexMRuch commented 4 years ago

One last question –

So adding --no_eval_filter dramatically cut my testing time from 4.5 hours to just 7 minutes (thank you, thank you, thank you). However, I noticed that my accuracies for HITS@1 and HITS@3 dropped considerably from

training takes 4098.758062839508 seconds
-------------- Test result --------------
Test average MRR : 0.8789477907435576
Test average MR : 134.13138276912403
Test average HITS@1 : 0.8269916438423608
Test average HITS@3 : 0.9245780584202233
Test average HITS@10 : 0.9606954600263039
-----------------------------------------
testing takes 16199.487 seconds

to

training takes 4101.1030061244965 seconds
-------------- Test result --------------
Test average MRR : 0.7296787114483176
Test average MR : 138.02974488952307
Test average HITS@1 : 0.6100550059742301
Test average HITS@3 : 0.821572043956075
Test average HITS@10 : 0.9515992954802492
-----------------------------------------
testing takes 415.771 seconds

Is this because my model without no_eval_filter is more accurate in predicting that source nodes do not have edges to false destination nodes (i.e., negative samples) and that the model is in reality less accurate in predicting that source nodes do have edges to true destination nodes (i.e., positive samples)? Or is this because having some positive samples included in the negative samples is throwing off my model in some other way?

The only thing I changed from the top model to the bottom is including no_eval_filter.

Thanks again!

classicsong commented 4 years ago

The performance downgrade may be due to some positive samples being included in the negative samples. Another thing you can try to speed up evaluation is to use multi-process eval.

AlexMRuch commented 4 years ago

The performance downgrade may be due to some positive samples being included in the negative samples.

Ah, I think I'm not exactly understanding how --no_eval_filter works. Does including --no_eval_filter and having some positive samples included in the negative samples mean that the model will treat those truly positive samples as if they were negative (i.e., the evaluation includes false negatives)? In that case, won't including --no_eval_filter always degrade the measured performance and decrease accuracy (in exchange for faster evaluation)?

Another thing you can try to speed up evaluation is to use multi-process eval.

How do I do this? I already have --mix_cpu_gpu --num_proc 6 --num_thread 5 --gpu 0 1 --rel_part --async_update --force_sync_interval 1000 in my config and don't see anything else about multiprocessing in the --help output.

Thanks so much for your feedback!

classicsong commented 4 years ago

The performance downgrade may be due to some positive samples being included in the negative samples.

Ah, I think I'm not exactly understanding how --no_eval_filter works. Does including --no_eval_filter and having some positive samples included in the negative samples mean that the model will treat those truly positive samples as if they were negative (i.e., the evaluation includes false negatives)? In that case, won't including --no_eval_filter always degrade the measured performance and decrease accuracy (in exchange for faster evaluation)?

For datasets like full Freebase, the chance of sampling a false negative pair is low, so you can use no_eval_filter to trade some test accuracy for evaluation speed.
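
As a back-of-the-envelope check (a hypothetical Python calculation with a made-up per-pair count, assuming uniform tail corruption; the real overlap can be larger for skewed graphs):

num_entities = 10_469_672        # from the entities.tsv line count above
neg_sample_size_eval = 10_000
true_tails_per_pair = 100        # made-up average number of true tails per (head, relation)

# Expected number of positives that slip into one set of sampled negatives.
expected_positives = neg_sample_size_eval * true_tails_per_pair / num_entities
print(expected_positives)        # roughly 0.1 with these numbers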

Another thing you can try to speed up evaluation is to use multi-process eval.

How do I do this? I already have --mix_cpu_gpu --num_proc 6 --num_thread 5 --gpu 0 1 --rel_part --async_update --force_sync_interval 1000 in my config and don't see anything else about multiprocessing in the --help output.

--num_proc is how many processes you are using for both training and evaluation. Since the filtered evaluation takes 4.5 hours, you can use --no_eval_filter while you are tuning the model. After the model is set, you can use dglke_eval to evaluate the model. Here is the guideline: https://aws-dglke.readthedocs.io/en/latest/hyper_param.html#evaluation-on-pre-trained-embeddings

Thanks so much for your feedback!

AlexMRuch commented 4 years ago

Gotcha, using --no_eval_filter for hyperparameter tuning and then dropping it for the final evaluation once I find the best hyperparameters sounds like a good plan. Thanks!