I see what is going on. The memory error happens in evaluation. I think you are using all nodes to generate negative edges, which will consume a lot of memory. My suggestion is to use --neg_sample_size_eval to limit the number of negative edges you want to use for evaluation.
Could you tell me how many nodes are in your knowledge graph?
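As a minimal sketch of what that flag change could look like (the data path, file names, and values below are placeholders rather than a tuned configuration), the idea is simply to cap the number of evaluation negatives:
DGLBACKEND=pytorch dglke_train \
    --data_path my_kg --format udd_hrt \
    --data_files entities.tsv relations.tsv train.tsv valid.tsv test.tsv \
    --model_name ComplEx \
    --batch_size_eval 16 --neg_sample_size_eval 10000
(with the rest of the training flags kept unchanged)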
Ah, yes, I just saw that in the --help output you shared and was going to put that in.
amruch@wit:~/graphika/kg/results_SXSW_2018$ wc -l entities.tsv
10469672 entities.tsv
amruch@wit:~/graphika/kg/results_SXSW_2018$ wc -l relations.tsv
16 relations.tsv
^^^ That's how many unique entities and relations I have (but I sample down entities that don't appear at least 10 times, which knocks that down by about a million).
Right now I have --batch_size_eval 16 and will have --neg_sample_size_eval 16 (but I haven't run it with --neg_sample_size_eval yet), and I'll probably have to do hyperparameter tuning for those numbers. If you have any suggestions, I'm more than open to them.
Thanks again!
It sounds like --eval_percent 20 would be useful here too (unless that's too low). If that's the case, does that mean I can make my training set larger and cut the size of the set that's used for evaluation (is that the validation or test set)? I originally did 80% for training, 10% for validation, and 10% for testing.
Here's what I have now:
DGLBACKEND=pytorch dglke_train \
--data_path results_SXSW_2018 --model_name ComplEx \
--data_files entities.tsv relations.tsv train.tsv valid.tsv test.tsv --format udd_hrt \
--max_step 50000 --batch_size 1000 --batch_size_eval 16 --eval_percent 20 \
--neg_sample_size 200 --neg_sample_size_eval 16 --neg_deg_sample \
--hidden_dim 400 --gamma 19.9 --lr 0.25 --regularization_coef=1e-9 -adv \
--gpu 0 1 --async_update --force_sync_interval 1000 --log_interval 1000 \
--test
It depends on how many edges you have. I don't think you need so many edges for validation and testing, especially since you are thinking of using only 20% of them for evaluation.
If your goal is to train node embeddings and relation embeddings, I would suggest you use some split (90% for training, 5% for validation, and 5% for testing) for hyperparameter tuning, and then use all of the edges with the best hyperparameters to train the KG again from scratch to produce the final embeddings.
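For illustration only, assuming the canonical triples live in a single file such as triples.tsv (one head, relation, tail per line; the file name is hypothetical), a 90/5/5 split could be produced roughly like this:
# Shuffle all canonical triples, then slice off 90% / 5% / 5%.
shuf triples.tsv > shuffled.tsv
total=$(wc -l < shuffled.tsv)
ntrain=$((total * 90 / 100))
nvalid=$((total * 5 / 100))
head -n "$ntrain" shuffled.tsv > train.tsv
tail -n +"$((ntrain + 1))" shuffled.tsv | head -n "$nvalid" > valid.tsv
tail -n +"$((ntrain + nvalid + 1))" shuffled.tsv > test.tsv
After hyperparameter tuning, the final run would then point --data_files at a train.tsv containing all of the edges.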
Another thing is that neg_sample_size_eval needs to be sufficiently large. I would suggest something like 10,000 or 100,000.
Thanks! I'll add in the change for neg_sample_size_eval.
I'll also redo my splits for 90%, 5%, 5% for training, validation, and testing for hyperparameter tuning and then will rerun the model in full to get the full embeddings once my hyperparameters are set.
One problem I noted after running the method with --eval_percent 20 last night is that my output still says the dataset is FB15k instead of results_SXSW_2018:
DGLBACKEND=pytorch dglke_train \
--data_path results_SXSW_2018 \
--data_files entities.tsv relations.tsv train.tsv valid.tsv test.tsv --format udd_hrt \
--model_name ComplEx \
--max_step 100000 --batch_size 1024 --neg_sample_size 256 --neg_deg_sample --log_interval 1000 \
--hidden_dim 400 --gamma 128 --lr 0.1 -adv --regularization_coef 2.00E-06 \
--mix_cpu_gpu --num_proc 6 --num_thread 5 --gpu 0 1 --rel_part --async_update --force_sync_interval 1000 \
--batch_size_eval 1024 --neg_sample_size_eval 10000 --eval_percent 20 \
--test --no_save_emb
For example, my config.json file shows
dataset:"FB15k"
model:"ComplEx"
emb_size:400
max_train_step:50000
batch_size:1000
neg_sample_size:200
lr:0.25
gamma:19.9
double_ent:false
double_rel:false
neg_adversarial_sampling:true
adversarial_temperature:1
regularization_coef:1e-9
regularization_norm:3
I know that the correct dataset is being loaded, as I confirmed the number of canonical tuples in the printout matches the length of my files for the SXSW data, but is there something else I need to do to make sure config.json reports dataset:"results_SXSW_2018"?
Thank you so much for your attention and patience. I'm very grateful! I hope these notes help others who may hit similar issues.
--eval_percent EVAL_PERCENT
Randomly sample some percentage of edges for evaluation.
For --eval_percent above, is "evaluation" in this context validation and testing, or just validation? If it's just validation, then 1) I presume adding --test will run through and test all edges in the whole test.tsv file, and 2) I presume including it is pointless if --valid isn't in the run config for the training command. I could definitely see "eval" here applying to both validation and testing, given the other options with eval in the name that refer to testing; however, if that's the case, how does one separately adjust validation and testing params (e.g., only evaluate a random 20% of edges during validation but all edges during testing)?
eval_percent applies to both validation and test. Actually, every option with eval applies to both validation and test. The only exception is eval_interval, because test only runs once.
For your previous question about the dataset name saved in the config file: it's indeed a bug. We'll fix it as soon as possible.
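Until the fix lands, one purely illustrative workaround (not a dgl-ke feature; the checkpoint path below is an assumed example of where dglke_train writes config.json) would be to patch the field by hand with jq:
# Rewrite the dataset field in the saved config; adjust the path to your checkpoint directory.
jq '.dataset = "results_SXSW_2018"' ckpts/ComplEx_FB15k_0/config.json > config.json.tmp \
    && mv config.json.tmp ckpts/ComplEx_FB15k_0/config.json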
Thank you so much for this feedback!
After running with the following configuration
DGLBACKEND=pytorch dglke_train \
--data_path results_SXSW_2018 \
--data_files entities.tsv relations.tsv train.tsv valid.tsv test.tsv --format udd_hrt \
--model_name ComplEx \
--max_step 100000 --batch_size 1024 --neg_sample_size 256 --neg_deg_sample --log_interval 1000 \
--hidden_dim 512 --gamma 128 --lr 0.085 -adv --regularization_coef 2.00E-06 \
--mix_cpu_gpu --num_proc 6 --num_thread 5 --gpu 0 1 --rel_part --async_update --force_sync_interval 1000 \
--batch_size_eval 1024 --neg_sample_size_eval 10000 --eval_percent 20 \
--test --no_save_emb
I am able to achieve much more reasonable results:
training takes 4096.045665502548 seconds
-------------- Test result --------------
Test average MRR : 0.8105859111552105
Test average MR : 105.64514485059483
Test average HITS@1 : 0.741847968505899
Test average HITS@3 : 0.8631606615257642
Test average HITS@10 : 0.9307773698882217
-----------------------------------------
testing takes 16359.690 seconds
I am still hyperparameter tuning, however, and I am wondering if I set my parameters incorrectly, as testing now takes quite some time (I stopped including --valid for the same reason). I think this is likely because of increasing --neg_sample_size_eval, but I also noticed that while testing, my GPUs seem pretty inactive and are using much less memory:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.87.00 Driver Version: 418.87.00 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 108... On | 00000000:0A:00.0 Off | N/A |
| 0% 26C P5 17W / 250W | 781MiB / 11178MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce GTX 108... On | 00000000:41:00.0 On | N/A |
| 0% 27C P2 59W / 250W | 1064MiB / 11173MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
Is testing done on the CPUs? I have two CPU processes each running ~500% (I have 32 cores).
Any other recommendations are more than welcome! I'll note that increasing --max_step to 200000 didn't improve results; however, that could also be because my learning rate is too high. I may also change --regularization_coef from 2.00E-06 to 1.00E-09.
Thank you again for this feedback. If you would like me to write up anything to help with your documentation for testing on non-built-in datasets, I am more than happy to contribute to help the project and others who use the library!
I think the main reason is that when you run testing, it generates a set of filtered negative edges (removing the positive edges from the randomly generated negative edges). This operation is computationally expensive and runs on CPU. Do you care about evaluating on a set of filtered negative edges or unfiltered negative edges? I see some people use unfiltered negative edges for evaluation; the performance numbers will be much lower.
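To illustrate the idea only (this is not how dgl-ke implements it, and the file names are hypothetical): filtering means dropping any randomly generated corruption that happens to be a real triple, e.g.
# Hypothetical files with one head<TAB>relation<TAB>tail triple per line.
sort -u all_positive_triples.tsv > positives.sorted
sort candidate_negatives.tsv > candidates.sorted
# Keep only candidates that do NOT appear among the known positives.
comm -23 candidates.sorted positives.sorted > filtered_negatives.tsv
Checking every sampled corruption against all known edges for every test triple is what makes filtered evaluation slow, and it happens on the CPU side.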
Ah, I see. That makes a lot of sense. Thank you for that! In that case, if I don't care about filtering negative edges, which part of my command configuration should I remove, the --neg_deg_sample option?
I forgot to mention that you can turn on --no_eval_filter to disable filtering for evaluation (both validation and testing).
Okay, great. Thanks! It sounds like I should use --no_eval_filter for hyperparameter tuning and then remove it when I want to do my final training on the full set of canonical edges to embed all nodes and edges.
To confirm, does --no_eval_filter (disable filtering positive edges out of the randomly constructed negative edges for evaluation) mean that during evaluation (validation and testing) the negative edges sampled may include positive edges as well? And, if so, if a positive edge is included in the negative edge sample and the model correctly predicts that the source entity and relational edge do connect to the destination entity, then that should still contribute positively to the model's evaluation accuracy (i.e., accuracy will increase)? Turning this option on just means that if you have --neg_sample_size_eval 1000 in the runtime command, then it's possible the negative samples may include some positive samples, and the true number of negative samples may only be something like 983.
Thanks again! The fact that filtering happens on the CPUs makes a lot of sense. I'll add that in on my next run!
Yes, your understanding is correct.
Perfect! Thanks so much for all this awesome help! I really appreciate it and hope that some of my questions were helpful for identifying where and how the documentation may be improved to help others!
Yes, this is exactly what I'm thinking. Your questions and feedback are very valuable and a great help for us in improving our documentation.
One last question –
So adding --no_eval_filter dramatically cut my testing time from 4.5 hours to just 7 minutes (thank you, thank you, thank you). However, I noticed that my accuracies for HITS@1 and HITS@3 dropped considerably from
training takes 4098.758062839508 seconds
-------------- Test result --------------
Test average MRR : 0.8789477907435576
Test average MR : 134.13138276912403
Test average HITS@1 : 0.8269916438423608
Test average HITS@3 : 0.9245780584202233
Test average HITS@10 : 0.9606954600263039
-----------------------------------------
testing takes 16199.487 seconds
to
training takes 4101.1030061244965 seconds
-------------- Test result --------------
Test average MRR : 0.7296787114483176
Test average MR : 138.02974488952307
Test average HITS@1 : 0.6100550059742301
Test average HITS@3 : 0.821572043956075
Test average HITS@10 : 0.9515992954802492
-----------------------------------------
testing takes 415.771 seconds
Is this because my model without no_eval_filter is more accurate in predicting that source nodes do not have edges to false destination nodes (i.e., negative samples), and in reality less accurate in predicting that source nodes do have edges to true destination nodes (i.e., positive samples)? Or is this because having some positive samples included in the negative samples is throwing off my model in some other way?
The only thing I changed from the top model to the bottom is including no_eval_filter.
Thanks again!
The performance downgrade may be due to some positive samples being included in the negative samples. Another thing you can try to speed up evaluation is to use multi-process eval.
The performance downgrade may be due to some positive samples being included in the negative samples.
Ah, I think I'm not exactly understanding how --no_eval_filter works. Does including --no_eval_filter and having some positive samples included in the negative samples mean that the model will treat those truly positive samples as if they should be negative (i.e., it includes false negatives in the evaluation)? In that case, won't including --no_eval_filter always degrade the measured model performance and decrease accuracy (in exchange for faster evaluation)?
Another thing you can try to speed up evaluation is to use multi-process eval.
How do I do this? I already have --mix_cpu_gpu --num_proc 6 --num_thread 5 --gpu 0 1 --rel_part --async_update --force_sync_interval 1000 in my config and don't see anything else about multiprocessing in the --help output.
Thanks so much for your feedback!
The performance downgrade may be due to some positive samples being included in the negative samples.
Ah, I think I'm not exactly understanding how --no_eval_filter works. Does including --no_eval_filter and having some positive samples included in the negative samples mean that the model will treat those truly positive samples as if they should be negative (i.e., it includes false negatives in the evaluation)? In that case, won't including --no_eval_filter always degrade the measured model performance and decrease accuracy (in exchange for faster evaluation)?
For a dataset like full Freebase, the chance of sampling a false negative pair is low, so you can use no_eval_filter to trade a bit of test accuracy for evaluation speed.
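As a rough back-of-envelope check (the average-degree figure here is an assumption, not something from this thread): with roughly 10.5M entities, the chance that a single random corruption hits a true edge is about degree / num_entities.
# Assume, say, ~10 true tails per (head, relation) on average.
echo "scale=8; 10 / 10469672" | bc    # ~0.00000095, i.e. roughly one in a million per sample
If a graph is much denser than that assumption, the gap between filtered and unfiltered numbers will be correspondingly larger.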
Another thing you can try to speed up evaluation is to use multi-process eval.
How do I do this? I already have --mix_cpu_gpu --num_proc 6 --num_thread 5 --gpu 0 1 --rel_part --async_update --force_sync_interval 1000 in my config and don't see anything else about multiprocessing in the --help output.
--num_proc is how many processes you are using for both training and evaluation. As the filtered evaluation takes 4.5 hours, you can use --no_eval_filter while you are tuning the model. After the model is set, you can use dglke_eval to evaluate the model. Here is the guideline: https://aws-dglke.readthedocs.io/en/latest/hyper_param.html#evaluation-on-pre-trained-embeddings
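A sketch of what that final evaluation call might look like for a user-defined dataset, assuming dglke_eval accepts the same dataset flags as dglke_train plus a --model_path pointing at the saved checkpoint (check the linked guideline for the exact options; the checkpoint path below is illustrative):
# Illustrative checkpoint path; adjust to wherever dglke_train saved the model.
DGLBACKEND=pytorch dglke_eval \
    --data_path results_SXSW_2018 --format udd_hrt \
    --data_files entities.tsv relations.tsv train.tsv valid.tsv test.tsv \
    --model_name ComplEx --hidden_dim 512 --gamma 128 \
    --batch_size_eval 1024 --neg_sample_size_eval 10000 \
    --gpu 0 1 --model_path ckpts/ComplEx_FB15k_0/
Note that the training run would need to save embeddings (i.e., drop --no_save_emb) for there to be a checkpoint to evaluate.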
Thanks so much for your feedback!
Gotcha, that sounds like a good plan: use --no_eval_filter for hyperparameter tuning and then drop it for the final evaluation once I find the best hyperparameters. Thanks!
It doesn't seem like there are a lot of options for limiting memory consumption in dgl-ke at the moment, so I was wondering if you have any suggestions for my problem. Presently, my model is running out of RAM at
The above was run with
And has the following number of canonical tuples
My machine has two 1080 Ti GPUs and 128GB of RAM. So this pretty much used up all the RAM right away, which is odd because the graphvite run on this knowledge graph finished fine (but took ~8 hours).