MichSchli / RelationPrediction

Implementation of R-GCNs for Relational Link Prediction
MIT License

OOM #8

Open Chen-Cai-OSU opened 5 years ago

Chen-Cai-OSU commented 5 years ago

Hello,

I was wondering how much GPU memory is needed to replicate the results in the paper. I tried all three datasets, and for all of them I run into an OOM issue.

AndRossi commented 5 years ago

I am experiencing the same issue.

I am trying to run the gcn_block configuration on FB15k, but if I keep the configuration as it is, the training won't even start; I get this error:

InvalidArgumentError (see above for traceback): Found Inf or NaN global norm. : Tensor had Inf values
   [[node VerifyFinite/CheckNumerics (defined at /home/nvidia/workspace/dbgroup/andrea/comparative_analysis/models/R-GCN/code/optimization/tensorflow_backend/algorithms.py:67) ]]

I have found online that this may depend on memory allocations, so I have reduced the batch size in gcn_block from 30000 to 300. With this configuration the training does start, but after 2000 iterations it still goes OOM:

2019-06-17 12:06:28.728707: W tensorflow/core/common_runtime/bfc_allocator.cc:271] *****_____*************************************************************************_________________
2019-06-17 12:06:28.728765: W tensorflow/core/framework/op_kernel.cc:1401] OP_REQUIRES failed at gather_op.cc:103 : 
Resource exhausted: OOM when allocating tensor with shape[483142,100,5,5] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc

I am using a Tesla P100-SXM2-16GB, so it is a rather powerful GPU.
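
As a rough sanity check on that error: the single tensor named in the OOM message is already several gigabytes on its own (483,142 matches the number of training triples in FB15k). A back-of-the-envelope calculation in plain Python, with no repo code assumed:

```python
# Size of the tensor reported in the OOM message above:
# shape [483142, 100, 5, 5], dtype float32 (4 bytes per element).
# 483,142 is the number of FB15k training triples; the trailing dimensions
# presumably come from the block-diagonal decomposition, but that reading
# is a guess based on the printed configuration.
elements = 483_142 * 100 * 5 * 5
size_gib = elements * 4 / 1024 ** 3
print(f"{size_gib:.1f} GiB for this single allocation")  # ~4.5 GiB
```

A few buffers of that size, plus gradients and optimizer state, are enough to exhaust a 16 GB card, which is consistent with the reports in this thread.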

rayrayraykk commented 5 years ago

I also get OOM, so I ran the code with graph batch size = 300 and dimension = 200; however, the score I get is far below the target.

        Raw     Filtered
MRR     0.031   0.037
H@1     0.009   0.014
H@3     0.029   0.038
H@10    0.068   0.074
Tested validation score at iteration 12000. Result: 0.0366963831726

Any suggestions so that I can get the expected score? And I wonder why this code goes OOM when evaluating the score (using a V100-16G and a P100-16G).

Lee-zix commented 5 years ago

I also experience the same OOM problem on the FB15K-237 and FB15K datasets with gcn_block (K80-12G), and the result on FB15K-237 with gcn_basis is far from the reported result, even though it is the score on the validation set:

        Raw     Filtered
MRR     0.096   0.151
H@1     0.047   0.091
H@3     0.097   0.16
H@10    0.191   0.268

vamships commented 4 years ago

I encountered the same error when running on a Tesla K80 (12GB), but it ran fine on a Tesla V100-SXM2 (16GB), so I guess hardware must be the cause of this issue. However, I couldn't get the performance numbers shown in the paper. Following are the numbers I got:

        Raw     Filtered
MRR     0.03    0.034
H@1     0.013   0.015
H@3     0.025   0.029
H@10    0.054   0.066
Tested validation score at iteration 10000. Result: 0.03525089858321644
Stopping criterion reached.
Stopping training.

Are there any settings that need to be tweaked to get the expected performance? I used the command: bash run-train.sh /settings/gcn_block.exp

@MichSchli @tkipf

vamships commented 4 years ago

Turns out the issue was caused by a difference in TensorFlow versions. I was able to replicate the performance using TensorFlow 1.4.0. Hope this helps.
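
If anyone wants to guard against this, a trivial check at the top of a run script could look like the sketch below (the constant name is illustrative, not part of the repo), assuming TensorFlow 1.x is importable in the environment:

```python
import tensorflow as tf

# The paper numbers were reportedly reproduced with TensorFlow 1.4.0
# (e.g. the tensorflow-gpu 1.4.0 wheel); warn if the environment has
# drifted to a different release.
EXPECTED_TF_VERSION = "1.4.0"
if tf.__version__ != EXPECTED_TF_VERSION:
    print("Warning: running TensorFlow %s, results above were obtained with %s"
          % (tf.__version__, EXPECTED_TF_VERSION))
```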

pbloem commented 4 years ago

@vamships Could you confirm which configuration you ran on which dataset and what the resulting performance was? We are having some trouble trying to get the code running again.

vamships commented 4 years ago

> @vamships Could you confirm which configuration you ran on which dataset and what the resulting performance was? We are having some trouble trying to get the code running again.

I believe it was on the FB-Toutanova dataset, which is the default with the command bash run-train.sh /settings/gcn_block.exp. The performance was close to what was reported in Table 5 of the paper.

ideafang commented 4 years ago

Hello, I used two RTX 2080 Tis to run the command bash run-train.sh /settings/gcn_block.exp, and it started running. However, when I checked nvidia-smi I found that the volatile GPU utilization was 0% while the CPU usage was 100%. I waited half an hour but nothing changed on the terminal. Is this a normal situation? The terminal output is as follows:

(tf-gnn) user@admin:~/workspace/RelationPrediction$ bash run-train.sh /settings/gcn_block.exp
{'Optimizer': {'EarlyStopping': {'CheckEvery': '2000', 'BurninPhaseDuration': '6000'}, 'MaxGradientNorm': '1', 'Algorithm': {'Name': 'Adam', 'learning_rate': '0.01'}, 'ReportTrainLossEvery': '100'}, 'Evaluation': {'Metric': 'MRR'}, 'General': {'GraphSplitSize': '0.5', 'NegativeSampleRate': '10', 'GraphBatchSize': '30000', 'ExperimentName': 'models/GcnBlock'}, 'Encoder': {'PartiallyRandomInput': 'No', 'DiagonalCoefficients': 'No', 'SkipConnections': 'None', 'DropoutKeepProbability': '0.8', 'StoreEdgeData': 'No', 'Concatenation': 'Yes', 'UseInputTransform': 'Yes', 'Name': 'gcn_basis', 'InternalEncoderDimension': '500', 'RandomInput': 'No', 'NumberOfBasisFunctions': '100', 'NumberOfLayers': '2', 'UseOutputTransform': 'No', 'AddDiagonal': 'No'}, 'Shared': {'CodeDimension': '500'}, 'Decoder': {'RegularizationParameter': '0.01', 'Name': 'bilinear-diag'}}
272115
[<tf.Tensor 'graph_edges:0' shape=(?, 3) dtype=int32>, <tf.Tensor 'Placeholder_1:0' shape=(?, 3) dtype=int32>, <tf.Tensor 'Placeholder:0' shape=(?,) dtype=float32>]
2020-07-31 20:40:49.158103: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2020-07-31 20:40:49.529933: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 0 with properties: 
name: GeForce RTX 2080 Ti major: 7 minor: 5 memoryClockRate(GHz): 1.545
pciBusID: 0000:82:00.0
totalMemory: 10.76GiB freeMemory: 10.60GiB
2020-07-31 20:40:49.718515: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 1 with properties: 
name: GeForce RTX 2080 Ti major: 7 minor: 5 memoryClockRate(GHz): 1.545
pciBusID: 0000:83:00.0
totalMemory: 10.76GiB freeMemory: 10.60GiB
2020-07-31 20:40:49.718681: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Device peer to peer matrix
2020-07-31 20:40:49.718779: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1051] DMA: 0 1 
2020-07-31 20:40:49.718791: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1061] 0:   Y N 
2020-07-31 20:40:49.718798: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1061] 1:   N Y 
2020-07-31 20:40:49.718809: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: GeForce RTX 2080 Ti, pci bus id: 0000:82:00.0, compute capability: 7.5)
2020-07-31 20:40:49.718818: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:1) -> (device: 1, name: GeForce RTX 2080 Ti, pci bus id: 0000:83:00.0, compute capability: 7.5)
SampleTransformer
GradientClipping
Adam
TrainLossReporter
EarlyStopper
ModelSaver
WARNING:tensorflow:From /home/user/anaconda3/envs/tf-gnn/lib/python3.5/site-packages/tensorflow/python/util/tf_should_use.py:107: initialize_all_variables (from tensorflow.python.ops.variables) is deprecated and will be removed after 2017-03-02.
Instructions for updating:
Use `tf.global_variables_initializer` instead.
Initial loss: 1.22622
pbloem commented 4 years ago

@xiaofangdyd We're seeing the same thing in our attempts to get the code running (it's using the GPU, but it's just as slow as CPU, and utilization is 0%). This is clearly not what happened in the original experiment runs. I've tried different cards, and different versions of CUDA, with no difference. Not sure what has changed since the code was first written.
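
One way to narrow this down is to check whether TensorFlow is actually placing ops on the GPU at all. A minimal TF 1.x snippet (standard TensorFlow API, independent of the repo) run in the same environment will log the device chosen for every op:

```python
import tensorflow as tf

# log_device_placement makes TF 1.x print the device assigned to each op.
# If the ops below end up on /device:CPU:0 even though both GPUs appear in
# the startup log, the slowdown is a device-placement problem rather than
# a model problem.
config = tf.ConfigProto(log_device_placement=True)
with tf.Session(config=config) as sess:
    a = tf.constant([1.0, 2.0, 3.0], name="a")
    b = tf.constant([4.0, 5.0, 6.0], name="b")
    print(sess.run(a * b))
```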