NVlabs / NVBit

TensorFlow stuck in infinite loop when executed with NVBit. #100

Open JueonPark opened 1 year ago

JueonPark commented 1 year ago

I am using NVBit 1.5.3 to collect a trace of a TensorFlow application. Without NVBit, the application runs fine. With NVBit, however, it gets stuck in autotuning.

I run it with the following command:

LD_PRELOAD=$TRACER_TOOL python run.py --batch_size 1024 --num_examples 2000
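To reproduce the hang without leaving a terminal blocked indefinitely, the same command can be launched under a timeout. The following is only a sketch: `run_with_timeout` is a hypothetical helper name, the 600-second limit is an arbitrary assumption, and the tool path is whatever `$TRACER_TOOL` points to.

```python
import os
import subprocess
import sys


def run_with_timeout(cmd, preload=None, timeout_s=600):
    """Run cmd, optionally with LD_PRELOAD set.

    Returns the exit code, or None if the process was killed after
    timeout_s seconds (which is how a hang would show up).
    """
    env = dict(os.environ)
    if preload:
        # Same effect as prefixing the shell command with LD_PRELOAD=...
        env["LD_PRELOAD"] = preload
    try:
        return subprocess.run(cmd, env=env, timeout=timeout_s).returncode
    except subprocess.TimeoutExpired:
        return None


# Example call (assumes run.py is in the current directory):
# code = run_with_timeout(
#     [sys.executable, "run.py", "--batch_size", "1024", "--num_examples", "2000"],
#     preload=os.environ.get("TRACER_TOOL"),
# )
# print("timed out" if code is None else f"exit code {code}")
```

Running the helper once with `preload=None` and once with the tracer path gives a quick side-by-side check that the hang only appears under NVBit.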

The log output ends as follows:

2022-09-14 03:50:07.595278: I tensorflow/core/common_runtime/executor.cc:813] Synchronous kernel done: 814 step -8366710006080086661 {{node decoder/basic_decoder/decoder/while/body/_10/decoder/basic_decoder/decoder/while/cond/output/_805}} = Merge[N=2, T=DT_FLOAT, _XlaHasReferenceVars=false, _device="/job:localhost/replica:0/task:0/device:GPU:0"](Func/decoder/basic_decoder/decoder/while/body/_10/decoder/basic_decoder/decoder/while/cond/then/_797/input/_855, decoder/basic_decoder/decoder/while/body/_10/decoder/basic_decoder/decoder/while/cond/else/_798/decoder/basic_decoder/decoder/while/cond/TensorArrayV2Read/TensorListGetItem) device: /job:localhost/replica:0/task:0/device:GPU:0
2022-09-14 03:50:12.991962: I tensorflow/core/framework/model.cc:1513] Starting optimization of tunable parameters with HillClimb
2022-09-14 03:50:12.992094: I tensorflow/core/framework/model.cc:1563] Number of tunable parameters: 0
2022-09-14 03:50:12.992137: I tensorflow/core/kernels/data/model_dataset_op.cc:200] Waiting for 20480 ms.
2022-09-14 03:50:33.472384: I tensorflow/core/framework/model.cc:1513] Starting optimization of tunable parameters with HillClimb
2022-09-14 03:50:33.472532: I tensorflow/core/framework/model.cc:1563] Number of tunable parameters: 0
2022-09-14 03:50:33.472568: I tensorflow/core/kernels/data/model_dataset_op.cc:200] Waiting for 40960 ms.
2022-09-14 03:51:14.432780: I tensorflow/core/framework/model.cc:1513] Starting optimization of tunable parameters with HillClimb
2022-09-14 03:51:14.432929: I tensorflow/core/framework/model.cc:1563] Number of tunable parameters: 0
2022-09-14 03:51:14.432963: I tensorflow/core/kernels/data/model_dataset_op.cc:200] Waiting for 60000 ms.
...
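The wait times in the log above double from 20480 ms to 40960 ms and then sit at 60000 ms, and each "Starting optimization" line appears almost exactly one wait period after the previous "Waiting" line. That pattern would be consistent with a background thread that keeps sleeping and waking on schedule while the actual computation is stalled elsewhere, though the log alone does not prove this. A minimal sketch for checking the timing, with `parse_waits` and `expected_wakeups` as hypothetical helper names:

```python
import re
from datetime import datetime, timedelta

# Matches lines such as:
# 2022-09-14 03:50:12.992137: I .../model_dataset_op.cc:200] Waiting for 20480 ms.
WAIT_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{6}): .*Waiting for (\d+) ms\."
)


def parse_waits(log_text):
    """Return a list of (timestamp, wait_ms) for every 'Waiting for N ms.' line."""
    waits = []
    for line in log_text.splitlines():
        m = WAIT_RE.match(line)
        if m:
            ts = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S.%f")
            waits.append((ts, int(m.group(2))))
    return waits


def expected_wakeups(log_text):
    """Return the time at which each logged wait should end."""
    return [ts + timedelta(milliseconds=ms) for ts, ms in parse_waits(log_text)]


# The three 'Waiting' lines copied from the log above.
SAMPLE = """\
2022-09-14 03:50:12.992137: I tensorflow/core/kernels/data/model_dataset_op.cc:200] Waiting for 20480 ms.
2022-09-14 03:50:33.472568: I tensorflow/core/kernels/data/model_dataset_op.cc:200] Waiting for 40960 ms.
2022-09-14 03:51:14.432963: I tensorflow/core/kernels/data/model_dataset_op.cc:200] Waiting for 60000 ms.
"""

for wake in expected_wakeups(SAMPLE):
    print(wake.strftime("%H:%M:%S.%f"))
```

Comparing the printed times with the timestamps of the following "Starting optimization" lines shows whether the logging thread is still on schedule.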

The application is stuck on parameter autotuning. Even with parameter autotuning disabled, the application still hangs and never finishes.
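Since the hang persists with autotuning disabled, it may help to see where the Python threads are actually blocked. One option is the standard-library `faulthandler` module, which can dump every thread's Python stack at a fixed interval. This is a sketch under assumptions: `arm_hang_watchdog` and `disarm_hang_watchdog` are hypothetical helper names, calling them from the top of `run.py` is an assumption, and the dump shows Python frames only (native CUDA or NVBit frames would need a debugger such as gdb).

```python
import faulthandler
import sys


def arm_hang_watchdog(interval_s=120, stream=None):
    """Dump all Python thread stacks every interval_s seconds until disarmed.

    stream must be a real file object with a file descriptor;
    it defaults to sys.stderr.
    """
    if stream is None:
        stream = sys.stderr
    faulthandler.dump_traceback_later(interval_s, repeat=True, file=stream)


def disarm_hang_watchdog():
    """Stop the periodic stack dumps."""
    faulthandler.cancel_dump_traceback_later()


# Hypothetical usage at the top of run.py:
# arm_hang_watchdog(interval_s=120)
# ... build and run the model ...
# disarm_hang_watchdog()
```

If successive dumps show the main thread parked in the same TensorFlow call, that call is a reasonable starting point for narrowing down the interaction with the tracer.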

I am using CUDA 10.1 with cuDNN 7.6.4, on an RTX 2080 Ti GPU with an Intel Xeon Gold CPU.

I apologize for submitting an issue that is specific to TensorFlow. Thank you.