genixpro / kwola

An AI user that finds bugs in your software.
https://kwola.io/
MIT License

Rare GPU Error #2

Open genixpro opened 4 years ago

genixpro commented 4 years ago

Traceback (most recent call last):
  File "/home/bradley/venv/lib64/python3.7/site-packages/kwola/tasks/RunTrainingStep.py", line 646, in runTrainingStep
    results = agent.learnFromBatches(batches)
  File "/home/bradley/venv/lib64/python3.7/site-packages/kwola/components/agents/DeepLearningAgent.py", line 1499, in learnFromBatches
    "computeRewards": True
  File "/home/bradley/venv/lib64/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/bradley/venv/lib64/python3.7/site-packages/torch/nn/parallel/distributed.py", line 442, in forward
    self._sync_params()
  File "/home/bradley/venv/lib64/python3.7/site-packages/torch/nn/parallel/distributed.py", line 515, in _sync_params
    self.broadcast_bucket_size)
  File "/home/bradley/venv/lib64/python3.7/site-packages/torch/nn/parallel/distributed.py", line 485, in _distributed_broadcast_coalesced
    dist._broadcast_coalesced(self.process_group, tensors, buffer_size)
RuntimeError: [/pytorch/third_party/gloo/gloo/transport/tcp/unbound_buffer.cc:84] Timed out waiting 1800000ms for recv operation to complete