TimZaman / dotaclient

distributed RL spaghetti al arabiata

NaN's in loss #28

Closed TimZaman closed 5 years ago

TimZaman commented 5 years ago
$ kubectl logs job11-optimizer-master-0 -p
2019-01-27 16:29:21,969 INFO     main(rmq_host=job11-rmq.default.svc.cluster.local, rmq_port=5672, epochs=4 seq_per_epoch=32, batch_size=8, seq_len=256 learning_rate=0.0001, pretrained_model=None, mq_prefetch_count=4, entropy_coef=0.02)
2019-01-27 16:29:21,970 INFO     init_distribution
2019-01-27 16:29:21,970 WARNING  skipping distribution: world size too small (1)
2019-01-27 16:29:21,983 INFO     Checkpointing to: exp2/job11
2019-01-27 16:29:22,305 INFO     Found a latest model in pretrained dir: exp2/job11/model_000000484.pt
2019-01-27 16:29:22,305 INFO     Downloading: exp2/job11/model_000000484.pt
2019-01-27 16:29:22,431 INFO     Connected to RMQ
2019-01-27 16:29:22,571 INFO     iteration 485/10000
2019-01-27 16:29:25,229 INFO      epoch 1/4
2019-01-27 16:29:30,171 INFO      epoch 2/4
2019-01-27 16:29:34,918 INFO      epoch 3/4
2019-01-27 16:29:39,586 INFO      epoch 4/4
2019-01-27 16:29:44,268 INFO     steps_per_s=421.46, avg_weight_age=1.0, reward_per_sec=0.0202, loss=nan, entropy=nan
Traceback (most recent call last):
  File "optimizer.py", line 737, in <module>
    run_local=args.run_local,
  File "optimizer.py", line 693, in main
    dota_optimizer.run()
  File "optimizer.py", line 495, in run
    self.writer.add_histogram('losses', losses, it)
  File "/root/.local/lib/python3.7/site-packages/tensorboardX/writer.py", line 406, in add_histogram
    histogram(tag, values, bins), global_step, walltime)
  File "/root/.local/lib/python3.7/site-packages/tensorboardX/summary.py", line 146, in histogram
    hist = make_histogram(values.astype(float), bins)
  File "/root/.local/lib/python3.7/site-packages/tensorboardX/summary.py", line 168, in make_histogram
    counts = counts[start:end]
UnboundLocalError: local variable 'start' referenced before assignment
TimZaman commented 5 years ago

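More occurrences, this time from job16 (current container log first, then the previous container's log via -p):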

kubectl logs job16-optimizer-master-0
2019-01-29 12:20:35,046 INFO     main(rmq_host=job16-rmq.default.svc.cluster.local, rmq_port=5672, epochs=4 seq_per_epoch=32, batch_size=8, seq_len=256 learning_rate=1e-05, pretrained_model=exp2/job14/model_000002970.pt, mq_prefetch_count=4, entropy_coef=0.0005)
2019-01-29 12:20:35,047 INFO     init_distribution
2019-01-29 12:20:35,047 WARNING  skipping distribution: world size too small (1)
2019-01-29 12:20:35,061 INFO     Checkpointing to: exp2/job16
2019-01-29 12:20:35,715 INFO     Found a latest model in pretrained dir: exp2/job16/model_000001345.pt
2019-01-29 12:20:35,716 WARNING  Overriding pretrained model by latest model.
2019-01-29 12:20:35,716 INFO     Downloading: exp2/job16/model_000001345.pt
2019-01-29 12:20:35,829 INFO     Connected to RMQ
2019-01-29 12:20:35,966 INFO     iteration 1346/10000
tzaman@Tims-Mac-Pro ~ $ kubectl logs job16-optimizer-master-0 -p
2019-01-29 12:18:39,109 INFO     main(rmq_host=job16-rmq.default.svc.cluster.local, rmq_port=5672, epochs=4 seq_per_epoch=32, batch_size=8, seq_len=256 learning_rate=1e-05, pretrained_model=exp2/job14/model_000002970.pt, mq_prefetch_count=4, entropy_coef=0.0005)
2019-01-29 12:18:39,109 INFO     init_distribution
2019-01-29 12:18:39,109 WARNING  skipping distribution: world size too small (1)
2019-01-29 12:18:39,123 INFO     Checkpointing to: exp2/job16
2019-01-29 12:18:39,678 INFO     Found a latest model in pretrained dir: exp2/job16/model_000001345.pt
2019-01-29 12:18:39,678 WARNING  Overriding pretrained model by latest model.
2019-01-29 12:18:39,678 INFO     Downloading: exp2/job16/model_000001345.pt
2019-01-29 12:18:39,805 INFO     Connected to RMQ
2019-01-29 12:18:39,982 INFO     iteration 1346/10000
2019-01-29 12:18:42,580 INFO      epoch 1/4
2019-01-29 12:18:47,466 INFO      epoch 2/4
2019-01-29 12:18:52,237 INFO      epoch 3/4
2019-01-29 12:18:56,930 INFO      epoch 4/4
2019-01-29 12:19:01,720 INFO     steps_per_s=420.13, avg_weight_age=2.7, reward_per_sec=0.0291, loss=nan, entropy=nan
Traceback (most recent call last):
  File "optimizer.py", line 737, in <module>
    run_local=args.run_local,
  File "optimizer.py", line 693, in main
    dota_optimizer.run()
  File "optimizer.py", line 495, in run
    self.writer.add_histogram('losses', losses, it)
  File "/root/.local/lib/python3.7/site-packages/tensorboardX/writer.py", line 406, in add_histogram
    histogram(tag, values, bins), global_step, walltime)
  File "/root/.local/lib/python3.7/site-packages/tensorboardX/summary.py", line 146, in histogram
    hist = make_histogram(values.astype(float), bins)
  File "/root/.local/lib/python3.7/site-packages/tensorboardX/summary.py", line 168, in make_histogram
    counts = counts[start:end]
UnboundLocalError: local variable 'start' referenced before assignment
TimZaman commented 5 years ago
$ kubectl logs job16-dotaservice-75c7df74f5-4f8j6
Error from server (BadRequest): a container name must be specified for pod job16-dotaservice-75c7df74f5-4f8j6, choose one of: [agent dotaservice]
tzaman@Tims-Mac-Pro ~ $ kubectl logs job16-dotaservice-75c7df74f5-4f8j6 agent
2019-01-30 07:54:23,144 INFO     main(rmq_host=job16-rmq.default.svc.cluster.local, rmq_port=5672)
2019-01-30 07:54:23,159 INFO     setup_model_cb(host=job16-rmq.default.svc.cluster.local, port=5672)
2019-01-30 07:54:23,190 INFO     Received new model: version=0, size=1207838b
2019-01-30 07:54:23,195 INFO     === Starting Game 0.
2019-01-30 07:54:23,195 INFO     Starting game.
2019-01-30 07:54:23,203 INFO     Player 0 using weights version 0
2019-01-30 07:54:23,210 INFO     Player 5 using weights version 0
Traceback (most recent call last):
  File "agent.py", line 718, in main
    await game.play(game_id=game_id)
  File "agent.py", line 641, in play
    action_pb = player.obs_to_action(obs=obs)
  File "agent.py", line 526, in obs_to_action
    world_state=obs,
  File "agent.py", line 479, in select_action
    action_dict = Policy.select_actions(head_prob_dict=head_prob_dict)
  File "/root/dotaclient/policy.py", line 214, in select_actions
    action_dict['enum'] = cls.sample_action(head_prob_dict['enum'])
  File "/root/dotaclient/policy.py", line 207, in sample_action
    return Categorical(probs).sample()
  File "/root/.local/lib/python3.7/site-packages/torch/distributions/categorical.py", line 110, in sample
    sample_2d = torch.multinomial(probs_2d, 1, True)
RuntimeError: invalid argument 2: invalid multinomial distribution (encountering probability entry < 0) at /pytorch/aten/src/TH/generic/THTensorRandom.cpp:298
2019-01-30 07:54:30,099 ERROR    Unclosed connection: Channel('127.0.0.1', 13337, ..., path=None)
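
The agent fails inside the sampler, which reports the problem only as an invalid multinomial distribution. A check placed before sampling can say which entry is bad and why. The following is a sketch under assumptions: `validate_probs` is a hypothetical helper, the probabilities are assumed to be a flat sequence for a single categorical head, and the thread does not confirm whether the bad entries are NaN or truly negative, so both cases are checked.

```python
import math


def validate_probs(probs, tol=1e-6):
    """Raise ValueError if probs is not a valid categorical distribution.

    Checks that every entry is finite and non-negative, and that the
    entries sum to 1 within tol. Returns probs unchanged when valid.
    """
    total = 0.0
    for i, p in enumerate(probs):
        p = float(p)
        if not math.isfinite(p):
            raise ValueError(f"probability {i} is not finite: {p}")
        if p < 0:
            raise ValueError(f"probability {i} is negative: {p}")
        total += p
    if abs(total - 1.0) > tol:
        raise ValueError(f"probabilities sum to {total}, expected 1")
    return probs


# Usage: a valid distribution passes through, a NaN entry is reported.
validate_probs([0.25, 0.75])
try:
    validate_probs([0.5, float('nan')])
except ValueError as err:
    message = str(err)
```

If the model weights received by the agent already contain NaNs, a check like this would fail on the first action, which matches the timing in the log above but is not proof of that cause.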