Cryolite / kanachan

A Japanese (Riichi) Mahjong AI Framework

V and Q Loss increasing when training with IQL #45

Closed shinkuan closed 1 year ago

shinkuan commented 1 year ago

The loss keeps increasing. Which config option might I have set wrong?

Is the learning rate too high? I was using the default value.

Ran the training using: torchrun --nproc_per_node gpu --standalone -m kanachan.training.iql.train training_data=/workspace/data/annotate4rl_00000.txt num_workers=2 device=cuda encoder=bert_base decoder=double reward_plugin=/workspace/kanachan/kanachan/training/iql/get_reward.py discount_factor=1.0 target_update_rate=0.1 checkpointing=true batch_size=200 snapshot_interval=3000000 expectile=0.9

Reward function (pseudocode, sketched here as Python; state, action, and evaluate_safe_tile are placeholders):

def get_reward(state, action, game_rank):
    reward = 0

    # Reward being closer to a winning hand: fewer shanten => more reward.
    reward += (6 - state.shanten) * 2

    # After the 6th turn (junme), also reward discarding safe tiles.
    if action.is_discard and state.junme > 6:
        reward += evaluate_safe_tile(state, action)  # placeholder helper

    # Winning the hand by ron or tsumo (zimo).
    if action.is_ron or action.is_tsumo:
        reward += 10

    # End-of-game placement reward.
    if game_rank == 1:
        reward += 10
    elif game_rank == 2:
        reward += 3
    elif game_rank == 3:
        reward -= 3
    elif game_rank == 4:
        reward -= 10

    return reward

[Attached images: plots of the V, Q1, and Q2 loss curves]

train.log

KimamanaNeko commented 1 year ago

The goal of model training is to minimize the loss function, so a lower Q1 or Q2 loss value indicates that the model's predictions are closer to the actual outcomes and, therefore, that the model is performing better. I think your reward signal is improperly set; this could cause the model's behavior to deviate from what is expected, leading to an increase in Q loss.
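To be concrete, the Q loss in IQL is essentially a TD regression: Q(s, a) is pushed toward r + discount * V(s'). This is the standard form from the IQL paper; I have not checked kanachan's exact implementation, so treat this as a minimal sketch:

import torch
import torch.nn.functional as F

def compute_q_loss(q_net, v_net, state, action, reward, next_state, discount):
    # TD target: reward plus the discounted value of the next state.
    # The value network is treated as fixed here (no gradient through it).
    with torch.no_grad():
        target = reward + discount * v_net(next_state)
    q = q_net(state, action)
    # If the reward scale is large or keeps growing over an episode,
    # this squared error can stay large or increase even while the
    # rest of training is working as intended.
    return F.mse_loss(q, target)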

shinkuan commented 1 year ago

a lower Q1 or Q2 loss value indicates that the model's predictions are closer to the actual outcomes

What about V loss? What does it represent?

I think your reward signal is improperly set; this could cause the model's behavior to deviate from what is expected, leading to an increase in Q loss.

The reward function I gave tends to give more reward the closer the hand is to winning (fewer shanten). I don't know whether I should base the reward on whether each decision is good, or on how close the hand currently is to winning the game.
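For example, one option would be to reward only the change in shanten between consecutive states (a potential-based shaping term) rather than the absolute shanten count. A rough sketch of that idea (not kanachan's API):

def shaped_reward(prev_shanten, next_shanten, terminal_reward=0.0, discount=1.0):
    # Potential-based shaping with potential phi(s) = -shanten(s):
    # reward progress toward a winning hand instead of absolute closeness.
    shaping = discount * (-next_shanten) - (-prev_shanten)
    return shaping + terminal_reward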

I tried lowering the optimizer's learning rate, and that seems to solve the problem: [attached plots of the Q1, Q2, and V loss curves]

KimamanaNeko commented 1 year ago

V loss and Q loss refer to the losses of the value function and the action-value function, respectively. V loss reflects how accurately the model evaluates state values; it should be as low as possible.
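Concretely, in IQL the value network is trained by expectile regression toward the Q-values; the asymmetry is controlled by the expectile parameter (the expectile=0.9 in your command). This is the textbook formulation rather than necessarily kanachan's exact code, but as a sketch:

import torch

def compute_v_loss(q_net, v_net, state, action, expectile=0.9):
    # Target Q-values; no gradient flows through the Q network here.
    with torch.no_grad():
        q = q_net(state, action)
    diff = q - v_net(state)
    # Asymmetric (expectile) weights: positive and negative errors
    # are penalized with different strengths.
    weight = torch.abs(expectile - (diff < 0).float())
    return (weight * diff.pow(2)).mean()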