Stanford-ILIAD / Confidence-Aware-Imitation-Learning


The reward signal for AIRL is not correct. #1

Closed Altriaex closed 1 year ago

Altriaex commented 1 year ago

Hi, I noticed that the implementation of AIRL is not correct. You happen to use the reward signal for GAIL here.

https://github.com/Stanford-ILIAD/Confidence-Aware-Imitation-Learning/blob/1d8af0e4ab87a025885133a2384d5a937329b2f5/cail/network/disc.py#L203

It should be F.logsigmoid(logits)-F.logsigmoid(-logits), or simply logits.
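For reference, here is a minimal sketch of the two reward variants computed from the discriminator logits. The function names are illustrative, not the repo's actual API; the only assumption is that the linked line returns the GAIL-style reward -log(1 - D), i.e. softplus of the logits.

```python
import torch
import torch.nn.functional as F

def gail_style_reward(logits: torch.Tensor) -> torch.Tensor:
    # -log(1 - D(s, a)) = softplus(logits); what the linked line appears to compute
    return -F.logsigmoid(-logits)

def airl_reward(logits: torch.Tensor) -> torch.Tensor:
    # log D(s, a) - log(1 - D(s, a)); for a logit-parameterized discriminator
    # this simplifies to the logits themselves
    return F.logsigmoid(logits) - F.logsigmoid(-logits)

logits = torch.tensor([-2.0, 0.0, 2.0])
print(gail_style_reward(logits))  # softplus: [0.1269, 0.6931, 2.1269]
print(airl_reward(logits))        # equal to the logits: [-2., 0., 2.]
```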

syzhang092218-source commented 1 year ago

Hi, thanks for pointing this out! Yes, you are right. We are re-running all the experiments with the correct AIRL reward now and will update the code once they finish.

syzhang092218-source commented 1 year ago

Thank you for pointing this out. Our AIRL implementation is based on the repo https://github.com/toshikwa/gail-airl-ppo.pytorch/. We believe that repo is well tuned around its current reward function, so if we changed it the performance would drop significantly. With the current reward function everything works fine: it is positively correlated with the logits, so this should not cause problems. Changing it would require a large amount of additional fine-tuning.
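A quick check of the monotonicity claim (not from the original thread): the GAIL-style reward softplus(logits) is strictly increasing in the logits, so it ranks state-action pairs the same way the corrected AIRL reward (the logits themselves) does.

```python
import torch
import torch.nn.functional as F

logits = torch.linspace(-5.0, 5.0, steps=11)
gail_style = F.softplus(logits)  # reward currently used in the repo
airl = logits                    # corrected AIRL reward
# both are increasing in the logits, so their ordering over samples agrees
assert torch.equal(torch.argsort(gail_style), torch.argsort(airl))
```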

Altriaex commented 1 year ago

Thanks. Since this can cause confusion for readers, it would be nice if you could document this point somewhere, e.g. in the paper uploaded to arXiv.