adik993 / ppo-pytorch

Proximal Policy Optimization (PPO) with Intrinsic Curiosity Module (ICM)

Do you also have an LSTM implementation? #1

Open Niels-Sch opened 5 years ago

Niels-Sch commented 5 years ago

I really love this implementation, and I see that LSTM is still in the TODO. Have you made any progress on this in the last two months or should I just do it myself?

adik993 commented 5 years ago

Hi, I'm glad someone found it useful. Unfortunately, I haven't had time to implement it yet. I definitely will one day, but I'm not sure when I'll find the time for it.

Niels-Sch commented 5 years ago

I just finished implementing it. It's still a massive mess though with lots of hackery so I won't bother you with it, but I might clean it up and let you know if you'd like :)

I really like how clear every function is in your code. You make me want to improve my own coding.

adik993 commented 5 years ago

Heh, everything emerges from mess :) Yes, sure I'd be happy to see your take on it, it's always nice to have some reference during coding, especially with ML, where the devil is in the details.

Niels-Sch commented 5 years ago

I will :) I'm cleaning it up while figuring out how to connect the models to Java through ONNX/TensorFlow/Keras.

I also changed parts of the algorithm in my version. For example, I'm normalizing the curiosity rewards, and instead of using .exp() on the difference of the logs I'm using an approximation that doesn't explode. I also simplified some of the hyperparameters. I'm getting full solves of Pendulum in a bit under 20 epochs, i.e. all the renders stick straight up in the air. Btw, your TensorBoard logs are super useful! Thanks to them I realised that Tanh activations are preferable in the agent model because they learn more slowly than ReLUs, allowing the ICM to keep up. They're also probably less prone to jumping to conclusions, which makes them more stable.
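
(For reference, a rough sketch of those two tweaks. The names are illustrative only, and clamping the log-difference is just one way to keep the ratio from exploding; it isn't necessarily the exact approximation used here.)

```python
import torch

def normalize_rewards(rewards: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Standardize intrinsic (curiosity) rewards across the batch."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def safe_ratio(new_log_prob: torch.Tensor,
               old_log_prob: torch.Tensor,
               log_clip: float = 10.0) -> torch.Tensor:
    """PPO probability ratio pi_new / pi_old computed from log-probabilities.

    Clamping the log-difference before exponentiating keeps the ratio finite;
    this is only one possible way to avoid the explosion.
    """
    return torch.exp(torch.clamp(new_log_prob - old_log_prob, -log_clip, log_clip))
```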

Also, I'm not using the "recurrent" parameter yet, since it makes saving the hidden states tricky while maintaining compatibility with the run_[...].py files, but I guess I'll figure that out after further cleaning.
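
(For illustration, a minimal sketch of a recurrent policy that carries and detaches its LSTM hidden state between calls; the class and parameter names are hypothetical and not taken from this repo.)

```python
import torch
import torch.nn as nn

class RecurrentPolicy(nn.Module):
    """Toy recurrent policy head that keeps its LSTM state between calls."""

    def __init__(self, n_features: int, n_actions: int, hidden_size: int = 64):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, n_actions)
        self.hidden = None  # (h, c) carried across calls within an episode

    def reset_hidden(self) -> None:
        """Call at episode boundaries so state doesn't leak between episodes."""
        self.hidden = None

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        # obs: [batch, timesteps, n_features]
        out, self.hidden = self.lstm(obs, self.hidden)
        # Detach so backprop doesn't reach into earlier rollouts.
        self.hidden = tuple(h.detach() for h in self.hidden)
        return self.head(out)
```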

tomast95 commented 5 years ago

Hi, I'm also interested in a (stateful) LSTM implementation. Your implementation is very nice (inheritance and not too long files) and super useful. I even learned a new Python thing from you - type annotations in function declarations and their return types. And also how to use TensorBoard... huge thank you! :)
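
(As an aside, a tiny made-up example of the annotation style being referred to:)

```python
from typing import List

def discounted_return(rewards: List[float], gamma: float = 0.99) -> float:
    """Annotated arguments and return type, as in the repo's function signatures."""
    total = 0.0
    for r in reversed(rewards):
        total = r + gamma * total
    return total
```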

So far I have changed some of your code to use a stateful LSTM and removed the multi-env setup to run my env in a single process (felt easier to work with). ICM now runs on each episode separately (instead of your [n_env, batch_size, n_features] it's [batch_size, n_timesteps, n_features]), and later it's concatenated to [n_env_episodes, batch_size, n_timesteps, n_features] as the PPO training input.
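
(A rough sketch of that per-episode reshaping, for reference; the function and variable names are hypothetical and not taken from the repository.)

```python
import torch
from torch.nn.utils.rnn import pad_sequence

def split_into_episodes(obs: torch.Tensor, dones: torch.Tensor) -> torch.Tensor:
    """Split a flat rollout [n_steps, n_features] at done flags, then pad and
    stack the episodes into [n_episodes, max_timesteps, n_features]."""
    boundaries = torch.nonzero(dones, as_tuple=False).flatten() + 1
    episodes, start = [], 0
    for end in boundaries.tolist():
        episodes.append(obs[start:end])
        start = end
    if start < obs.shape[0]:  # trailing, unfinished episode
        episodes.append(obs[start:])
    return pad_sequence(episodes, batch_first=True)
```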

But I have problems with diverging losses and rewards (see my post here). So now I'm curious whether my approach with the LSTM is correct.


The divergence persists even after reworking it to use batches everywhere the models are used (ICM for the reward and loss, PPO for getting the old policies and for training).