
Random Thoughts on v0.3.0 #24

Closed findmyway closed 4 years ago

findmyway commented 5 years ago

Here I'd like to share some random thoughts on this package, covering the following three aspects:

  1. The existing core components in the current version (v0.3.0)
  2. What is missing to support distributed reinforcement learning algorithms?
  3. The ideal way to do reinforcement learning research.

Feel free to correct me if I misunderstand anything here.

What do we have?

RLSetup

RLSetup is used to organize all the necessary information in the training process. It combines the agent (learner and policy here), the environment, and some parameters (like stoppingcriterion, callbacks, ...) together. Then we can call learn! for training and run! for testing.

Comments:

  1. The concept of RLSetup is very common and useful in software development (a very similar concept is a TestSuite), and it makes the parameters of the callback! function (which I'll describe soon) consistent, because everything we need in a callback is wrapped in an RLSetup. My only concern is that different algorithms may need different kinds of parameters for (distributed) training and testing, and it is a little vague to cover all these cases in a single RLSetup concept. It would be better to move the extra parameters (like stoppingcriterion, callbacks, ...) into the learn! and run! functions and keep only the core components like learner, buffer, and policy in the RLSetup (see the sketch after this list).
  2. stoppingcriterion and callbacks seem to share some similarities. I tried to generalize these two here, although I haven't tested whether there's any performance decrease. Doing so would also let stoppingcriterion hold multiple criteria.
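To make the first point concrete, here is a rough sketch of what the proposed split could look like. This is not working code against the current API; StopAfterEpisode, TotalRewardLogger, env, and the keyword names are all made up for illustration.

# core components stay in the RLSetup
rlsetup = RLSetup(learner, buffer, policy)

# training: stopping criteria and callbacks are passed per call
learn!(rlsetup, env;
       stoppingcriterion = StopAfterEpisode(1000),
       callbacks = [TotalRewardLogger()])

# testing: a different stopping criterion, no training callbacks
run!(rlsetup, env; stoppingcriterion = StopAfterEpisode(10))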

callbacks

callbacks are useful for debugging and statistics. Currently, to define a customized callback, we need to do something like this:

# 1. define a struct
mutable struct ReduceEpsilonPerEpisode
    ϵ0::Float64
    counter::Int64
end
# 2. extend the `callback!` function
function callback!(c::ReduceEpsilonPerEpisode, rlsetup, sraw, a, r, done)
    if done
        if c.counter == 1
            c.ϵ0 = rlsetup.policy.ϵ
        end
        c.counter += 1
        rlsetup.policy.ϵ = c.ϵ0 / c.counter
    end
end

Comments:

  1. I found that sometimes it is a little verbose to define a new struct. For example, to log the loss of each step I had to create an empty struct and print the necessary info in the extended callback! function. I attempted to modify the callbacks a little to turn them into closures here. But sometimes closures are not that efficient (see the discussion at https://github.com/JuliaLang/julia/issues/15276), so there's a tradeoff. (I also noticed that in recent versions of Flux.jl, some closure-based optimisers have been changed to struct-based methods.)
  2. Also, the callback! function can be further simplified with a more general definition callback!(c, rlsetup, sraw, a, r, done) = callback!(c, rlsetup), considering that we don't need sraw, a, r, done in most cases (see the sketch after this list).
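Here is a small sketch of both points (the learner.loss field is an assumption made only for this example):

# (a) a closure-based callback instead of a dedicated struct;
#     `rlsetup.learner.loss` is assumed to exist for the sake of the example
losses = Float64[]
logloss = rlsetup -> push!(losses, rlsetup.learner.loss)

# (b) a generic fallback so that callbacks which ignore the transition
#     only have to implement the two-argument method
callback!(c, rlsetup, sraw, a, r, done) = callback!(c, rlsetup)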

Learner and Policy

The two core functions around a learner are selectaction and update.

And we already have several well-tested learners.
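As a reference point, here is a minimal sketch of that interface with a toy tabular Q-learner. The field and function names are illustrative (written as update! here) and not the exact ones in the package.

mutable struct TabularQLearner
    Q::Matrix{Float64}   # Q[action, state]
    α::Float64           # learning rate
    γ::Float64           # discount factor
end

# exploitation: pick the greedy action in state `s`
selectaction(learner::TabularQLearner, s::Int) = argmax(learner.Q[:, s])

# learning: one-step Q-learning update for the transition (s, a, r, s′)
function update!(learner::TabularQLearner, s, a, r, s′, done)
    target = done ? r : r + learner.γ * maximum(learner.Q[:, s′])
    learner.Q[a, s] += learner.α * (target - learner.Q[a, s])
end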

Comments:

  1. For me, the concept of a learner is not very clear in the package (I mean it is too generic here; maybe we can decompose it into several common components?).
  2. I find that the policy is sometimes included in a learner (an example is deepactorcritic.jl).
  3. We'd better draw a clear line between learners and actors.

Buffer

Here the buffer is used for experience replay. One of the most useful buffers is ArrayStateBuffer. It uses a circular buffer to store experiences.
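The circular-buffer idea, stripped down to its essence (a simplified illustration, not the actual ArrayStateBuffer code):

mutable struct CircularBuffer{T}
    data::Vector{T}
    capacity::Int
    next::Int          # position that will be written next
    full::Bool
end
CircularBuffer{T}(capacity::Int) where {T} =
    CircularBuffer{T}(Vector{T}(undef, capacity), capacity, 1, false)

function Base.push!(b::CircularBuffer, x)
    b.data[b.next] = x                               # overwrite the oldest entry when full
    b.full = b.full || b.next == b.capacity
    b.next = b.next == b.capacity ? 1 : b.next + 1
    return b
end

Base.length(b::CircularBuffer) = b.full ? b.capacity : b.next - 1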

Comments:

  1. I tried to make the buffers more general here, but I'm still not very satisfied with the implementations. Also see the discussions here and here. I'll document this part in detail in the next section.

Traces

To be honest, I haven't looked into the applications of this part. But by reading the source code, I'm wondering if it could be integrated into the concept of a buffer. @jbrea

Environment

Environment-related code has been split into ReinforcementLearningBase. As @jobjob suggested, we'd better create a new repo (like Plots.jl, I guess?) to support different backends. And we can have many different wrappers to easily introduce new environments. Preprocessors can also be merged into wrappers. I'll make an example repo later and have more discussions there.

Conclusion

In my humble opinion, the components listed above are clear enough to solve many typical RL problems on a single machine. For continuous action space problems, @jbrea will take a look later. The only work left is to reorganize the source code a little and clearly define some abstract structs to guide developers on how to implement new algorithms. Some highlights in this repo are:

  1. Model comparison. This part will be very important in the future and needs to be enhanced to support distributed algorithms.
  2. The many predefined callbacks, which are very useful.

What is missing?

To compete with many other RL packages, there's still a long way to go, and one of the most important parts is supporting distributed RL algorithms.

Typically, there are two directions to scale up deep reinforcement learning.

  1. To parallelize the computation of gradients.
  2. To distribute the generation and selection of experiences.

For the first one, we need an efficient parameter server and a standalone resource manager to dispatch computations. (I'm not very experienced in this field, you guys may add more details here.) Some questions in mind are:

  1. How to communicate between learners and actors? Pub-sub or poll? (See the sketch after this list.)
  2. How to handle fault tolerance? Maybe we can borrow some ideas from Ray.
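Just to make the question concrete, here is a very rough sketch of the poll flavour using nothing but the Distributed standard library. The Transition type, the channel size, and the overall design are assumptions, not a proposal for the final API.

using Distributed
addprocs(2)

@everywhere struct Transition
    s::Vector{Float32}
    a::Int
    r::Float32
    done::Bool
end

# actors push experiences to the learner through a remote channel ...
const experiences = RemoteChannel(() -> Channel{Transition}(10_000))

# ... and either receive new parameters pushed by the learner (pub-sub)
# or fetch them on demand (poll), e.g. via
# remotecall_fetch(get_current_params, 1), where `get_current_params`
# is a hypothetical accessor living on the learner process.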

For the second one, I think we need to carefully design the API first. Although there are many implementations, here in Dopamine and here in Ray, none of them can be directly ported to Julia (and I believe we can have more efficient implementations). Some critical points are:

  1. Shared memory or not? I have had a long discussion on this with @jbrea before. Obviously it's more efficient to treat the next start state as the end state of the current transition, but I found that it makes the code much more complicated (forgive my programming skills in Julia; maybe we can find a way to address it). Also, the paper on Distributed Prioritized Experience Replay states in the last sentence of "Adding Data" in Appendix F IMPLEMENTATION: "Note that, since we store both the start and the end state with each transition, we are storing some data twice: this costs more RAM, but simplifies the code." So I guess I'm not the only one...
  2. Generalizing to a (distributed) prioritized buffer. There are many practical issues to be addressed (see the sketch after this list):
    1. How to easily add more metadata for each transition (id, priority, rank order, last active time, ...)?
    2. How to queue batches from each actor?
    3. What is the general way to update a distributed buffer?
    4. Should async updates be supported?
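A hypothetical sketch for the metadata point: keep the per-transition metadata in parallel arrays next to the experience data, so that new fields can be added without touching the transition layout itself.

struct TransitionMeta
    id::Vector{Int}
    priority::Vector{Float64}
    rank::Vector{Int}
    last_active::Vector{Float64}   # e.g. time() of the last sampling
end

TransitionMeta(capacity::Int) = TransitionMeta(
    zeros(Int, capacity),
    fill(1.0, capacity),           # new transitions start with a default priority
    zeros(Int, capacity),
    zeros(Float64, capacity),
)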

Multi-agent

Although multi-agent scenarios are not considered in most existing packages, we'd better think about them at an early stage.

Model Based Algorithms

Compared with Ray

According to the paper about Ray, there are three system layers:

  1. Global Control Store
  2. Bottom-up Distributed Scheduler
  3. In-memory Object Store

For me, the first and second parts are relatively easy to understand and re-implement, but for the third part it is especially difficult to figure out how to do it in Julia. If I understand it correctly, Arrow/Plasma is used so that processes on one node avoid serialization/deserialization. I've checked the package Arrow.jl; it seems to only cover data transformation, and I still don't know how to manage a big shared-memory object store in Julia across processes like the one in Ray.
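Not an object store by any means, but possibly a building block worth checking: the SharedArrays standard library already lets all local worker processes access the same block of memory without serialization. A minimal sketch:

using Distributed, SharedArrays
addprocs(2)

# every local worker sees the same underlying memory
states = SharedArray{Float32}(84, 84, 4, 100)

fetch(@spawnat 2 (states[1, 1, 1, 1] = 1f0))   # write on worker 2
states[1, 1, 1, 1]                             # reads 1.0f0 on the master process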

For the rllib part, the different levels of abstraction are really worth learning from.

Agent
└── Optimizer
       └── Policy Evaluator
               └── Policy Graph
                    └── Env Creator
                          └── Base Env, Multi-Agent Env, Async Env...

So for me, I'm more skilled in implementing the Env Creator part and I can also offer help to design the API of the other parts. But at the system design level, I really feel that I have a lot to learn.

What's the ideal way to do RL research in Julia?

  1. Easy to implement/reproduce the results of popular algorithms. I emphasize implementation here because so many RL packages just provide a function with a lot of parameters and hide a lot of details there (just like saying, "Hey look, I've implemented so many fancy algorithms here", when in fact it's pretty hard to figure out what it is doing inside). One thing I really enjoy while learning and using Julia is that I can easily check the source code to figure out the mechanisms inside and then make improvements.
  2. Flexible to reuse existing packages. Like rllib (in Ray), we don't want to limit the users to any specific DL framework. The core components are always replaceable.
  3. Easy to scale.

TODO List

jbrea commented 5 years ago

Thanks, Jun, for this great summary! Here are some initial comments from my side (more to come later)

RLSetup/callbacks

  1. I agree, stopping criteria should be just callbacks. I like the way Flux.jl handles this but also the iterators approach of Knet.jl (there is also this nice blog post).
  2. In general I prefer structs over closures (I like to be able to, e.g., define show and use fieldnames to see what data lies around; see also this discussion), but with the way Flux.jl implements callbacks one could use either way. For example:
    rlsetup = RLSetup(learner, buffer, policy)
    rewards = Float64[]
    callback = () -> push!(rewards, rlsetup.buffer.rewards[end])
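    For comparison, here is the same callback written as a zero-argument callable struct (RewardRecorder is just an illustrative name, not something in the package):

    struct RewardRecorder{T}
        rlsetup::T
        rewards::Vector{Float64}
    end
    RewardRecorder(rlsetup) = RewardRecorder(rlsetup, Float64[])
    # callable with zero arguments, like the Flux-style callback above
    (c::RewardRecorder)() = push!(c.rewards, c.rlsetup.buffer.rewards[end])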

Learner and Policy

Currently, the learner contains the parameters that are necessary for exploitation, and the policy contains the extra information needed to specify exploration. The function selectaction implements the exploration-exploitation trade-off through its dependence on both the learner and the policy. The function update obviously changes just the parameters. This is also the case for DeepActorCritic, despite the confusing field names policynet and policylayer: these are just the parameters necessary for exploitation, and it is only by choosing the default policy to be a soft-max policy that exploration becomes fully specified.

Maybe it should become clearer in the naming that we are really dealing here with the exploitation-exploration trade-off, but I think that this separation needs to persist. It should also in the future be possible to run DQN with an arbitrary choice of exploration method, even though the default one may be epsilon-greedy.
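A small illustration of that separation (the names are illustrative; actionvalues stands for whatever accessor exposes the learner's action-value estimates):

struct EpsilonGreedy
    ϵ::Float64
end

function selectaction(learner, policy::EpsilonGreedy, s)
    values = actionvalues(learner, s)   # exploitation: lives in the learner
    if rand() < policy.ϵ                # exploration: lives in the policy
        rand(1:length(values))
    else
        argmax(values)
    end
end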

jbrea commented 5 years ago

Buffers/Traces

I would prefer to not store any data twice in any buffer. It is just nice to load the longest possible replay buffer fully onto a GPU for fast training; reducing the length of the longest possible replay buffer by almost a factor of 2 because of redundant storage does not seem nice to me. I would prefer to have clean and unambiguous access to the replay buffer. I think we never really discussed proposition #22. Maybe this could be a way forward.
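For example, with unambiguous indexing the end state never has to be stored, because transition i can simply be read as follows (the buffer fields are illustrative):

# the end state of transition i is the start state of transition i + 1,
# so every state is stored exactly once
gettransition(b, i) = (s = b.states[i], a = b.actions[i], r = b.rewards[i],
                       s′ = b.states[i + 1], done = b.done[i])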

I agree, traces are like buffers. They are used for methods with eligibility traces like TD-lambda.

Distributed/Multi-agent methods

Yes, we should really figure out how to design these before moving on :smile:. See also this question.

iblislin commented 5 years ago

About callbacks and Tracers

I think a single callback! function isn't enough. I will adopt the idea from JuliaML/LearningStrategies.jl. This package provides a training loop learn!: https://github.com/JuliaML/LearningStrategies.jl#metastrategy

There are five different types of callback functions for the different needs at different stages of training: setup!, update!, hook, finished, and cleanup!.

So, in the case of RL, maybe we need to extend it for episodic task starting/ending, as sketched below.
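A casual sketch of what that extension could look like (the stage names for the episode boundaries are placeholders, not a concrete proposal):

abstract type AbstractCallback end

# LearningStrategies-style stages, extended with episode boundaries
setup!(cb::AbstractCallback, rlsetup)            = nothing   # before training
episodestart!(cb::AbstractCallback, rlsetup)     = nothing   # a new episode begins
step!(cb::AbstractCallback, rlsetup, transition) = nothing   # every environment step
episodeend!(cb::AbstractCallback, rlsetup)       = nothing   # the episode finished
cleanup!(cb::AbstractCallback, rlsetup)          = nothing   # after training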

Also, the tracers design is included in that package. For example, an iteration limiter: https://github.com/JuliaML/LearningStrategies.jl#learning-with-a-metalearner

Or a dataframe tracer (it may be useful for tracing rewards): https://github.com/JuliaML/LearningStrategies.jl/blob/master/examples/dftracer.jl

I propose introducing a family of callback functions:

It's okay to change the naming; I just enumerated my ideas casually.

findmyway commented 5 years ago

@jbrea I agree with the comments about the Buffer and we can discuss details later. I think I still need a couple of days to figure out how to borrow some ideas about distributed algorithms from rllib.

@iblis17 Good suggestion! I also find that rllib uses different callbacks for different stages.