Toni-SM / skrl

Modular reinforcement learning library (on PyTorch and JAX) with support for NVIDIA Isaac Gym, Omniverse Isaac Gym and Isaac Lab
https://skrl.readthedocs.io/
MIT License

Add Model-Based Meta-Policy-Optimization (MBMPO) #35

Closed: juhannc closed this 9 months ago

juhannc commented 1 year ago

Add Model-Based Meta-Policy-Optimization (MBMPO)

Introduction and description

Coming soon

Improvements in this PR

Coming soon

Proof of Work

Coming soon

Cheers,

Johann

juhannc commented 1 year ago

Hi @Toni-SM, for my work I have to implement MBMPO. But I wanted your opinion on something.

First, some words about MBMPO in case you are not familiar with it. MBMPO is a model-based algorithm: it trains the policy using a learned dynamics model and, as far as I understand, uses TRPO to optimize the (meta-)policy. However, the idea could probably be generalized to use other algorithms for the policy. Thus, I could either hard-code TRPO into the agent or pass another agent to MBMPO.

The latter would increase flexibility, but it would also require a somewhat different __init__ function, something like:

class MBMPO(Agent):
    def __init__(self,
        models: Dict[str, Model],
+       agent: Agent,
        memory: Optional[Union[Memory, Tuple[Memory]]] = None,
        observation_space: Optional[Union[int, Tuple[int], gym.Space]] = None,
        action_space: Optional[Union[int, Tuple[int], gym.Space]] = None,
        device: Union[str, torch.device] = "cuda:0",
        cfg: Optional[dict] = None) -> None:
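
For illustration, a usage sketch of that signature could look like the following. This is only a sketch of the proposed design: the MBMPO class and MBMPO_DEFAULT_CONFIG do not exist yet, the model keys are assumptions, and the surrounding objects (models, memory, env) are assumed to be defined elsewhere.

# usage sketch only: MBMPO and MBMPO_DEFAULT_CONFIG are hypothetical names;
# TRPO and TRPO_DEFAULT_CONFIG are the existing skrl agent and default config
from skrl.agents.torch.trpo import TRPO, TRPO_DEFAULT_CONFIG

# inner agent used to optimize the (meta-)policy
trpo_agent = TRPO(models={"policy": policy, "value": value},  # models defined elsewhere
                  memory=memory,
                  observation_space=env.observation_space,
                  action_space=env.action_space,
                  device="cuda:0",
                  cfg=TRPO_DEFAULT_CONFIG)

# proposed MBMPO agent wrapping the inner agent and the learned dynamics model(s)
agent = MBMPO(models={"dynamics": dynamics_model},  # hypothetical model key
              agent=trpo_agent,
              memory=memory,
              observation_space=env.observation_space,
              action_space=env.action_space,
              device="cuda:0",
              cfg=MBMPO_DEFAULT_CONFIG)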

What's your take on that?

Toni-SM commented 1 year ago

The idea of generalizing to other agents looks good... It is preferable to avoid modifying the arguments of the agent constructors whenever possible, but there are cases where it is necessary, as in AMP, or in the solution you propose for this agent.

juhannc commented 1 year ago

Thanks for the quick feedback, I will go that route then.

Toni-SM commented 1 year ago

By the way, the current TRPO implementation iterates through the learning epochs for both the policy and the value function, when it should do so only for the value function (the optimization of the policy should be excluded from that loop).
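
A minimal structural sketch of that separation (illustrative only, not skrl's actual _update code; the TRPO policy step itself is omitted) might look like:

import torch
import torch.nn as nn
import torch.nn.functional as F

# illustrative sketch, not skrl's actual implementation:
# the TRPO policy step happens once per update, outside the epoch loop,
# and only the value function is fitted over the learning epochs
def update(policy: nn.Module,
           value: nn.Module,
           states: torch.Tensor,
           returns: torch.Tensor,
           learning_epochs: int,
           value_optimizer: torch.optim.Optimizer) -> None:
    # 1) single TRPO policy update (conjugate gradient + line search) would go here,
    #    outside the loop; omitted for brevity
    ...

    # 2) only the value function is optimized inside the learning-epochs loop
    for _ in range(learning_epochs):
        value_loss = F.mse_loss(value(states).squeeze(-1), returns)
        value_optimizer.zero_grad()
        value_loss.backward()
        value_optimizer.step()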