facebookresearch / rlmeta

RLMeta is a light-weight flexible framework for Distributed Reinforcement Learning Research.
MIT License

[Documentation] Tracking documentation site progress #12

Open · xiaomengy opened this issue 2 years ago

xiaomengy commented 2 years ago

The documentation site is under construction. We will track the progress here.

jahidhasanlinix commented 2 years ago

Can this framework be used for job scheduling for a deep learning model, similar to the job scheduling used in a Spark cluster?

xiaomengy commented 2 years ago

Hi Jahid, could you help clarify what you mean by using this framework for job scheduling?

If you mean using this framework to train an RL model, it depends on whether you have an environment that can be integrated into the Loop. The simplest way is to write an Environment that shares a similar interface with OpenAI Gym.
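To illustrate, here is a minimal sketch of a Gym-style environment; the `GymWrapper`/`ParallelLoop` integration in the trailing comments is an assumption about the API for illustration only, so check the repo for the exact wrapper and loop classes.

```python
import gym

class CountingEnv(gym.Env):
    """Toy environment with the standard OpenAI Gym interface:
    reach state 9 by repeatedly choosing action 1."""

    def __init__(self):
        self.observation_space = gym.spaces.Discrete(10)
        self.action_space = gym.spaces.Discrete(2)
        self._state = 0

    def reset(self):
        self._state = 0
        return self._state

    def step(self, action):
        self._state = min(self._state + action, 9)
        reward = float(self._state == 9)
        done = self._state == 9
        return self._state, reward, done, {}

# Hypothetical integration (names are assumptions, not the exact RLMeta API):
# env = GymWrapper(CountingEnv())
# loop = ParallelLoop(env_factory, agent_factory, ...)
```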

If you mean scheduling the distributed jobs for this framework: we currently run it on our own clusters, and we will later add scheduling scripts that use submitit (https://github.com/facebookincubator/submitit). We may also have a longer-term plan to support different clusters.
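For reference, a submitit launch typically looks like the sketch below; the `train` entry point is a placeholder, and the resource parameters are just examples rather than our actual configuration.

```python
import submitit

def train(seed: int) -> None:
    # Placeholder for an actual RLMeta training entry point.
    print(f"training with seed={seed}")

# AutoExecutor targets Slurm when available and can fall back to local runs.
executor = submitit.AutoExecutor(folder="submitit_logs")
executor.update_parameters(timeout_min=60, cpus_per_task=8, gpus_per_node=1)

job = executor.submit(train, seed=0)
job.result()  # blocks until the job finishes
```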

We are building a documentation site with detailed tutorials. Once it is ready, you should find it easier to follow.

DrJimFan commented 2 years ago

Thanks for this wonderful new framework! Do you have an estimated timeline for the doc/tutorial and benchmarking results against other frameworks?

xiaomengy commented 2 years ago

We are currently working on the documentation site. To be honest, we also have other projects, so we are still prioritizing our plans for H1 2022. The rough plan is to make the documentation site available within Q1 2022, and we hope the actual timeline will be shorter because we want to ship it as soon as possible. This issue was created to track the progress, and we will post updates here. If there are any questions, we can also discuss them in this issue or by email.

jahidhasanlinix commented 2 years ago

@xiaomengy thank you for your response. I was actually thinking of distributed parallel computing for a deep learning model to perform job/task scheduling in a Spark cluster. And yes, I'm waiting for the documentation site for a better understanding. I appreciate your time.

xiaomengy commented 2 years ago

@jahidhasanlinix In that case, another thing you can try is to take a look at moolib (https://github.com/facebookresearch/moolib), which is the backend we use for distributed RL. You might also consider building something on top of moolib, as we did.

DrJimFan commented 2 years ago

Thanks @xiaomengy! Would you mind sharing any rough benchmark results against Ray RLlib, Tianshou, DeepMind Acme, etc.? Are we talking about a ~20% speedup, or 2x, 5x, given the same hardware resources? Are there any improvements or regressions in the final eval score achieved on the same Atari/Gym tasks? Thank you so much for your time and help!

xiaomengy commented 2 years ago

@LinxiFan, thanks for the suggestions. To be honest, we don't have many benchmark results against other libraries yet. We will run more experiments and release the results later. Let me explain a little more about why we built RLMeta this way. The basic design goal is to support our own RL research across different areas. To support different problems and environments, we need a flexible framework that can easily be adapted to different use cases. As you know, different use cases and algorithms may even require different env-agent loops, so flexibility is our top priority. That's why we defined the Remote class, so that users can implement their own algorithms or loops easily. On top of this principle, we try to improve performance as much as possible.

Because of this, we use a different design pattern to implement distributed RL. We are not using the EnvPool pattern, which runs a batch of envs and then interacts with the agent. Instead, we use async loops: we distribute the env-agent loops across different processes, and within each process the different env-agent loops run asynchronously. So while one agent is running its act operation with the model on the GPU, the next env can run its env step on the CPU at the same time, which improves performance. I attached a figure below to show this. This way, users can implement their own loops or algorithms almost as in the single-threaded case, with very limited modifications.
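As a toy illustration of the idea (not RLMeta's actual implementation), the sketch below runs several env-agent loops on one asyncio event loop, so a CPU-bound env step can proceed while another loop is awaiting a simulated GPU act:

```python
import asyncio

async def agent_act(obs: int) -> int:
    await asyncio.sleep(0.01)  # stands in for a (batched) GPU forward pass
    return obs % 2

def env_step(action: int) -> int:
    return action + 1  # stands in for a CPU-bound env transition

async def env_agent_loop(loop_id: int, num_steps: int) -> None:
    event_loop = asyncio.get_running_loop()
    obs = 0
    for _ in range(num_steps):
        # While this coroutine awaits the "GPU", other loops can step their envs.
        action = await agent_act(obs)
        obs = await event_loop.run_in_executor(None, env_step, action)
    print(f"loop {loop_id} finished at obs={obs}")

async def main() -> None:
    await asyncio.gather(*(env_agent_loop(i, num_steps=5) for i in range(4)))

asyncio.run(main())
```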

For example, users can easily override the loop implementation here for different use cases: https://github.com/facebookresearch/rlmeta/blob/main/rlmeta/core/loop.py#L179
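A hypothetical custom loop might look like the sketch below; the type and method names (`TimeStep`, `async_act`, `async_observe`) are assumptions for illustration, not the exact RLMeta interface, so see loop.py for the real one.

```python
import asyncio
from dataclasses import dataclass

@dataclass
class TimeStep:  # stand-in for the framework's timestep type
    observation: int
    reward: float
    done: bool

class ToyEnv:
    def reset(self) -> TimeStep:
        self._t = 0
        return TimeStep(0, 0.0, False)

    def step(self, action: int) -> TimeStep:
        self._t += 1
        return TimeStep(self._t, 1.0, self._t >= 3)

class ToyAgent:
    async def async_act(self, timestep: TimeStep) -> int:
        await asyncio.sleep(0)  # stands in for a remote/batched model call
        return 1

    async def async_observe(self, action: int, timestep: TimeStep) -> None:
        pass  # e.g. push the transition to a replay buffer

class MyLoop:
    """A custom env-agent loop written as plain, single-threaded-looking code."""

    def __init__(self, env: ToyEnv, agent: ToyAgent) -> None:
        self.env, self.agent = env, agent

    async def run(self) -> None:
        timestep = self.env.reset()
        while not timestep.done:
            action = await self.agent.async_act(timestep)
            timestep = self.env.step(action)
            await self.agent.async_observe(action, timestep)

asyncio.run(MyLoop(ToyEnv(), ToyAgent()).run())
```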

So overall, flexibility for implementing new algorithms on new environments is our top priority, which is why all of the loops and algorithms are implemented in Python. On top of that, we are trying to improve performance. As you suggested, we will run some experiments to compare performance with existing libraries. However, it may be tricky to define a fair comparison. For example, Acme is built on TensorFlow and JAX, while we are on PyTorch; RLlib is built on Ray, while we are mostly based on our own implementations. Moreover, for distributed envs, I believe envpool currently provides the best performance, and our framework can also integrate with envpool. I'd like to run more experiments and release the results later along with the documentation site. For now, I just wanted to clarify why we built this and why we think the tutorial is important. I hope this helps address some of your concerns.

[Figure: diagram of the async env-agent loops described above]

DrJimFan commented 2 years ago

Thank you so much for the explanation! This is very informative.