-
Hi, I have some questions about DPO:
1. Is there a reason for choosing the Nectar dataset to train offline vanilla DPO rather than using the same dataset as iterative DPO, for a possibly fairer comp…
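For reference, a minimal sketch of the vanilla (offline) DPO loss the question refers to, assuming per-sequence log-probabilities are already computed; the function and variable names are illustrative, not from this repo:

```python
import torch.nn.functional as F

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    # Log-ratios of the trained policy vs. the frozen reference model
    # on the preferred and dispreferred responses (tensors of log-probs).
    chosen_logratio = pi_chosen - ref_chosen
    rejected_logratio = pi_rejected - ref_rejected
    # Maximize the margin between the two implicit rewards.
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()
```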
-
## Task
Notes:
- The ITI is uniform over five different durations. We should do this instead of drawing from a geometric distribution (see the sketch below).
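A minimal sketch of the proposed sampling rule, with placeholder duration values (the actual five durations are an assumption):

```python
import numpy as np

rng = np.random.default_rng()
iti_durations_s = [1.0, 2.0, 3.0, 4.0, 5.0]  # placeholder values, not the task's real ITIs
iti = rng.choice(iti_durations_s)            # uniform over the five options, not geometric
```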
## Behavior
Decision times are slightly longer following switches (left pan…
-
### What happened + What you expected to happen
I can’t seem to replicate the original [PPO](https://arxiv.org/pdf/1707.06347) algorithm's performance when using RLlib's PPO implementation. The hyp…
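For concreteness, a minimal sketch (not necessarily the reporter's actual setup) of pointing RLlib's `PPOConfig` at the hyperparameters the paper reports for MuJoCo tasks; the environment choice is an assumption:

```python
from ray.rllib.algorithms.ppo import PPOConfig

config = (
    PPOConfig()
    .environment("Hopper-v4")        # hypothetical benchmark environment
    .training(
        lr=3e-4,                     # Adam step size from the paper
        gamma=0.99,                  # discount factor
        lambda_=0.95,                # GAE parameter
        clip_param=0.2,              # PPO clipping epsilon
        num_sgd_iter=10,             # SGD epochs per training batch
        sgd_minibatch_size=64,
        train_batch_size=2048,       # timesteps collected per update
    )
)
algo = config.build()
print(algo.train()["episode_reward_mean"])
```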
-
Thank you for sharing. But when I run your Sumo models from Fig. 1, 7, and 11, they often raise errors. In Fig. 7:
---------------------------------------------------------------------------
NameError …
-
- Behavior session metadata
- Which fields should go into the behavior session metadata?
https://github.com/AllenNeuralDynamics/dynamic-foraging-task/issues/303#issuecomment-2062196234
- Existi…
-
https://github.com/chromiecraft/chromiecraft/issues/6087
### What client do you play on?
enUS
### Faction
Alliance
### Content Phase:
Generic
### Current Behaviour
It seems bes…
-
This document lists the features on LMFlow's roadmap. We welcome any discussion of, or contributions to, the specific features in the related Issues/PRs. 🤗
### Main Features
* Data
* [x] DPO dataset format…
-
**What problem or use case are you trying to solve?**
Sometimes models fail to do their job correctly, and we would benefit from starting over from the beginning. There are a few examples of th…
-
Hi, currently reward_fn is independent of the environment class (mbrl.models.ModelEnv) and accepts actions and the next observation as input. In practice, a more general, environment-parameter-dependent re…
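A minimal sketch of the current convention described above: a reward function that sees only actions and next observations, handed to ModelEnv (the quadratic cost itself is a hypothetical example, not from the library):

```python
import torch

def reward_fn(actions: torch.Tensor, next_obs: torch.Tensor) -> torch.Tensor:
    # Batched quadratic cost: penalize distance from the origin and control effort.
    state_cost = (next_obs ** 2).sum(dim=-1, keepdim=True)
    action_cost = 0.01 * (actions ** 2).sum(dim=-1, keepdim=True)
    return -(state_cost + action_cost)

# model_env = mbrl.models.ModelEnv(env, dynamics_model, termination_fn, reward_fn)
```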
-
### Describe the feature
PPO training needs to keep four models in memory at the same time. The original implementation keeps the reward, actor, critic, and initial (reference) models in video RAM simultaneously.
…
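One way to relieve that pressure (a sketch of the idea, not this repo's implementation; the generate()/scoring calls are hypothetical) is to keep only the trainable actor and critic resident on the GPU and page the two frozen models in just for the rollout-scoring phase:

```python
import torch

def score_rollout(prompts, actor, reward_model, ref_model, device="cuda"):
    # Page the frozen models onto the GPU only while they are needed.
    reward_model.to(device)
    ref_model.to(device)
    with torch.no_grad():
        sequences = actor.generate(prompts)      # hypothetical generation API
        rewards = reward_model(sequences)        # hypothetical scoring call
        ref_logprobs = ref_model(sequences)
    # Move them back so PPO updates only hold actor + critic in video RAM.
    reward_model.to("cpu")
    ref_model.to("cpu")
    torch.cuda.empty_cache()
    return sequences, rewards, ref_logprobs
```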