-
Nice job!
My first concern is how to implement RL to solve the problem in this paper. This project seems to use SFT (even in STAGE 2, it just uses a reward as a coefficient of the SFT loss) bu…
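For concreteness, here is a minimal sketch (hypothetical names, not this project's actual code) of what "reward as a coefficient of the SFT loss" usually amounts to; note that it is still a weighted supervised objective rather than a policy-gradient update:

```python
import torch.nn.functional as F

# Hypothetical sketch: per-sample reward simply scales the SFT cross-entropy.
def reward_weighted_sft_loss(logits, labels, rewards):
    # logits: (batch, seq, vocab); labels: (batch, seq); rewards: (batch,)
    per_token = F.cross_entropy(logits.transpose(1, 2), labels, reduction="none")
    per_sample = per_token.mean(dim=1)    # average token loss per example
    return (rewards * per_sample).mean()  # reward acts as a coefficient of the SFT loss
```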
-
I followed your 'adding a new model' guide to add Mixtral. It appears that transformers' Mixtral implementation does not have a MixtralMLP class, as the guide suggests. The other items import fine. As a workaround …
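As a quick way to see what the module actually exposes (assuming the standard transformers module path for Mixtral; the exact class names vary across transformers versions), something like this lists the candidate classes:

```python
# Inspect transformers' Mixtral module to find the actual MLP/MoE class names.
import inspect
from transformers.models.mixtral import modeling_mixtral

print([name for name, obj in inspect.getmembers(modeling_mixtral, inspect.isclass)
       if "MLP" in name or "Moe" in name])
```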
-
Plan proposed by @AntoninLagarrigue (Discord: Zinzolin)
**Module I: Introduction**
_Fundamental concepts_
1. Introduction
2. Multi-armed bandits
3. Markov decision processes
…
-
Currently, the policy network is actively trained to output 0 on a move whenever it is invalid. The move is never taken, so target_pi is zero there, and this enters into the loss function as a result…
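To make the issue concrete, here is a minimal sketch of the two behaviours (hypothetical names, not this repo's actual code): with the current setup the softmax runs over every move, so invalid-move logits still receive gradient toward zero probability, whereas masking removes them from the distribution entirely:

```python
import torch
import torch.nn.functional as F

def policy_loss(logits, target_pi):
    # Softmax over ALL moves: invalid moves have target_pi == 0, but their
    # logits still get gradient (p_i - 0) through the normalization, so the
    # network is actively pushed to output ~0 probability on them.
    log_probs = F.log_softmax(logits, dim=-1)
    return -(target_pi * log_probs).sum(dim=-1).mean()

def masked_policy_loss(logits, target_pi, valid_mask):
    # Alternative: drop invalid moves from the distribution entirely.
    logits = logits.masked_fill(~valid_mask, float("-inf"))
    log_probs = F.log_softmax(logits, dim=-1)
    # 0 * (-inf) would give NaN at masked positions, so zero them explicitly.
    log_probs = torch.where(valid_mask, log_probs, torch.zeros_like(log_probs))
    return -(target_pi * log_probs).sum(dim=-1).mean()
```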
-
### System Info
Python version: 3.11.0
PyTorch version: 2.4.1 or 2.5.0
Transformers version: 4.46.0
TRL version: 0.11.4
PEFT version: 0.13.2
### In…
-
Here is another potentially useful burndown list of features to add to the repo.
The CSS and HTML features are mentioned, in some way, in either State of CSS or State of HTML (over the past 4 years).
T…
-
# Summary
#### Link
[The Option-Critic Architecture](https://arxiv.org/abs/1609.05140)
#### Author/Institution
Pierre-Luc Bacon, Jean Harb, Doina Precup
McGill University
## What is t…
-
```
★---> train stats after 160768 examples: {'rewards_train/chosen': 'nan', 'rewards_train/rejected': 'nan', 'rewards_train/accuracies': '0', 'rewards_train/margins': 'nan', 'logps_train/reject…
```
-
I was wondering how fast this is supposed to run. Over 6 hours it has only trained 2k steps according to the wandb output, and I have an A6000.
I want to run it for 1 million env steps. I believe that i…
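As a rough back-of-the-envelope check (assuming wandb steps map 1:1 to env steps, which may not hold for this repo):

```python
# Extrapolate from the numbers above; the actual step-to-env-step ratio
# depends on the training config, so this is only an estimate.
steps_done, hours_elapsed = 2_000, 6
target_steps = 1_000_000
print(target_steps / (steps_done / hours_elapsed))  # ≈ 3000 hours at this rate
```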
-
Hi, I want to ask: when training DynamicViT at keep ratio **0.3**, training terminates due to a loss NaN problem. My training script is shown below.
Since I remember the original Dynami…