-
Nice job!
My first concern is how to implement RL to solve the problem in this paper. This project seems to use SFT (even in STAGE 2, it just uses a reward as a coefficient of the SFT loss) bu…
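For concreteness, here is a minimal sketch (hypothetical names, not this project's actual code) of what "reward as a coefficient of the SFT loss" usually amounts to; note that it is still a weighted supervised objective rather than a policy-gradient update:

```python
import torch.nn.functional as F

# Hypothetical sketch: per-sample reward simply scales the SFT cross-entropy.
def reward_weighted_sft_loss(logits, labels, rewards):
    # logits: (batch, seq, vocab); labels: (batch, seq); rewards: (batch,)
    per_token = F.cross_entropy(logits.transpose(1, 2), labels, reduction="none")
    per_sample = per_token.mean(dim=1)    # average token loss per example
    return (rewards * per_sample).mean()  # reward acts as a coefficient of the SFT loss
```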
-
I followed your 'adding a new model' guide to add Mixtral. It appears that transformers' Mixtral implementation does not have a MixtralMLP class, as the guide suggests. The other items import fine. As a workaround …
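As a quick way to see what the module actually exposes (assuming the standard transformers module path for Mixtral; the exact class names vary across transformers versions), something like this lists the candidate classes:

```python
# Inspect transformers' Mixtral module to find the actual MLP/MoE class names.
import inspect
from transformers.models.mixtral import modeling_mixtral

print([name for name, obj in inspect.getmembers(modeling_mixtral, inspect.isclass)
       if "MLP" in name or "Moe" in name])
```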
-
Plan proposed by @AntoninLagarrigue (Discord: Zinzolin)
**Module I: Introduction**
_Fundamental concepts_
1. Introduction
2. Multi-armed bandits
3. Markov decision processes
…
-
Currently, the policy network is actively trained to output 0 on a move whenever it is invalid. The move is never taken, so target_pi is zero there, and this enters into the loss function as a result…
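To make the issue concrete, here is a minimal sketch of the two behaviours (hypothetical names, not this repo's actual code): with the current setup the softmax runs over every move, so invalid-move logits still receive gradient toward zero probability, whereas masking removes them from the distribution entirely:

```python
import torch
import torch.nn.functional as F

def policy_loss(logits, target_pi):
    # Softmax over ALL moves: invalid moves have target_pi == 0, but their
    # logits still get gradient (p_i - 0) through the normalization, so the
    # network is actively pushed to output ~0 probability on them.
    log_probs = F.log_softmax(logits, dim=-1)
    return -(target_pi * log_probs).sum(dim=-1).mean()

def masked_policy_loss(logits, target_pi, valid_mask):
    # Alternative: drop invalid moves from the distribution entirely.
    logits = logits.masked_fill(~valid_mask, float("-inf"))
    log_probs = F.log_softmax(logits, dim=-1)
    # 0 * (-inf) would give NaN at masked positions, so zero them explicitly.
    log_probs = torch.where(valid_mask, log_probs, torch.zeros_like(log_probs))
    return -(target_pi * log_probs).sum(dim=-1).mean()
```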
-
### System Info
Python version: 3.11.0
PyTorch version: 2.4.1 or 2.5.0
Transformers version: 4.46.0
TRL version: 0.11.4
PEFT version: 0.13.2
### In…
-
Here is another potentially useful burndown list of features to add to the repo.
The CSS and HTML features are mentioned, in some way, in either State of CSS or State of HTML (over the past 4 years).
T…
-
# Summary
#### Link
[The Option-Critic Architecture](https://arxiv.org/abs/1609.05140)
#### Author/Institution
Pierre-Luc Bacon, Jean Harb, Doina Precup
McGill University
## What is t…
-
```
★---> train stats after 160768 examples: {'rewards_train/chosen': 'nan', 'rewards_train/rejected': 'nan', 'rewards_train/accuracies': '0', 'rewards_train/margins': 'nan', 'logps_train/reject…
```
-
I was wondering how fast this is supposed to run. Over 6 hours it has only trained 2k steps according to the wandb output, and I have an A6000.
I want to run it for 1 million env steps. I believe that i…
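As a rough back-of-the-envelope check (assuming wandb steps map 1:1 to env steps, which may not hold for this repo):

```python
# Extrapolate from the numbers above; the actual step-to-env-step ratio
# depends on the training config, so this is only an estimate.
steps_done, hours_elapsed = 2_000, 6
target_steps = 1_000_000
print(target_steps / (steps_done / hours_elapsed))  # ≈ 3000 hours at this rate
```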
-
Hi, I want to ask: when training DynamicViT at keep ratio **0.3**, training terminates due to a loss NaN problem. My training script is shown below.
Since I remember the original Dynami…