-
Hi, I ran CPO on ALMA-7B-LoRA using the default hyperparameters in the script (learning rate), along with the parameter configuration and preference data described in the paper, but the trained model outputs large amounts of repeated preceding text, or sometimes doesn't translate at all, as shown in the image below (zh->en; raw_res is the result without applying the clean function in utils). Is there a hyperparameter I haven't set correctly? Thank you.
![image](https://github.com/user-attachments/asse…
-
I'm now trying to train Llama 3.1 with the GRIT pipeline.
At first I directly changed ``--model_name_or_path`` and ran the training code (the training script I used is as follows):
```
#!/bin/bash
#SB…
-
I appreciate your great work in zero123.
I want to retrain zero123 on medical data. My dataset contains about 700 samples, using the same data processing method as in the paper. Each sample has 12 …
-
**Is your feature request related to a problem? Please describe.**
Hello. I am a developer of Bitextor (https://github.com/bitextor/bitextor), which is based on Snakemake, and we are having issues ru…
-
This is a high level epic.
The Worker State Machine (`distributed/worker_state_machine.py`) can be exclusively updated through the `Worker.handle_stimulus` handler. _Most_ calls that change the wor…
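As a sketch of the single-entry-point design described above (a toy illustration, not the actual `WorkerState` implementation; the class and event names below are invented and the real dask stimuli are far richer):

```python
from dataclasses import dataclass, field

@dataclass
class TaskEvent:
    """Illustrative stimulus; real worker stimuli carry much more detail."""
    key: str
    kind: str  # e.g. "compute", "free"

@dataclass
class ToyWorkerState:
    """Toy state machine: every mutation funnels through handle_stimulus,
    mirroring the pattern where the state can only be updated via one handler."""
    tasks: dict = field(default_factory=dict)  # task key -> state string
    log: list = field(default_factory=list)    # audit trail of transitions

    def handle_stimulus(self, ev: TaskEvent) -> None:
        if ev.kind == "compute":
            self.tasks[ev.key] = "executing"
        elif ev.kind == "free":
            self.tasks.pop(ev.key, None)
        else:
            raise ValueError(f"unknown stimulus {ev.kind!r}")
        self.log.append((ev.key, ev.kind))
```

Routing every change through one handler makes the transitions easy to log, replay, and test deterministically.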
-
### When did you clone our code?
I cloned the code base after 5/1/23
### Describe the issue
Issue: When I use DeepSpeed ZeRO-3 to pretrain LLaVA-13B on 4 × A100 (40G), I get the error shown below. …
-
Thank you for providing this project. As the title says, I find this repo cannot support CPU offload, like this issue:
https://github.com/huggingface/diffusers/issues/2531
Could you consider adding this supp…
-
### 🚀 The feature, motivation and pitch
The DeepSeek V2 paper proposed a training methodology where both the LR and the batch size were on a scheduler.
Exact description is below, however essentia…
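For illustration, a minimal sketch of putting both quantities on a schedule (the function name, warmup shape, decay factors, and batch-size ramp below are illustrative placeholders, not the paper's exact numbers or any framework's API):

```python
def joint_schedule(step, total_steps,
                   peak_lr=2.4e-4, warmup_steps=2000,
                   min_batch=2304, max_batch=9216, ramp_frac=0.1):
    """Return (lr, batch_size) for a given training step.

    LR: linear warmup to peak, then two step decays (x0.316 each)
    at 60% and 90% of training. Batch size: linear ramp from
    min_batch to max_batch over the first ramp_frac of training.
    All constants are illustrative.
    """
    # --- learning rate: warmup, then step decay ---
    if step < warmup_steps:
        lr = peak_lr * step / warmup_steps
    else:
        lr = peak_lr
        if step >= 0.6 * total_steps:
            lr *= 0.316
        if step >= 0.9 * total_steps:
            lr *= 0.316
    # --- batch size: staged ramp early in training ---
    ramp_steps = int(ramp_frac * total_steps)
    if step >= ramp_steps:
        batch = max_batch
    else:
        batch = min_batch + (max_batch - min_batch) * step // ramp_steps
    return lr, int(batch)
```

In practice the batch-size side would need the data loader (and gradient-accumulation count) to be rebuilt at each ramp boundary, which is the part most trainers don't support out of the box.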
-
I'm working on 32k long-text SFT for Qwen2-72B. When I set **seq_parallel_world_size** to a value greater than one and **use_varlen_attn** to true, an error occurs.
After checking, the error message is a…
-
The Fluxion scheduler provides a `t_estimate` job annotation, which `flux jobs` displays by default in the generic `INFO` column for jobs in the SCHED state. This is very useful, but typically I have …