-
Objective: train and evaluate a model on the RAGTruth dataset
Settings:
OS: Ubuntu WSL
Python: 3.12.4
NVIDIA Driver Version: 536.23
CUDA Version: 12.2
Replication steps:
1. Git clone
2. Run…
-
OSError: Unable to load weights from pytorch checkpoint file for 'llama-7b-hf\pytorch_model-00002-of-00002.bin' at 'llama-7b-hf\pytorch_model-00002-of-00002.bin'. If you tried to load a PyTorch model …
-
Hi!
There appears to be an inconsistency in the behavior of the optimizer before and after wrapping with Fully Sharded Data Parallel (FSDP).
When FSDP wraps the optimizer, it seems to modify the s…
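One way to see the kind of inconsistency described above: an optimizer holds references to the exact parameter tensors it was constructed with, so if a wrapper swaps a module's parameters for new (e.g. flattened) tensors, an optimizer created before wrapping silently stops updating the live parameters. The following is a minimal illustration using plain parameter replacement, not FSDP itself (which additionally shards the flattened parameters):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(4, 4, bias=False)

# Optimizer built BEFORE the parameter is replaced: it captures a
# reference to the original weight tensor.
opt = torch.optim.SGD(model.parameters(), lr=0.1)

# Simulate what a wrapper like FSDP does: swap the parameter for a new
# tensor (FSDP replaces originals with flattened shards).
old_weight = model.weight
model.weight = nn.Parameter(old_weight.detach().clone())

before = model.weight.detach().clone()
model(torch.randn(2, 4)).sum().backward()
opt.step()

# The live parameter received a gradient but was never updated: the
# optimizer only knows about the orphaned old tensor.
assert model.weight.grad is not None
assert torch.equal(model.weight.detach(), before)
```

This is why the usual guidance is to construct the optimizer only after wrapping the model.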
-
I would like to train a model using two or more machines. After setting up the default configuration file using accelerate config, it seems that when I call train_db.py, it is not actually using the c…
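One common cause of a saved config being ignored: the file written by `accelerate config` is only consulted when the script is started through `accelerate launch`; invoking the script with plain `python` bypasses it. A sketch of an explicit launch on each node (the config path shown is accelerate's default location and may differ on your machine):

```shell
# Run on every machine; the per-node rank and main-process address live
# in the config written by `accelerate config` (default path shown).
accelerate launch \
  --config_file ~/.cache/huggingface/accelerate/default_config.yaml \
  train_db.py
```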
-
Personally, I have found that monitoring the gradient norm is useful for understanding training stability. It also helps with setting an appropriate clipping value (though I don't think torchtune supports grad no…
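A sketch of the monitoring-plus-clipping pattern in plain PyTorch (the toy model and values here are illustrative): `clip_grad_norm_` returns the total gradient norm measured *before* clipping, so a single call serves both purposes.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(10, 1)
x, y = torch.randn(32, 10), torch.randn(32, 1)

loss = nn.functional.mse_loss(model(x), y)
loss.backward()

# Returns the global grad norm *before* clipping: log it to monitor
# stability, while the call also rescales gradients in place.
max_norm = 1.0
total_norm = nn.utils.clip_grad_norm_(model.parameters(), max_norm)
print(f"grad norm before clipping: {float(total_norm):.4f}")

# After the call, the global grad norm is at most max_norm.
post = torch.norm(torch.stack([p.grad.norm() for p in model.parameters()]))
assert float(post) <= max_norm + 1e-6
```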
-
### Describe the bug
When using the DeepSpeed backend, training runs fine but gets stuck in `accelerator.save_state(save_path)`. When using MULTI_GPU, the process completes OK.
The training script is
```
accele…
-
Running DPO with Qwen, I hit a flattening problem. The FSDP config is as follows:
```yaml
compute_environment: LOCAL_MACHINE
debug: false
distributed_type: FSDP
downcast_bf16: 'no'
fsdp_config:
fsdp_auto_w…
-
### Reminder
- [X] I have read the README and searched the existing issues.
### System Info
I'm using the latest llamafactory version.
### Reproduction
Hi, I'm trying to use qlora+fsdp …
-
Running script:
```sh
export PYTHONPATH=.
accelerate launch --config_file=./pipeline/accelerate_configs/accelerate_config_fsdp.yaml \
./pipeline/train/instruction_following.py \
--pretrained_mode…
-
### Please check that this issue hasn't been reported before.
- [X] I searched previous [Bug Reports](https://github.com/OpenAccess-AI-Collective/axolotl/labels/bug) and didn't find any similar reports…