-
Hello,
I am trying to fine-tune Llama 3.1 on my custom dataset. I have access to a two-node cluster with 4 GPUs on each node. I am pretty new to fine-tuning on a multi-node cluster. With whatever …
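For reference, here is a minimal sketch of what a two-node launch could look like; the script name, rendezvous hostname, and port below are hypothetical:
```python
# Run the same command on both nodes, with --node_rank 0 on the first
# node and --node_rank 1 on the second (hostname/port are placeholders):
#
#   torchrun --nnodes 2 --nproc_per_node 4 --node_rank <0|1> \
#       --rdzv_backend c10d --rdzv_endpoint node0:29500 finetune.py
#
import os
import torch
import torch.distributed as dist

def main():
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE in the environment.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    # ... build the model here, wrap it with DistributedDataParallel or
    # FSDP, and run the usual training loop ...
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```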
-
### 🚀 The feature, motivation and pitch
## Motivation: Limitation of Existing Profiling Approach
To conduct PyTorch distributed training performance analysis, the currently recommended way is to profil…
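The excerpt is cut off, but the existing approach it refers to is per-rank profiling with torch.profiler; a minimal sketch (the model and step counts are placeholders):
```python
import torch
from torch.profiler import ProfilerActivity, profile, schedule

model = torch.nn.Linear(1024, 1024).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# Profile a handful of training steps; in a distributed job each rank
# writes its own trace, viewable e.g. in TensorBoard.
with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    schedule=schedule(wait=1, warmup=1, active=3),
    on_trace_ready=torch.profiler.tensorboard_trace_handler("./traces"),
) as prof:
    for _ in range(5):
        x = torch.randn(64, 1024, device="cuda")
        loss = model(x).sum()
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        prof.step()
```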
-
When doing large-scale runs, we found that compiled_rmsnorm was producing aberrant loss curves compared to TP or async TP with rmsnorm.
Verified this repros at small scale, and thus opening an issue for …
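The original repro is cut off above, but a hypothetical minimal check along these lines would contrast torch.compile output against an eager RMSNorm (an illustrative sketch, not the actual compiled_rmsnorm code):
```python
import torch

class RMSNorm(torch.nn.Module):
    """Plain RMSNorm: x * rsqrt(mean(x^2) + eps) * weight."""
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.eps = eps
        self.weight = torch.nn.Parameter(torch.ones(dim))

    def forward(self, x):
        var = x.pow(2).mean(dim=-1, keepdim=True)
        return x * torch.rsqrt(var + self.eps) * self.weight

norm = RMSNorm(4096).cuda()
compiled_norm = torch.compile(norm)

x = torch.randn(8, 4096, device="cuda")
# A large max difference here would point at a numerics bug in the
# compiled kernel rather than in the surrounding parallelism.
print((norm(x) - compiled_norm(x)).abs().max())
```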
-
GPU setup: 2x V100 32G (four in total; two are currently occupied by someone else, and all four V100s can be used once they free up).
With the default accelerate configuration I get an error: CUDA out of memory. I noticed that in the default config the offload_optimizer_device and offload_param_device parameters are both none, so following the accelerate tutorial I set both parameters to cpu, which raises this error:
![image](h…
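For reference, the same two offload settings can be passed programmatically through accelerate's DeepSpeedPlugin; the ZeRO stage below is an assumption for illustration:
```python
from accelerate import Accelerator
from accelerate.utils import DeepSpeedPlugin

# Offloading optimizer state and parameters to CPU trades GPU memory
# for host RAM and PCIe traffic; zero_stage=3 is an illustrative choice.
ds_plugin = DeepSpeedPlugin(
    zero_stage=3,
    offload_optimizer_device="cpu",
    offload_param_device="cpu",
)
accelerator = Accelerator(deepspeed_plugin=ds_plugin)
```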
-
Hello,
I am encountering an issue when running the following code snippet:
CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --nnodes 1 --nproc_per_node 4 llama_finetuning.py \
--enable_fsdp \
--…
-
Hi, I was looking to contribute to this open-source repo. I saw these items in the TODO:
“think through support for Llama 3 models > 8B in size”
“make finetuning more full featured, more similar to nanoGPT (mi…
-
### Reminder
- [X] I have read the README and searched the existing issues.
### Reproduction
I pulled the new code and ran Accelerate + FSDP + QLoRA training, but encountered an error:
![image](https:/…
-
Let’s face it. KenLM has served us well…
…but it has its limitations. It hasn’t aged well as a language model architecture.
The first order of business is to compute a bidirectional vector representa…
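The post is cut off, but one way to get such a bidirectional representation is a BERT-style encoder; a minimal sketch using Hugging Face transformers (the model choice is an assumption):
```python
import torch
from transformers import AutoModel, AutoTokenizer

# bert-base-uncased is a placeholder; any bidirectional encoder works.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("KenLM has served us well.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Each token vector is conditioned on both left and right context,
# unlike an n-gram model such as KenLM, which only looks leftward.
token_vectors = outputs.last_hidden_state  # (1, seq_len, hidden_dim)
print(token_vectors.shape)
```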
-
How to fine-tune vicuna-7b with an A40
-
Hi @philschmid,
When I try to increase the chunk length beyond 2048, training fails with an OOM error on a g5.4xlarge.
It totally makes sense why it's happening; my question i…
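For context, activation memory grows quickly with chunk length (attention is quadratic in sequence length), and a g5.4xlarge has a single 24 GB A10G. A common mitigation is gradient checkpointing; a minimal sketch with transformers (the model name is a placeholder):
```python
from transformers import AutoModelForCausalLM

# Gradient checkpointing recomputes activations during the backward
# pass, trading compute for memory so longer chunks can fit.
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
model.gradient_checkpointing_enable()
model.config.use_cache = False  # the KV cache is not needed in training
```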