Closed alex-athanassakos closed 1 year ago
Hi @alex-athanassakos! Did you manage to get this to work?
I am running into another issue with a similar setup. It looks like the size of the weight tensor in the Hydra model (AFAIU - the value head) is 0, which I guess means the weight tensor is uninitialized. It feels different from your issue, where the list of tensors is empty — but maybe the root cause is similar?
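For context, here is a hedged sketch (not trlx code) of why a parameter can look uninitialized under ZeRO stage 3: DeepSpeed shards each parameter across ranks and leaves a zero-element placeholder on the module, so a naive size check outside a gather context sees `numel() == 0`. The helper name `find_empty_params` and the parameter names are hypothetical:

```python
# Hypothetical helper: flag parameters whose storage appears empty,
# as happens outside a ZeRO-3 gather context (deepspeed.zero.GatheredParameters).
def find_empty_params(named_numels):
    """named_numels: iterable of (param_name, numel) pairs."""
    return [name for name, numel in named_numels if numel == 0]

# Simulated module state: the value head's weight has been partitioned away.
params = [
    ("transformer.wte.weight", 50257 * 512),
    ("v_head.weight", 0),  # 0 elements -> looks uninitialized
]
print(find_empty_params(params))  # ['v_head.weight']
```

So a zero-size weight tensor is not necessarily corrupt; it may just be a partitioned parameter inspected without gathering it first.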
I replicated the same issue with a different example file (`ppo_sentiments_llama.py`), configs (below), and model (`EleutherAI/pythia-70m`). I use a custom trlx fork based on `trlx==0.6.0` with a few unrelated changes (treat it as identical).

```
accelerate launch trlx/examples/ppo_sentiments_llama.py --model_path EleutherAI/pythia-70m
```
`default_config.yaml`:

```yaml
compute_environment: LOCAL_MACHINE
deepspeed_config:
  deepspeed_config_file: <path_to_config_dir>/deepspeed.json
  zero3_init_flag: true
distributed_type: DEEPSPEED
downcast_bf16: 'no'
machine_rank: 0
main_training_function: main
num_machines: 1
num_processes: 4
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: true
```
`deepspeed.json`:

```json
{
  "bf16": {
    "enabled": true
  },
  "zero_optimization": {
    "stage": 3,
    "overlap_comm": true,
    "contiguous_gradients": true,
    "sub_group_size": 1e9,
    "reduce_bucket_size": 1e9,
    "stage3_prefetch_bucket_size": 5e8,
    "stage3_param_persistence_threshold": 1e6,
    "stage3_max_live_parameters": 1e9,
    "stage3_max_reuse_distance": 1e9,
    "stage3_gather_16bit_weights_on_model_save": true
  },
  "gradient_accumulation_steps": 1,
  "gradient_clipping": 1.0,
  "steps_per_print": 2000,
  "train_batch_size": 32,
  "train_micro_batch_size_per_gpu": 8,
  "wall_clock_breakdown": false
}
```
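As an aside, DeepSpeed validates that the batch-size fields in this JSON are mutually consistent: `train_batch_size` must equal `train_micro_batch_size_per_gpu * gradient_accumulation_steps * world_size`. A quick sanity check against the values above, taking world size 4 from `num_processes` in `default_config.yaml`:

```python
# Batch-size consistency rule that DeepSpeed enforces at engine init.
train_micro_batch_size_per_gpu = 8
gradient_accumulation_steps = 1
world_size = 4  # num_processes in default_config.yaml

train_batch_size = (train_micro_batch_size_per_gpu
                    * gradient_accumulation_steps
                    * world_size)
print(train_batch_size)  # 32, matching "train_batch_size" in deepspeed.json
```

So these configs at least agree with each other; the failure is not a batch-size mismatch.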
Thanks for the info @nikebless. I never did get this to work. I had a similar issue to yours with the reward model in my custom script: the embedding weights were uninitialized. I ended up using `ppo_sentiments.py` for debugging, and posted the error it gave me because I thought that made things simpler than using my own script. But it does sound related to your issue!
Resolved with https://github.com/CarperAI/trlx/pull/489
🐛 Describe the bug
Hi!
I have been trying to use Accelerate with DeepSpeed to launch my trlx scripts and have been running into this error in several instances. I just reproduced it with `ppo_sentiments.py` (commit c9ab683).
I used this to launch it:

```
accelerate launch --config_file accelerate_config_example.yaml ppo_sentiments.py
```

with `accelerate_config_example.yaml` consisting of:

I am using a g5.12xlarge with Deep Learning AMI Neuron PyTorch 1.13.0 (Ubuntu 20.04) 20230330.
Let me know if you need more info!
Which trlX version are you using?
0.5.0
Additional system and package information
Python 3.8.10, transformers==4.28.0, Ubuntu 20.04, CUDA 11.7, torch==2.0.0