-
### 🚀 The feature, motivation and pitch
When training with FSDP in fp16 mixed precision, we need to be aware of fp16's reduced dynamic range and avoid overflow/underflow. If we sum all gradients for a large world …
-
The [`default_auto_wrap_policy`](https://github.com/facebookresearch/fairscale/blob/main/fairscale/nn/wrap/auto_wrap.py#L61) function has a parameter `exclude_wrap_modules` for excluding module types …
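To make the mechanism concrete, here is a self-contained toy policy that mirrors the *shape* of such an auto-wrap policy: wrap a module only if it is large enough and its type is not in the exclusion set. This is an illustrative mock, not fairscale's actual implementation; the class names and thresholds are invented, and in real use you would bind `exclude_wrap_modules` with `functools.partial` and pass the result as `auto_wrap_policy` to the wrapper.

```python
import functools

# Stand-in module classes; in practice these would be torch.nn module types.
class ModuleList: ...
class Linear: ...

def toy_auto_wrap_policy(module, recurse, unwrapped_params,
                         min_num_params=100, exclude_wrap_modules=None):
    """Toy policy: wrap when the module has enough parameters and its
    type is not excluded. `recurse=True` asks whether to descend further."""
    exclude = exclude_wrap_modules or (ModuleList,)
    if recurse:
        return True  # always keep recursing into children
    is_large = unwrapped_params >= min_num_params
    return is_large and not isinstance(module, exclude)

# Bind the exclusion set once; the resulting callable is what you would
# hand to the wrapper as auto_wrap_policy=policy.
policy = functools.partial(toy_auto_wrap_policy,
                           exclude_wrap_modules=(ModuleList,))

print(policy(Linear(), recurse=False, unwrapped_params=500))      # True
print(policy(ModuleList(), recurse=False, unwrapped_params=500))  # False
```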
-
# 🐛 Bug
Using xformers.memory_efficient_attention with FSDP and torch.compile fails when using bfloat16, but works when using float32. It's unclear to me if this is an xformers bug, an FSDP bug, or…
-
I replicated the pythia28 experiments on hh (Anthropic/hh-rlhf) using the open-source code. Here are some of the experimental results:
**SFT1**:
~~~
python -u train.py exp_name=sft gradient_ac…
~~~
-
I am finetuning Vicuna using 4 × A100-80G GPUs. I ran into a problem after training finished:
```
{'loss': 1.3641, 'learning_rate': 4.815273327803183e-08, 'epoch': 0.97}
{'loss': 1.35, 'learning_ra…
```
-
Hi! I'm using two A100 GPUs, each with 40GB of memory. This is the GPU memory utilization for my training: I'm reaching over 90% memory utilization on both A100 GPUs.
![image](https://github.…
-
Hello, Ashwinee Panda
I was very impressed with your work and wanted to thank you for the excellent contribution. I am currently following the tutorial using the openbookqa task to finally experime…
-
### 🐛 Describe the bug
In HSDP, the ranks within a replication group are equivalent in terms of their model and optimizer shards. In other words, any rank in a replication group can be selected to …
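The equivalence of ranks within a replication group can be sketched with plain index arithmetic. In the (assumed) HSDP layout below, `world_size` ranks form a grid of shard groups (FSDP within a replica) crossed with replication groups (DDP across replicas); the ranks in one replication group all hold the same shard index, so any one of them could serve it at checkpoint time. Function and variable names here are invented for illustration.

```python
def hsdp_groups(world_size, shard_group_size):
    """Compute shard groups (contiguous blocks of ranks, sharded within)
    and replication groups (same shard index across replicas)."""
    assert world_size % shard_group_size == 0
    num_replicas = world_size // shard_group_size
    # Shard group r: the ranks that together hold one full model replica.
    shard_groups = [
        list(range(r * shard_group_size, (r + 1) * shard_group_size))
        for r in range(num_replicas)
    ]
    # Replication group i: the ranks that own shard i in every replica;
    # these ranks are interchangeable for saving shard i.
    replica_groups = [
        [r * shard_group_size + i for r in range(num_replicas)]
        for i in range(shard_group_size)
    ]
    return shard_groups, replica_groups

shard, repl = hsdp_groups(world_size=8, shard_group_size=4)
print(shard)  # [[0, 1, 2, 3], [4, 5, 6, 7]]
print(repl)   # [[0, 4], [1, 5], [2, 6], [3, 7]]
```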
-
Hello, I am trying to do distributed training across 2 separate machines. Can anyone point me to a tutorial or demo for this? The configs created using accelerate are:
_Machine 1_:
c…
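For reference, a minimal two-machine accelerate config might look like the following. The key names come from accelerate's YAML config format, but the IP, port, and process counts below are placeholders, not the poster's elided values:

```yaml
# Machine 1 (rank 0) -- the second machine differs only in machine_rank: 1
compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
machine_rank: 0
main_process_ip: 10.0.0.1   # address of machine 0, reachable from machine 1
main_process_port: 29500
num_machines: 2
num_processes: 2            # total processes across BOTH machines
mixed_precision: 'no'
```

With matching configs on both machines, `accelerate launch train.py` is then run on each machine; the main-process IP/port must point at the rank-0 machine from everywhere.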
-
### 🐛 Describe the bug
we are using multi-node training with FSDP, and we got the following error during checkpointing through `torch/distributed/checkpoint/state_dict_saver.py`:
```
File "/opt/m…
```