-
When using ZeRO-3, what is the equivalent of PyTorch `FSDP`'s [auto-wrap policy](https://pytorch.org/docs/stable/fsdp.html#torch.distributed.fsdp.FullyShardedDataParallel)?
This policy lets users s…
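For reference, a minimal sketch of the FSDP auto-wrap policy being asked about, assuming a Transformer whose block class is `GPT2Block` (an illustrative choice, not from the original question):

```python
import functools

from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy
from transformers.models.gpt2.modeling_gpt2 import GPT2Block

def wrap_model(model):
    # Wrap each GPT2Block in its own FSDP unit, so parameters are sharded
    # and gathered per block instead of for the whole model at once.
    policy = functools.partial(
        transformer_auto_wrap_policy,
        transformer_layer_cls={GPT2Block},
    )
    return FSDP(model, auto_wrap_policy=policy)
```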
-
Hi,
In the inference scripts, I see that there is no option to perform inference with FSDP.
Is `model.generate` not recommended when the model is wrapped in FSDP? Or in DDP?
Thanks
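For context, a sketch of one workaround pattern discussed for this situation, assuming the unsharded model fits in each rank's memory (`fsdp_model`, `ddp_model`, and `input_ids` are placeholders):

```python
import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

def generate_with_fsdp(fsdp_model: FSDP, input_ids: torch.Tensor) -> torch.Tensor:
    # FSDP shards parameters outside of forward(), so gather them first,
    # then call generate on the inner (unwrapped) module.
    with FSDP.summon_full_params(fsdp_model):
        return fsdp_model.module.generate(input_ids, max_new_tokens=32)

# Under plain DDP the wrapper only intercepts forward(), so calling
# generate on the inner module is enough:
#     output_ids = ddp_model.module.generate(input_ids, max_new_tokens=32)
```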
-
### Bug description
It is expected that on a single GPU, the DDP and DeepSpeed strategies (e.g. `deepspeed_stage_1`, `deepspeed_stage_2`, and so on) should give exactly the same loss values (if the seed is fi…
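For context, a minimal sketch of such a comparison, assuming PyTorch Lightning; `MyLitModule` and `train_loader` are hypothetical placeholders:

```python
import pytorch_lightning as pl

# Run the same one-epoch fit under each strategy with a fixed seed,
# then compare the logged losses across runs.
for strategy in ("ddp", "deepspeed_stage_1", "deepspeed_stage_2"):
    pl.seed_everything(42, workers=True)  # re-seed before every run
    trainer = pl.Trainer(
        accelerator="gpu",
        devices=1,
        strategy=strategy,
        deterministic=True,  # prefer deterministic kernels where available
        max_epochs=1,
    )
    trainer.fit(MyLitModule(), train_loader)
```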
-
https://github.com/Alpha-VLLM/Lumina-mGPT/blob/c8e180aa20f0a5977bf168424f30aa2be58fad94/lumina_mgpt/model/modeling_xllmx_chameleon.py#L50
The mask should be calculated using the shifted labels (lab…
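For reference, the standard shifted-label pattern for causal-LM losses, sketched generically (not the repo's exact code): the valid-token mask has to be derived from the shifted labels, otherwise it is misaligned with the logits by one position.

```python
import torch
import torch.nn.functional as F

def causal_lm_loss(logits: torch.Tensor, labels: torch.Tensor,
                   ignore_index: int = -100) -> torch.Tensor:
    # Shift so that the logits at position t predict the token at t+1.
    shift_logits = logits[:, :-1, :].contiguous()
    shift_labels = labels[:, 1:].contiguous()
    # The mask must come from the *shifted* labels.
    mask = shift_labels != ignore_index
    per_token = F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
        ignore_index=ignore_index,
        reduction="none",  # ignored positions contribute 0
    )
    # Average over valid tokens only.
    return per_token.sum() / mask.sum().clamp(min=1)
```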
-
### System Info
**Baseline.** On a single p4de.24xlarge instance (640 GB GPU, 1000 GB CPU), I am able to use QLoRA (4-bit) to train a large model with a size close to 300B. `device_map` is set as `au…
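For context, a minimal sketch of a 4-bit QLoRA setup with `device_map="auto"`, assuming `transformers`, `bitsandbytes`, and `peft`; the model id and LoRA hyperparameters are placeholders:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# Load the base model in 4-bit NF4, letting Accelerate spread the
# layers across the visible GPUs (and CPU, if needed).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "org/some-300b-model",  # placeholder model id
    quantization_config=bnb_config,
    device_map="auto",
)
# Attach trainable low-rank adapters on top of the frozen 4-bit weights.
model = get_peft_model(
    model, LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM")
)
```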
-
I checked this [issue](https://github.com/EleutherAI/lm-evaluation-harness/issues/714#top), which describes a problem similar to mine; however, using the latest main branch doesn't solve it!
## Model:
- F…
-
Liger (LinkedIn GPU Efficient Runtime) Kernel is a collection of Triton kernels designed specifically for LLM training. It can effectively increase multi-GPU training throughput by 20% and reduce mem…
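For context, a minimal sketch of applying Liger's kernels to a Llama-family model, assuming the `liger_kernel` package's `transformers` patch helpers:

```python
from liger_kernel.transformers import apply_liger_kernel_to_llama
from transformers import AutoModelForCausalLM

# Monkey-patch the Llama modeling code with Liger's Triton kernels
# (RoPE, RMSNorm, SwiGLU, fused losses) before instantiating the model.
apply_liger_kernel_to_llama()
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
```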
-
### 🐛 Describe the bug
When using 2D parallelism (FSDP + TP), I found that DCP hangs if I set `full_state_dict=True`. The reason I set `full_state_dict=True` is that the HuggingFace Trainer needs to save the full s…
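For context, a sketch of the full-state-dict gather being described, assuming the `torch.distributed.checkpoint.state_dict` helpers from recent PyTorch (`model` is a placeholder for the FSDP+TP model):

```python
import torch
import torch.distributed as dist
from torch.distributed.checkpoint.state_dict import (
    StateDictOptions,
    get_model_state_dict,
)

# Gather a full (unsharded) state dict so the HuggingFace Trainer can
# save it. This is a collective: every rank must reach this call, or
# the gather will hang.
options = StateDictOptions(full_state_dict=True, cpu_offload=True)
state_dict = get_model_state_dict(model, options=options)
if dist.get_rank() == 0:
    torch.save(state_dict, "model.pt")
```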
-
### Is there an existing issue for this?
- [X] I have searched the existing issues
### Current Behavior
Running ds_train_finetune.sh
root@074e2a33256d:~/ChatGLM-6B/ptuning# sh ds_train_finetune.…
-
### 🐛 Bug description
Command: accelerate launch python scripts/train_m3e.py /path/to/model_base/ /path/to/dataset/
Error:
/mnt/cache/zhaofufangchen/anaconda3/envs/m3e/lib/python3.10/site-packages/accelerate/d…