-
Sharding optimizer state across devices saves significant memory and reflects current practice; we want to support it (a minimal sketch of one approach follows the list below).
* We want to switch from no sharding to naive model parameter…
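For illustration only, a minimal sketch of optimizer-state sharding using PyTorch's `ZeroRedundancyOptimizer`; the toy model, optimizer choice, and launch assumptions (torchrun sets `LOCAL_RANK`) are placeholders, not the proposal's actual design:

```python
# Sketch: shard Adam's moment buffers across ranks with ZeroRedundancyOptimizer.
# Assumes launch via torchrun, which sets LOCAL_RANK/RANK/WORLD_SIZE.
import os

import torch
import torch.distributed as dist
from torch.distributed.optim import ZeroRedundancyOptimizer
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = DDP(torch.nn.Linear(4096, 4096).cuda(), device_ids=[local_rank])

# Each rank keeps only its shard of the optimizer state, so per-device
# optimizer memory drops roughly by a factor of the world size.
optimizer = ZeroRedundancyOptimizer(
    model.parameters(),
    optimizer_class=torch.optim.Adam,
    lr=1e-4,
)

loss = model(torch.randn(8, 4096, device="cuda")).sum()
loss.backward()
optimizer.step()
optimizer.zero_grad()
dist.destroy_process_group()
```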
-
### 🐛 Describe the bug
I'm trying to train a LLaMA model with all linear layers + embeddings and head.
Whilst embeddings have no problems with FSDP over Liger, there are always exceptions when [ lm_head…
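Not the reporter's exact setup, but for context, a minimal sketch of wrapping a Hugging Face LLaMA model with FSDP using a transformer auto-wrap policy; the checkpoint name and wrapping choices are assumptions:

```python
# Sketch (assumed setup, not the reporter's script): FSDP-wrap a LLaMA model,
# auto-wrapping each decoder layer; embeddings and lm_head land in the root unit.
import functools
import os

import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy
from transformers import AutoModelForCausalLM
from transformers.models.llama.modeling_llama import LlamaDecoderLayer

dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",  # placeholder checkpoint
    torch_dtype=torch.bfloat16,
)

wrap_policy = functools.partial(
    transformer_auto_wrap_policy,
    transformer_layer_cls={LlamaDecoderLayer},
)
model = FSDP(model, auto_wrap_policy=wrap_policy, device_id=local_rank)
```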
-
### Please check that this issue hasn't been reported before.
- [X] I searched previous [Bug Reports](https://github.com/axolotl-ai-cloud/axolotl/labels/bug) and didn't find any similar reports.
###…
-
Hello, I plan to use the student and teacher weights from my DINOv2 model (trained with FSDP on 2 nodes, 16 GPUs in total, 8 GPUs per node) for downstream use in a different distil…
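One common way to reuse FSDP-trained weights in a different downstream setup is to gather a full state dict on rank 0 and save it as a plain checkpoint; a minimal sketch with a placeholder module standing in for the DINOv2 student, not the poster's actual code:

```python
# Sketch: consolidate an FSDP-sharded model into one full state dict on rank 0
# so the weights can later be loaded into a plain (non-FSDP) model.
import torch
import torch.distributed as dist
from torch.distributed.fsdp import (
    FullStateDictConfig,
    FullyShardedDataParallel as FSDP,
    StateDictType,
)

dist.init_process_group(backend="nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

# Placeholder module standing in for the DINOv2 student/teacher backbone.
model = FSDP(torch.nn.Linear(1024, 1024).cuda())

save_policy = FullStateDictConfig(offload_to_cpu=True, rank0_only=True)
with FSDP.state_dict_type(model, StateDictType.FULL_STATE_DICT, save_policy):
    full_state_dict = model.state_dict()

if dist.get_rank() == 0:
    # Plain checkpoint loadable with load_state_dict() outside FSDP.
    torch.save(full_state_dict, "student_full.pt")  # placeholder path
```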
-
### System Info
```Shell
- `Accelerate` version: 0.31.0
- Platform: Linux-5.15.0-125-generic-x86_64-with-glibc2.35
- `accelerate` bash location:
- Python version: 3.10.12
- Numpy version: 1.2…
-
### System Info
trl, transformers: most recent on github
python 3.10.11
ubuntu 22
package versions:
```
accelerate==1.0.1
addict==2.4.0
aiohappyeyeballs==2.4.3
aiohttp==3.10.10
aiosignal…
-
repro:
```
CONFIG_FILE="./train_configs/llama3_8b.toml" ./run_llama_train.sh --float8.enable_float8_linear --float8.enable_fsdp_float8_all_gather --float8.scaling_type_weight "delayed" --metrics.lo…
-
```
7: [rank80]: urllib3.exceptions.ReadTimeoutError: HTTPSConnectionPool(host='huggingface.co', port=443): Read timed out. (read timeout=10)
```
Running the FSDP example on 16 p5 nodes. The example w…
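One way to make a large multi-node run less sensitive to this kind of Hub read timeout is to raise the hub timeouts or run every rank offline from a warm cache; a sketch using huggingface_hub environment variables (names apply to recent releases, and the model id is illustrative):

```python
# Sketch: raise Hugging Face Hub timeouts, or run fully offline from a warm cache,
# before any rank touches huggingface.co. These env vars must be set before
# huggingface_hub / transformers are imported; verify the names against your
# installed huggingface_hub version.
import os

os.environ.setdefault("HF_HUB_ETAG_TIMEOUT", "60")      # metadata requests (default ~10s)
os.environ.setdefault("HF_HUB_DOWNLOAD_TIMEOUT", "60")  # file downloads
# Or pre-download on one node, then run every rank offline:
# os.environ["HF_HUB_OFFLINE"] = "1"
# os.environ["TRANSFORMERS_OFFLINE"] = "1"

from transformers import AutoModelForCausalLM  # noqa: E402

model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")  # placeholder id
```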
-
I am running the full finetune distributed recipe. When I set `clip_grad_norm: 1.0` and `fsdp_cpu_offload: True`, it raises the error
`RuntimeError: No backend type associated with device type cpu`
…
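The error suggests the default process group only has a CUDA (NCCL) backend, while gradient clipping over CPU-offloaded parameters needs a CPU-capable backend; a possible workaround sketch (not the recipe's actual fix) is to register both Gloo and NCCL:

```python
# Possible workaround sketch (not the recipe's actual fix): register both a CPU
# (gloo) and a CUDA (nccl) backend on the default process group so collectives
# on CPU-offloaded tensors, e.g. during grad-norm clipping, have a backend.
import os

import torch
import torch.distributed as dist

local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# "cpu:gloo,cuda:nccl" maps CPU tensors to gloo and CUDA tensors to nccl.
dist.init_process_group(backend="cpu:gloo,cuda:nccl")
```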
-
### Bug description
In `FSDPStrategy.save_checkpoint`, the `filepath` variable is transformed via
https://github.com/Lightning-AI/pytorch-lightning/blob/3627c5bfac704d44c0d055a2cdf6f3f9e3f9e8c1/src/…
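For context, a minimal sketch of how a `filepath` typically reaches `FSDPStrategy.save_checkpoint` through the Trainer; the toy module, device count, and checkpoint path are placeholders, not the reporter's code:

```python
# Sketch (placeholder model and path): Trainer.save_checkpoint(filepath) hands the
# path to the active strategy, here FSDPStrategy.save_checkpoint.
import torch
from torch.utils.data import DataLoader
from lightning.pytorch import LightningModule, Trainer
from lightning.pytorch.strategies import FSDPStrategy


class ToyModel(LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def training_step(self, batch, batch_idx):
        return self.layer(batch).sum()

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)


trainer = Trainer(accelerator="gpu", devices=2, strategy=FSDPStrategy(), max_epochs=1)
trainer.fit(ToyModel(), DataLoader(torch.randn(64, 32), batch_size=8))
trainer.save_checkpoint("checkpoints/toy.ckpt")  # this filepath reaches the strategy
```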