-
As discussed previously in https://github.com/kubeflow/training-operator/pull/2021#issuecomment-1987733922, we want to add more AI/ML examples to the Kubeflow Training Operator. Right now, most of our…
-
### 🚀 The feature, motivation and pitch
**Background**
DistributedDataParallel (DDP) uses `Reducer` to bucket and issue `allreduce` calls. The main entry point of `Reducer` is through the gradient …
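Since `Reducer` is driven by autograd hooks, its per-bucket `allreduce` can be observed or replaced through DDP's public communication-hook API. Below is a minimal sketch, not the feature proposed above: it assumes a single-process `gloo` group so the snippet is self-contained, and the model, port, and sizes are arbitrary illustration values.

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.distributed.algorithms.ddp_comm_hooks import default_hooks

# Single-process group purely so the sketch is runnable
os.environ.setdefault('MASTER_ADDR', 'localhost')
os.environ.setdefault('MASTER_PORT', '29500')
dist.init_process_group('gloo', rank=0, world_size=1)

model = DDP(torch.nn.Linear(8, 8))
# Swap the Reducer's built-in allreduce for an explicit comm hook
model.register_comm_hook(state=None, hook=default_hooks.allreduce_hook)

out = model(torch.randn(4, 8))
out.sum().backward()  # gradient hooks fire the Reducer, which calls the hook per bucket
```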
-
### Please check that this issue hasn't been reported before.
- [X] I searched previous [Bug Reports](https://github.com/OpenAccess-AI-Collective/axolotl/labels/bug) and didn't find any similar reports.
…
-
I am using a single GPU (A10) to fine-tune the Bloom-560m model and I get an error. How can I solve it? I found similar problems in other projects, but I don't know how to solve it in alpaca:
https://github.c…
-
### 🚀 The feature, motivation and pitch
The [MultiheadAttention](https://github.com/pytorch/pytorch/blob/2fbe6ef2f866fe6ce42a950f2053f2f6b4bdab90/torch/nn/modules/activation.py) layer has a protected…
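For context, here is a minimal usage sketch of the layer in question; it shows only the layer's public call signature, not the (truncated) change being requested, and the embedding size, head count, and tensor shapes are arbitrary illustration values.

```python
import torch
from torch import nn

# Self-attention through nn.MultiheadAttention with batch-first tensors
mha = nn.MultiheadAttention(embed_dim=16, num_heads=4, batch_first=True)
x = torch.randn(2, 5, 16)             # (batch, seq, embed)
out, attn_weights = mha(x, x, x)      # query = key = value
print(out.shape, attn_weights.shape)  # (2, 5, 16) and (2, 5, 5)
```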
-
# Release Manager
@cp5555
# Endgame
- [x] Code freeze: Feb. 9th, 2024
- [x] Bug Bash date: Feb. 12th, 2024
- [x] Release date: Feb. 23rd, 2024
# Main Features
## MS-AMP O3 Optimization
-…
-
### Bug description
Hello! Thank you for the integration of FSDP into the Lightning Trainer - it's a game changer.
I tried to switch from `lightning==1.9.4` to the newest `lightning==2.0.4` but obs…
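For reference, a minimal sketch of selecting FSDP on the 2.x `Trainer`; the accelerator and device count are arbitrary illustration values.

```python
import lightning as L
from lightning.pytorch.strategies import FSDPStrategy

# Lightning >= 2.0: select FSDP by name...
trainer = L.Trainer(accelerator="gpu", devices=2, strategy="fsdp")
# ...or via the strategy object when it needs configuration
trainer = L.Trainer(accelerator="gpu", devices=2, strategy=FSDPStrategy())
```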
-
## ❓ Questions and Help
This should explain the case:
```python
import torch
from fairscale.nn.data_parallel import FullyShardedDataParallel
import os
os.environ['MASTER_ADDR'] = 'localhos…
```
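A runnable version of that setup might look like the following. This is a sketch only: it assumes a single-process `gloo` group on CPU so the snippet is self-contained (fairscale's FSDP is primarily exercised on CUDA), and the port and model are arbitrary.

```python
import os
import torch
import torch.distributed as dist
from fairscale.nn.data_parallel import FullyShardedDataParallel

# Single-process group purely for illustration (port is arbitrary)
os.environ['MASTER_ADDR'] = 'localhost'
os.environ['MASTER_PORT'] = '29501'
dist.init_process_group(backend='gloo', rank=0, world_size=1)

# Wrapping a module shards its parameters across ranks
model = FullyShardedDataParallel(torch.nn.Linear(8, 8))
```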
-
For DDP, rank 0's weights are synced to all ranks before the forward. For FSDP, it would be nice to have a way to do this so that weights that differ across ranks are made consistent before the forward…
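One way to get that DDP-style consistency today is to broadcast rank 0's parameters and buffers before wrapping the module; PyTorch's native FSDP also exposes a `sync_module_states=True` constructor argument that does this at wrap time. Below is a minimal sketch: the single-process group and the `Linear` stand-in are only there to keep it self-contained.

```python
import os
import torch
import torch.distributed as dist

os.environ.setdefault('MASTER_ADDR', 'localhost')
os.environ.setdefault('MASTER_PORT', '29502')
dist.init_process_group('gloo', rank=0, world_size=1)

model = torch.nn.Linear(8, 8)  # stand-in for the module about to be wrapped

# Broadcast rank 0's parameters and buffers so every rank starts identical
for tensor in list(model.parameters()) + list(model.buffers()):
    dist.broadcast(tensor.data, src=0)
```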