-
I recently implemented my own model using torch_xla FSDP on GPU, but encountered the error "Check failed: ShapeUtil::Compatible".
`2023-05-26 17:36:08.508196: F external/org_tensorflow/ten…
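For context, a simplified sketch of the setup that triggers this (the toy model, shapes, and single-step loop are placeholders for my actual code; single host assumed):

```python
import torch
import torch.nn as nn
import torch_xla.core.xla_model as xm
from torch_xla.distributed.fsdp import XlaFullyShardedDataParallel as FSDP

device = xm.xla_device()
# Placeholder model; the real one is larger but wrapped the same way.
model = nn.Sequential(nn.Linear(128, 512), nn.ReLU(), nn.Linear(512, 128)).to(device)
model = FSDP(model)  # shards parameters across participating devices

optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
out = model(torch.randn(8, 128, device=device))
out.sum().backward()  # FSDP reduce-scatters gradients during backward
optimizer.step()
xm.mark_step()  # cut and execute the pending XLA graph
```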
-
## ❓ Questions and Help
When running on a vp-128 TPU pod (even when sharding only by the batch dimension), we see very low performance compared to the same pod without SPMD.
Do you have any…
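For context, a minimal sketch of the batch-dimension sharding we apply (mesh shape and tensor sizes are illustrative; names follow the torch_xla SPMD docs):

```python
import numpy as np
import torch
import torch_xla.core.xla_model as xm
import torch_xla.runtime as xr
import torch_xla.distributed.spmd as xs

xr.use_spmd()  # enable SPMD execution mode before creating tensors

num_devices = xr.global_runtime_device_count()
# 1D data-parallel mesh: all devices along the 'data' axis.
mesh = xs.Mesh(np.arange(num_devices), (num_devices,), ('data',))

batch = torch.randn(128, 1024).to(xm.xla_device())
# Shard only dim 0 (batch); dim 1 stays replicated.
xs.mark_sharding(batch, mesh, ('data', None))
```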
-
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
-
I found that Gemma, the latest open-source LLM from Google, has two versions of its model structure:
1. https://github.com/google/gemma_pytorch/blob/main/gemma/model_xla.py
2. https://github.com/google/gemma_…
-
### Please check that this issue hasn't been reported before.
- [X] I searched previous [Bug Reports](https://github.com/OpenAccess-AI-Collective/axolotl/labels/bug) didn't find any similar reports.
…
-
### 🚀 The feature, motivation and pitch
Some items we can add under `TORCH_DISTRIBUTED_DEBUG` mode to improve the debuggability of FSDP (a sketch of enabling the existing mode follows the list):
- Shared parameter detection
- Logging when backward hooks are fi…
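For reference, a minimal sketch of how today's debug mode is enabled, which these items would extend (the `nccl` backend and a torchrun-style launch are assumptions):

```python
import os

# Must be set before init_process_group; DETAIL layers extra collective
# consistency checks and logging on top of INFO.
os.environ["TORCH_DISTRIBUTED_DEBUG"] = "DETAIL"

import torch.distributed as dist

dist.init_process_group(backend="nccl")  # assumes torchrun set MASTER_ADDR etc.
```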
-
Definition of done:
Implement training of large models using FSDP to accelerate training on large datasets.
Reference: https://pytorch.org/blog/introducing-pytorch-fully-sharded-data-parallel-api/
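A minimal sketch of the API from the referenced post (toy model; assumes a torchrun launch with one process per GPU):

```python
import os
import torch
import torch.nn as nn
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

dist.init_process_group(backend="nccl")
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))  # set by torchrun

# Placeholder model; real training would wrap a large transformer.
model = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024)).cuda()
model = FSDP(model)  # parameters, grads, and optimizer state are sharded

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
out = model(torch.randn(8, 1024, device="cuda"))
out.sum().backward()
optimizer.step()
```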
-
Looks like there are some breaking changes to the FSDP API in PyTorch 2.1.
For example, `dinov2.fsdp.__init__.py::free_if_fsdp` is broken when using torch==2.1: `AttributeError: 'DinoVisionTransfor…
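Until the helpers are ported, a hedged workaround sketch is to fail fast on unsupported torch versions (the guard is generic; the message text is mine, not dinov2's):

```python
from packaging import version
import torch

# dinov2's FSDP helpers rely on pre-2.1 FSDP internals.
if version.parse(torch.__version__) >= version.parse("2.1"):
    raise RuntimeError(
        "dinov2's FSDP helpers target torch<2.1; pin torch==2.0.* "
        "or port free_if_fsdp to the new FSDP internals."
    )
```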
-
### 🚀 The feature, motivation and pitch
Is there a plan to add FP8 support for training?
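For reference, FP8 training is available today via NVIDIA Transformer Engine, which could inform the integration; a minimal sketch (assumes a Hopper-class GPU and the `transformer_engine` package; names per TE's docs):

```python
import torch
import transformer_engine.pytorch as te
from transformer_engine.common import recipe

# DelayedScaling is Transformer Engine's standard FP8 scaling recipe;
# HYBRID uses E4M3 for the forward pass and E5M2 for gradients.
fp8_recipe = recipe.DelayedScaling(margin=0, fp8_format=recipe.Format.HYBRID)

layer = te.Linear(1024, 1024, bias=True).cuda()
x = torch.randn(8, 1024, device="cuda")

with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    out = layer(x)  # matmul runs in FP8 with per-tensor scaling
out.sum().backward()
```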
### Alternatives
_No response_
### Additional context
_No response_