-
https://github.com/pytorch/ao/blob/main/torchao/float8/fsdp_utils.py#L44-L48
Should it be `raise NotImplementedError("Only supports dynamic scaling")`?
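A minimal sketch of what the suggested change might look like, assuming the lines in question currently guard the scaling type with a bare `assert`; the helper name and string-based check below are hypothetical:
```python
# Hypothetical helper illustrating the suggested change: raise an explicit
# NotImplementedError instead of asserting, so unsupported scaling modes
# fail with a clear message even when Python runs with assertions disabled.
def _check_scaling_is_dynamic(scaling_type: str) -> None:
    if scaling_type != "dynamic":  # placeholder check; the real code uses torchao's scaling-type enum
        raise NotImplementedError("Only supports dynamic scaling")
```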
-
I'm trying to fine-tune _BAAI/bge-m3_, a model whose max sequence length is 8k, on a retrieval task. Here's my trainer setup:
```python
model = SentenceTransformer(model_id, de…
```
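For context, a minimal sketch of a bge-m3 retrieval fine-tune with the v3 `SentenceTransformerTrainer` API; the dataset, loss, and training arguments below are placeholders, not the poster's actual setup:
```python
from datasets import Dataset
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
)
from sentence_transformers.losses import MultipleNegativesRankingLoss

model = SentenceTransformer("BAAI/bge-m3")
model.max_seq_length = 8192  # bge-m3 supports sequences up to 8k tokens

# Placeholder (query, positive passage) pairs for a retrieval task.
train_dataset = Dataset.from_dict({
    "anchor": ["what does FSDP stand for?"],
    "positive": ["FSDP stands for Fully Sharded Data Parallel."],
})

loss = MultipleNegativesRankingLoss(model)
args = SentenceTransformerTrainingArguments(
    output_dir="bge-m3-retrieval",
    per_device_train_batch_size=2,  # long sequences are memory-hungry
    num_train_epochs=1,
)
trainer = SentenceTransformerTrainer(
    model=model, args=args, train_dataset=train_dataset, loss=loss
)
trainer.train()
```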
-
Following the instructions in the [HyperPod EKS workshop](https://catalog.workshops.aws/sagemaker-hyperpod-eks/en-US/02-fsdp/02-train), running the FSDP EKS example on 2 p5 nodes fails with the followi…
-
Thank you for your work!
I was using FSDP + QLoRA to fine-tune Llama 3 70B on 8× A100 80G, and I encountered this error:
```shell
Traceback (most recent call last):
  File "/mnt/209180/qis…
```
-
## 📚 Documentation
There have been some common questions about FSDP regarding how to wrap the model, how `flatten_parameters` works, etc. I think we should add an FAQ section to the https://github.com/pytorc…
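As one example of the kind of snippet such an FAQ could include, here is a hedged sketch of wrapping a model with PyTorch's FSDP using a size-based auto-wrap policy; the toy model and the 1M-parameter threshold are arbitrary:
```python
import functools
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp.wrap import size_based_auto_wrap_policy

# Toy model; in practice this would be the user's transformer.
# Assumes torch.distributed is already initialized (e.g. via torchrun).
model = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 1024))

# Wrap submodules with more than 1M parameters in their own FSDP unit;
# FSDP flattens each unit's parameters into a single sharded FlatParameter.
wrap_policy = functools.partial(size_based_auto_wrap_policy, min_num_params=1_000_000)
fsdp_model = FSDP(model, auto_wrap_policy=wrap_policy)
```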
-
May I ask why, given the relatively small size of the TinyLlama model, the strategy was set to FSDP (Fully Sharded Data Parallel) instead of DDP (Distributed Data Parallel)…
-
```
[rank0]: File "/opt/venv/lib/python3.10/site-packages/lightning_fabric/wrappers.py", line 411, in _capture
[rank0]: return compile_fn(*args, **kwargs)
[rank0]: File "/opt/venv/lib/pytho…
```
-
Hi, I tried to resume my training from an intermediate checkpoint file with `cfg.MODEL.WEIGHTS` & `no_resume=False`, but it didn't work. The checkpointer cannot locate the checkpoint file, as there are 8 fil…
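If the 8 files are per-rank FSDP shards (one per GPU), one common workaround is to gather a full state dict on rank 0 and save that single file, which a `cfg.MODEL.WEIGHTS`-style loader can then find. A hedged sketch, assuming `model` is already wrapped in FSDP and the process group is initialized:
```python
import torch
from torch.distributed.fsdp import (
    FullyShardedDataParallel as FSDP,
    FullStateDictConfig,
    StateDictType,
)

# Gather the unsharded parameters on rank 0, offloaded to CPU to avoid OOM.
save_policy = FullStateDictConfig(offload_to_cpu=True, rank0_only=True)
with FSDP.state_dict_type(model, StateDictType.FULL_STATE_DICT, save_policy):
    full_state = model.state_dict()

if torch.distributed.get_rank() == 0:
    torch.save(full_state, "consolidated_model.pth")  # single-file checkpoint
```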
-
### Bug description
PyTorch Lightning is taking more memory than plain PyTorch FSDP.
I'm able to train the gemma-2b model, but it takes 3 times more memory.
For openchat, it goes out of memory.
…
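For comparison, a minimal sketch of the knobs that usually matter for FSDP memory in Lightning (wrap policy and activation checkpointing); the `Block` class below is a placeholder for the model's actual transformer/decoder layer class:
```python
import torch.nn as nn
from lightning.pytorch import Trainer
from lightning.pytorch.strategies import FSDPStrategy

# Placeholder block type; for gemma-2b this would be its decoder layer class.
class Block(nn.Module):
    pass

strategy = FSDPStrategy(
    auto_wrap_policy={Block},                 # shard each block as its own FSDP unit
    activation_checkpointing_policy={Block},  # recompute activations to save memory
)
trainer = Trainer(accelerator="gpu", devices=2, precision="bf16-mixed", strategy=strategy)
```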
-
Hi, I am running the code and installed the necessary packages from requirements.txt. It pins pytorch-lightning==1.9.1, but when I run `pip show pytorch-lightning`
in the terminal it shows
```
…