-
I forked your wonderful program and updated requirements.txt and a few other files so that it now works with Python 3.10 and PyTorch 2.2.1.
I also fixed some of the warnings for DataLoader and the Pandas…
-
### System Info
I am getting the following error, which should not occur:
cannot import name 'ShardedDDPOption' from 'transformers.trainer'
I have the following versions installed - …
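
An import error of this kind generally means the installed package does not expose the requested name. Below is a minimal diagnostic sketch using only the standard library. `exports_name` is a hypothetical helper introduced for illustration, and the demo probes `collections` so that it runs without transformers installed. It makes no claim about which transformers releases contain `ShardedDDPOption`.

```python
import importlib


def exports_name(module_name, attr):
    """Return True if module_name can be imported and exposes attr."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        # The module itself is missing or fails to import.
        return False
    return hasattr(module, attr)


# Stand-in demo on a standard-library module.
present = exports_name("collections", "OrderedDict")
missing = exports_name("collections", "NoSuchName")

# For the case in this report one would call (assumes transformers is installed):
# exports_name("transformers.trainer", "ShardedDDPOption")
```

If the check returns False for the name in question, comparing the installed version against the version the calling code was written for is a reasonable next step.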
-
Occurred on https://github.com/dask/distributed/pull/6400 but I've seen it in other PRs as well
https://github.com/dask/distributed/runs/6524459060?check_suite_focus=true
```
______________________…
-
I'm using the torch.distributed.rpc package for a distributed training POC. Currently I'm seeing that the rpc package itself uses pickle, and pickle does not work well with some Python features like generator…
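
The pickle limitation mentioned above can be reproduced with the standard library alone. This is a minimal sketch that assumes nothing about torch.distributed.rpc internals: it only shows that pickle rejects generator objects, and that materializing the generator into a list first is one possible workaround. `count_up` and `is_picklable` are hypothetical names introduced for illustration.

```python
import pickle


def count_up(n):
    """Hypothetical generator used only for this illustration."""
    for i in range(n):
        yield i


def is_picklable(obj):
    """Return True if pickle can serialize obj, False otherwise."""
    try:
        pickle.dumps(obj)
        return True
    except (TypeError, AttributeError, pickle.PicklingError):
        return False


gen = count_up(3)
gen_ok = is_picklable(gen)         # generator objects cannot be pickled
list_ok = is_picklable(list(gen))  # a materialized list can be
```

Materializing only works when the sequence is finite and fits in memory. For a lazy or unbounded stream, a different design would be needed, such as sending the parameters required to rebuild the generator on the receiving side.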
-
## 🚀 Feature
### Pitch
Port https://github.com/pytorch/pytorch/blob/c4a157086482899f0640d03292e5d2c9a6a3db68/torch/distributed/fsdp/fully_sharded_data_parallel.py#L1069-L1194 to work with Thunde…
-
Support whole model activation offloading with FSDP - working in conjunction with activation checkpointing - via
https://github.com/pytorch/pytorch/blob/e9ebda29d87ce0916ab08c06ab26fd3766a870e5/to…
-
I modified deepspeed_sero3.yaml, setting num_machines to 8 and num_processes to 8, and got the following error. What else should I do to run SFT on an 8-node platform? Thanks.
```Traceback (most recent …
-
Currently the three primary reindex plugin requests do not implement our standard `AbstractXContentTestCase` testing infrastructure. This infrastructure would have detected issues like #43406.
The …
-
Platforms: linux, rocm
This test was disabled because it is failing in CI. See [recent examples](https://hud.pytorch.org/flakytest?name=test_distributed_checkpoint_state_dict_type1&suite=TestDistribu…
-
Platforms: linux, rocm
This test was disabled because it is failing in CI. See [recent examples](https://hud.pytorch.org/flakytest?name=test_distributed_checkpoint_state_dict_type0&suite=TestDistri…