-
I'm running nccl-test `all-reduce` between two nodes, and I've found that the tree algorithm performs much better than the ring algorithm. However, through reading the NCCL source code, I noticed tha…
-
### System Info
- `transformers` version: 4.44.0
- Platform: Linux-5.4.0-162-generic-x86_64-with-glibc2.31
- Python version: 3.11.9
- Huggingface_hub version: 0.23.4
- Safetensors version: 0.4.…
-
### 🚀 The feature, motivation and pitch
I can get a single checkpoint after fine-tuning the model with FSDP.
![image](https://github.com/user-attachments/assets/8c3019e3-458c-49e6-9cdd-3f868692df46)
…
-
See https://pytorch.org/docs/stable/distributed.tensor.parallel.html
The Llama 405B paper discusses using FSDP, pipeline parallelism, context parallelism, and tensor parallelism.
It'd be relatively s…
-
I'm trying to fine-tune _BAAI/bge-m3_, a model whose max sequence length is 8k, on a retrieval task. Here's my trainer setup:
```python
model = SentenceTransformer(model_id, de…
-
Thank you very much for the work you have brought, which is very helpful for those of us with fewer training resources. I am a newcomer to the field of NLP and am not very familiar with training frame…
-
### Please check that this issue hasn't been reported before.
- [X] I searched previous [Bug Reports](https://github.com/axolotl-ai-cloud/axolotl/labels/bug) and didn't find any similar reports.
### Exp…
-
When adding the NVMe drives, we changed what cards are installed in nodes 01 and 02, and also removed the bifurcating PCI-e card from the Mellanox cards in nodes 09 and 10. We need to update the mach…
-
When I try to use the `fp8_model_init` feature, it doesn't seem to be compatible with DDP. It throws an error:
`RuntimeError: Modules with uninitialized parameters can't be used with "DistributedDataParal…