-
### 🐛 Describe the bug
I am reading the source code of PyTorch DDP and using the PyTorch profiler to measure the performance of the NCCL allreduce operation. I understand that ncclAllReduce is an async call. …
lyppg updated
10 months ago
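Since the excerpt above is truncated, here is a minimal sketch of the async collective API it refers to. It substitutes the `gloo` backend and a single-rank world so it runs on CPU (NCCL needs GPUs), but `all_reduce(async_op=True)` returns the same kind of `Work` handle either way, which is what makes the call asynchronous from the caller's point of view:

```python
import os
import torch
import torch.distributed as dist
from torch.profiler import profile, ProfilerActivity

# Single-process "cluster" so the sketch runs on CPU without GPUs.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29501")
dist.init_process_group("gloo", rank=0, world_size=1)

t = torch.ones(4)
with profile(activities=[ProfilerActivity.CPU]) as prof:
    work = dist.all_reduce(t, async_op=True)  # returns immediately with a Work handle
    work.wait()                               # block until the collective completes

# Inspect where time was spent on the CPU side of the call.
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=5))
dist.destroy_process_group()
```

With `world_size=1` the allreduce is an identity operation; the point is only to show the handle-based async pattern and how the profiler brackets it.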
-
### Issue Description
Hi,
After creating the `ccsr` virtual environment and running `python3 inference_ccsr.py`, I encountered the following issue:
```
ModuleNotFoundError: No module name…
```
-
The original distributed parameter server idea included only one mode: 1...X shards, with the data stored across them.
Funny, but for better performance on "casual" model sizes, it w…
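A minimal sketch of that single mode, hash-based assignment of keys to 1...X shards (the `shard_for` helper is illustrative, not the actual parameter-server implementation):

```python
import hashlib

def shard_for(key: str, num_shards: int) -> int:
    """Map a key to one of num_shards shards via a stable hash.

    Hypothetical sketch of the 1...X sharding mode described above:
    every key deterministically lands on exactly one shard, so the
    data is spread across all shards without any coordinator state.
    """
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_shards

# Each parameter name maps to one shard; lookups are repeatable.
assignments = {k: shard_for(k, 4) for k in ("w1", "w2", "bias", "emb")}
```

The stable hash matters: the same key must resolve to the same shard on every worker, with no shared lookup table.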
-
It feels like requiring the user to specify these arguments for their respective distributions makes it impossible to express an action like: "here is the distribution, please distribute your data ov…
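The request above is truncated, but it seems to ask for an API that takes a distribution as a whole and scatters data over it. A hypothetical sketch of that shape (the `distribute` helper and its `weights` parameter are my invention, not an existing API):

```python
import random

def distribute(data, weights, seed=0):
    """Hypothetical helper: scatter items into buckets drawn from a
    categorical distribution, one weight per bucket. The caller hands
    over the whole distribution instead of per-bucket arguments."""
    rng = random.Random(seed)
    buckets = [[] for _ in weights]
    for item in data:
        idx = rng.choices(range(len(weights)), weights=weights, k=1)[0]
        buckets[idx].append(item)
    return buckets

# "Here is the distribution, please distribute my data over it."
parts = distribute(range(1000), weights=[0.7, 0.2, 0.1])
```

The point of the sketch is the calling convention: the distribution is a single argument, so the caller never enumerates its parameters one by one.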
-
### 🐛 Describe the bug
```python
import os
import torch
import torch.cuda
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.distributed.fsdp import FullyShardedDa…
```
-
### Pitch
[Redis recently relicensed their software suite](https://redis.com/blog/redis-adopts-dual-source-available-licensing/) under a [no-longer-free software license](https://lwn.net/Articles/9661…
-
## Week 1
- [x] Go fundamentals
- [x] Typescript fundamentals
- [ ] coreutils: `echo, env, cat, wc, head, tail, yes, true, false, tree` (use gobyexample to speed up things)
- [x] [Testing funda…
-
The guide should likely include materials for managers seeking best ways to support their distributed teams. What challenges and opportunities does remote work pose and afford? How can working more op…
-
Hi, I'm using my own dataset to reproduce your work. I noticed you used Slurm for training, but I can only use distributed training with dist_train.sh to train my own project. But there …
-
## Week 1
- [x] Go fundamentals
- [x] Typescript fundamentals
- [x] coreutils: echo, env, cat, wc, head, tail, yes, true, false, tree (use gobyexample to speed up things)
- [x] Testing fundame…