-
**Describe the bug**
When running GemmRS on two nodes, each with 4 A100 80GB GPUs connected via NVLink, I hit the following issue. Each node has 1 NIC connected to IB HDR200.
```
W0907 22:34:09.000000 22438061766464 torch/distributed/run.py…
```
-
I was looking through a [flaky test report](https://github.com/dask/distributed/runs/6215173657?check_suite_focus=true) and saw this:
```python-traceback
--------------------------- Subprocess s…
```
-
Hello, when running main_finetune.py to line 238:
```python
for param in fsdp_ignored_parameters:
    dist.broadcast(param.data, src=dist.get_global_rank(fs_init.get_data_parallel_group(), 0),
…
```
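For context, here is a minimal single-process sketch of the broadcast pattern above. It uses a one-rank `gloo` group as a stand-in for the real FSDP data-parallel group, and `broadcast_ignored_params` is a hypothetical helper, not a function from main_finetune.py:

```python
import os
import torch
import torch.distributed as dist

def broadcast_ignored_params(params, src_rank=0):
    # Broadcast each ignored parameter's data from the source rank
    # so every rank starts from identical values.
    for p in params:
        dist.broadcast(p.data, src=src_rank)

os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29513")
# One-rank gloo group: the broadcast is effectively a no-op here,
# but it exercises the same call path as the multi-node run.
dist.init_process_group("gloo", rank=0, world_size=1)
params = [torch.nn.Parameter(torch.zeros(4))]
broadcast_ignored_params(params)
result = params[0].data.tolist()
dist.destroy_process_group()
```

In the real script the source rank comes from `dist.get_global_rank(...)` over the data-parallel group; the fixed `src_rank=0` here is just for the single-process sketch.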
-
## 🔨Work Item
**IMPORTANT:**
* This template is only for the dev team to track project progress. For feature requests or bug reports, please use the corresponding issue templates.
* DO NOT create a new…
-
Work out how to make debugging easier when the tests are distributed across hosts.
-
Hi authors!
This is nice work, and congratulations on the CVPR24 acceptance!
I managed to deploy it on my machine and tested it on some data from the test set, and the pretrained model gave me amaz…
-
### Describe the bug
The long-lived circuits of Blazor Server cause distributed tracing not to work as expected.
Since each circuit is effectively a long-lived request ... a lot of *activity* (pun i…
-
I updated Firefox to v132 b1 today and the sidebar doesn't expand on mouse hover.
I rolled back to v131, and it still works there.
-
Hello,
I have executed the following command for training purposes.
`python -m torch.distributed.run --nproc_per_node=1 --master_port=2333 tools/train.py projects/configs/VAD/VAD_tiny_stage_1.py --…
-
I ran into this situation when training AllSpark on 2 RTX 3090s. I have tried many approaches, such as increasing the `timeout` of init_process_group, increasing NCCL_BUFFSIZE, and setting NCCL_P2P_LEVEL=NVL, but all of th…
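For reference, a hedged sketch of the timeout mitigation mentioned above (the issue presumably used the `nccl` backend on 2 GPUs; a one-rank `gloo` group is used here only so the snippet runs on CPU):

```python
import os
from datetime import timedelta
import torch.distributed as dist

os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29514")
# NCCL_BUFFSIZE and NCCL_P2P_LEVEL are environment variables and must be
# set before the process (or torchrun launcher) starts, not in code.
dist.init_process_group(
    "gloo",            # the real run would use "nccl"
    rank=0,
    world_size=1,
    timeout=timedelta(minutes=60),  # raise the collective-op timeout
)
initialized = dist.is_initialized()
dist.destroy_process_group()
```

A longer timeout only masks a stalled collective; if one rank never reaches the collective (e.g. due to uneven data sharding), the hang will persist regardless of the timeout value.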