-
### Checklist
- [X] 1. I have searched related issues but cannot get the expected help.
- [ ] 2. The bug has not been fixed in the latest version.
- [ ] 3. Please note that if the bug-related issue y…
-
The MPI Best Practices page needs to be updated to use the most recent commands for turning off high-speed internode communication. Specifically, the Intel command is not correct for the most recent v…
-
A thought, and part of a longer-term goal: can we implement an inter-node communication block? That is, could we implement an ethernet-based "virtual ring" that packetizes data up and sends it over fr…
-
Is the process stopping because I requested only 2 ideas to be generated?
I'm also curious about how to obtain the full paper.
I've been waiting for an hour, and the GPT API usage has been stu…
-
I want to use TE's comm-gemm-overlap module for multi-node training, but the README says this module only supports a single node. Does TE have any plans for multi-node support? And what effort…
-
Investigate ways to bring GPU utilization as close to 100% as possible and maximize model throughput. Focus on multi-GPU on a single node.
Collecting some questions from me and @benczaja -- fee…
-
For distributed recipes, such as full_finetune_distributed, the gradients end up getting synchronized after each backward() pass instead of only once before the optimizer step. This results in signifi…
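
To make the cost difference concrete, here is a minimal sketch that counts gradient synchronizations per optimizer step. It is a pure-Python toy, not the recipe's actual code: `ToyReplica`, `run_step`, and the micro-batch gradients are hypothetical names and values introduced only for illustration. The `no_sync()` context manager is modeled on the one that PyTorch's `DistributedDataParallel` exposes, and whether the distributed recipes can use that same mechanism is an assumption.

```python
from contextlib import contextmanager


class ToyReplica:
    """Hypothetical stand-in for a data-parallel model wrapper.

    It only accumulates a scalar "gradient" and counts how many times
    an all-reduce would have been triggered.
    """

    def __init__(self):
        self.sync_enabled = True
        self.all_reduce_calls = 0
        self.grad = 0.0

    @contextmanager
    def no_sync(self):
        # Temporarily disable synchronization, restoring the old state on exit.
        previous = self.sync_enabled
        self.sync_enabled = False
        try:
            yield
        finally:
            self.sync_enabled = previous

    def backward(self, local_grad):
        # Accumulate the local gradient; synchronize only if enabled.
        self.grad += local_grad
        if self.sync_enabled:
            self.all_reduce_calls += 1


def run_step(replica, micro_grads, defer_sync):
    """Run one optimizer step's worth of micro-batches and return the accumulated gradient."""
    replica.grad = 0.0
    last = len(micro_grads) - 1
    for i, g in enumerate(micro_grads):
        if defer_sync and i != last:
            # Skip synchronization on every micro-batch except the last one.
            with replica.no_sync():
                replica.backward(g)
        else:
            replica.backward(g)
    return replica.grad


micro_grads = [1.0, 2.0, 3.0, 4.0]  # made-up values for illustration

eager = ToyReplica()
eager_grad = run_step(eager, micro_grads, defer_sync=False)

deferred = ToyReplica()
deferred_grad = run_step(deferred, micro_grads, defer_sync=True)

print(eager.all_reduce_calls, deferred.all_reduce_calls)
```

In this toy, both variants accumulate the same gradient, but the eager variant triggers one synchronization per micro-batch while the deferred variant triggers one per optimizer step.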
-
From working with prefect, I realized it is pretty hard to set flags specific to a unit-test which influence task execution (i.e. disabling some external actions which won't work in a specific test se…
-
This is not a "bug" but maybe more of a communication thing.
I've noticed that many of the pickers (Content Picker, Multi Node Tree Picker, Template Picker and more) will populate the selected stat…
-
`rmw_zenoh_cpp` sometimes fails to pass [system_tests/test_communication](https://github.com/ros2/system_tests/tree/rolling/test_communication) with the following error.
```txt
Traceback (most rec…