-
**Describe the issue**:
When shutting down a UCX cluster with GIL contention monitoring enabled (i.e. `gilknocker` is installed and `distributed.admin.system-monitor.gil.enabled=true`), we get so…
-
### 🚀 The feature, motivation and pitch
Currently, FX graph tracing (such as that used in `aot_module`) does not seem to support collective functions such as `allgather`.
We may face some err…
-
### Branch
master branch (0.24 or other 0.x version)
### Describe the bug
Using dist_test.sh to evaluate the same model gives different results every time!
./tools/dist_test.sh configs/violence/violen…
-
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string) + 0x99 (0x7fe76ab98969 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::W…
-
Screening should work with apps distributed as Chrome extensions (like Ninja)
-
Hi, I noticed a compile flag in the single-device training script that does not exist in the distributed training script. Does distributed mode support it?
-
What is your level of interest/availability in working on the _Distributed Permanent Identifier Registry_ paper @jbenet @ChristopherA @aquabu @peacekeeper @talltree @dukedorje ?
I think we have the s…
-
Current pyobo includes annotations (in the sense of GO annotations, not OWL annotations) modeled as `relationship`s (i.e., `S subClassOf R some O`).
An example of this is ec.obo:
```yaml
[Term]
…
-
Training the Stable Diffusion XL UNet using the accelerate library with FSDP: fsdp_offload_params: true; fsdp_sharding_strategy: SHARD_GRAD_OP
Environment:
accelerate-0.34.2
torch-2.4.1
CUDA Version: 12…
-
## ❓ Questions and Help
export NGPUS=2
python -m torch.distributed.launch --nproc_per_node=$NGPUS tools/train_net.py --config-file "config/file.yaml"
```
(torch1n) xxx@xxx-Super-Server:/media/hell…