-
### Current Behavior
dependentTasksOutputFiles doesn't work when using Distributed Task Execution.
I suspect this has something to do with the timing of fetching outputs from dependent tasks fro…
-
Hello, thank you for open-sourcing this project. When I train on my own dataset, an error is reported at the end of the first training epoch. The error message is as follows:
2024-10-18 20:47:35,180 D…
-
I got an error while running scalene with `torch.distributed.run`.
I am currently following this [doc](https://github.com/hustvl/VAD/blob/main/docs/train_eval.md)
```bash
python -m torch.distribute…
-
The scheduler currently relies on a crude heuristic to infer topologies that may suggest that certain tasks are "root-ish". If the tasks are detected as such, they are "queued" to avoid memory pressur…
-
Symptom: moco fails with exception in DistributedDataParallel:
```
Traceback (most recent call last):
File "/home/jovyan/work/triton-no-conda/pytorch/benchmarks/dynamo/torchbench.py", line 481, i…
-
Platforms: linux
This test was disabled because it is failing in CI. See [recent examples](https://hud.pytorch.org/flakytest?name=test_sparse_gradients_grad_is_view&suite=DistributedDataParallelTest&…
-
### 🐛 Describe the bug
```python
# fsdp_model with mixed precision (fp32 parameters)
fsdp_model.to(torch.bfloat16)
save_policy = FullStateDictConfig(offload_to_cpu=True, rank0_only=True)
with FSD…
-
**Describe the bug**
https://s3.amazonaws.com/clickhouse-test-reports/43779/cdf6f7e6349c69d6650ff7e4f51382aeaef2d44a/fuzzer_astfuzzerubsan//report.html
**How to reproduce**
``` sql
select number…
-
Hello!
I have no experience with Julia. From a technical standpoint, would it work to simply declare the for loop that iterates over every cloned git repository as `@Distribute…
-
Hi, awesome looking project.
I looked through the documentation/examples and through the code, and it doesn't seem to address distributed-environment use cases (e.g. when we want a function executi…