-
Occurred on https://github.com/dask/distributed/pull/6400 but I've seen it in other PRs as well.
https://github.com/dask/distributed/runs/6524459060?check_suite_focus=true
```
______________________…
-
### Describe the bug
The sharding of IterableDatasets with respect to distributed and dataloader worker processes appears problematic with significant performance traps and inconsistencies with respect to d…
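
To make the setting concrete, here is a minimal sketch of one common way shards can be split across distributed ranks and dataloader workers. It is an illustration only, not the library's actual logic: the function name `shards_for_worker` and the strided assignment scheme are assumptions made for this example.

```python
def shards_for_worker(num_shards, world_size, rank, num_workers, worker_id):
    """Return the shard indices one dataloader worker on one rank would read.

    Hypothetical scheme (an assumption, not any library's implementation):
    flatten (rank, worker_id) into a single global consumer index and
    stride over the shard indices with that offset.
    """
    consumers = world_size * num_workers          # total reading processes
    consumer_id = rank * num_workers + worker_id  # unique id per process
    return list(range(consumer_id, num_shards, consumers))


if __name__ == "__main__":
    # 8 shards, 2 ranks, 2 dataloader workers per rank
    for rank in range(2):
        for worker_id in range(2):
            print(rank, worker_id, shards_for_worker(8, 2, rank, 2, worker_id))
```

Under this scheme, a shard count that is not divisible by `world_size * num_workers` leaves some processes with fewer shards than others, or with none at all, which is one way an uneven or idle-worker situation can arise.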
-
https://github.com/dmlc/xgboost/actions/runs/11753771153/job/32747003155
```
E distributed.client.FutureCancelledError: ('_argmax-06657a445bd2e0d811c6ff48d5860817', 24) cance…
hcho3 updated 2 weeks ago
-
**Describe the bug**
If the training data does not live on NFS but on node-specific storage, the current logic in https://github.com/NVIDIA/Megatron-LM/blob/0bc3547702464501feefeb5523b7a17e591b21fa/m…
-
### Context:
We are following the [FSDP example](https://github.com/aws-samples/awsome-distributed-training/tree/main/3.test_cases/10.FSDP) and trying to understand the mechanism behind how differe…
-
**Describe the bug**
When training a model consuming more memory, I noticed that my training would stop after a constant number of epochs. Upon further investigation, I found that during training / v…
-
Next Demo Day: December 5th
---
See what the Dask community has been up to, or share some Dask work of your own. Demos are short and informal (~5-10 minutes). Have something you'd like to share? L…
-
Several tests in test_shuffle.py are very flaky.
If I change `.github/workflows/tests.yaml` as follows, to rerun the tests 20 times (ci1 + not ci1) per environment:
```
pytest distribut…
-
I started a [thread](https://github.com/rapidsai/ucx-py/issues/1072) in ucx-py, and now I have replaced ucx-py with ucxx, which resolved the blocking issue. However, in terms of performance, ucx is sl…
-
### Search before asking
- [X] I have searched the Ultralytics YOLO [issues](https://github.com/ultralytics/ultralytics/issues) and [discussions](https://github.com/ultralytics/ultralytics/discussion…