-
Thanks for your wonderful work.
I am trying to pre-train InstructBLIP from scratch on 4x4 A100 GPUs. However, GPU memory slowly increases as training progresses, which leads to OUT-OF-MEMORY aft…
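This isn't specific to InstructBLIP, but the most common cause of GPU memory that creeps up step by step in a PyTorch training loop is accumulating tensors that still carry the autograd graph. A minimal sketch of the pattern and the fix (`model`, `loader`, and `optimizer` are placeholders, not the actual InstructBLIP code):

```python
# Placeholders standing in for the real InstructBLIP training objects.
running_loss = 0.0
for step, batch in enumerate(loader):
    optimizer.zero_grad(set_to_none=True)
    loss = model(**batch).loss
    loss.backward()
    optimizer.step()

    # Leak pattern: `running_loss += loss` would keep every step's
    # autograd graph alive, so GPU memory grows each iteration.
    # Fix: convert to a Python float before accumulating.
    running_loss += loss.item()
```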
-
I’ve been using dask to work with a very large array without loading it into memory, and it mostly works well for that. But for some reason I can’t figure out, it will _sometimes_ entirely stop, ind…
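For context, the out-of-core workflow being described looks roughly like this; the array below is random and the shape/chunks are made up, but the point is that nothing is materialized until `compute()`:

```python
import dask.array as da

# Made-up shape and chunking; stands in for an array far larger than RAM.
x = da.random.random((200_000, 10_000), chunks=(10_000, 10_000))

# Builds a lazy task graph only; no large memory is used yet.
result = (x - x.mean(axis=0)).std(axis=0)

# Work happens here, chunk by chunk; this is the step that
# reportedly sometimes hangs.
print(result.compute())
```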
-
Build a distributed cache implementation for Metro using http/smb (file-server) or ADO (artifact store).
Detailed notes are in https://github.com/microsoft/rnx-kit/discussions/983.
Developers …
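As a rough illustration of the http/file-server option only (none of this is rnx-kit code; the directory and port are made up), a remote cache endpoint just has to answer GET and PUT keyed by content hash:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
import os

CACHE_DIR = "/tmp/metro-cache"  # made-up location

class CacheHandler(BaseHTTPRequestHandler):
    def _path(self):
        # Cache key is the URL path, e.g. /<hash-of-inputs>.
        return os.path.join(CACHE_DIR, os.path.basename(self.path))

    def do_GET(self):
        try:
            with open(self._path(), "rb") as f:
                body = f.read()
            self.send_response(200)
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        except FileNotFoundError:
            self.send_response(404)
            self.end_headers()

    def do_PUT(self):
        length = int(self.headers["Content-Length"])
        os.makedirs(CACHE_DIR, exist_ok=True)
        with open(self._path(), "wb") as f:
            f.write(self.rfile.read(length))
        self.send_response(201)
        self.end_headers()

HTTPServer(("", 8080), CacheHandler).serve_forever()
```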
-
Very interesting feature. I bumped into a similar problem with read_csv (~20k files, ~1MB each) and landed on #4012.
Is there any similar feature for read_csv?
I tried to search but found none, als…
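For reference, a sketch of the situation (the glob is hypothetical); the manual `repartition` below is the kind of coalescing a #4012-style feature for read_csv would presumably automate:

```python
import dask.dataframe as dd

# Hypothetical glob: ~20k CSV files of ~1 MB each.
df = dd.read_csv("data/part-*.csv")

# Each tiny file becomes its own partition, so the graph has ~20k
# tasks; coalescing them by hand is the current workaround.
df = df.repartition(partition_size="100MB")
print(df.npartitions)
```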
-
I am trying to do data analysis on 9900 Parquet files that total 100 GB in size.
After 70K garbage collections, I get the warning:
`distributed.utils_perf - WARNING - full garbage collections …
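A sketch of the setup at that scale (the path and column name are illustrative):

```python
import dask.dataframe as dd
from dask.distributed import Client

client = Client()  # the GC warning comes from distributed's monitoring

# Illustrative path: ~9900 Parquet files, ~100 GB total.
df = dd.read_parquet("s3://bucket/dataset/")

# A full-dataset aggregation; long-running graphs like this are where
# the "full garbage collections ..." warning tends to show up.
print(df.groupby("key").size().compute())
```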
-
With a couple of recent merges in, yesterday I triggered another "CI stress test" that runs our suite several times in a row (this time 10);
see https://github.com/fjetter/distributed/tree/stress…
-
**What happened**:
Running dask-yarn on EMR causes a repeating error in tornado on client creation.
**What you expected to happen**:
No error, just the client being created and being able to …
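For reference, client creation with dask-yarn follows this shape (the environment archive and worker sizing are placeholders):

```python
from dask_yarn import YarnCluster
from dask.distributed import Client

# Placeholder packed environment and worker sizing.
cluster = YarnCluster(
    environment="environment.tar.gz",
    worker_vcores=2,
    worker_memory="4GiB",
)

# The repeating tornado error reportedly appears at this point.
client = Client(cluster)
```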
-
Thanks for the great work!
When I run my inference code below using `deepspeed --include localhost:0,1,2 inference.py --model opt-iml-30b --dataset WQSP`, I hit the error **exits with return code = -9**…
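For what it's worth, return code -9 is SIGKILL, which on Linux usually means the OS OOM killer fired because host RAM ran out while loading the 30B checkpoint, not a CUDA OOM. A hedged sketch of one common mitigation, loading in half precision before handing the model to DeepSpeed (the checkpoint name is assumed):

```python
import torch
import deepspeed
from transformers import AutoModelForCausalLM

# Assumed Hugging Face checkpoint name for opt-iml-30b.
model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-iml-30b",
    torch_dtype=torch.float16,   # halves the host RAM needed during load
    low_cpu_mem_usage=True,      # avoids a second full-size copy
)

# mp_size=3 matches the three GPUs in --include localhost:0,1,2.
engine = deepspeed.init_inference(model, mp_size=3, dtype=torch.float16)
```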
-
@kolia reported this issue with K8sClusterManagers@0.1.2:
```julia
julia> addprocs(K8sClusterManager(n_workers; pending_timeout=180, memory="1Gi"))
[ Info: driver-2021-05-18--20-31-35-wgssh-worke…
```
-
### 🐛 Describe the bug
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 0 (pid: 514946) of binary:
[E ProcessGroupNCCL.cpp:821] [Rank 0] Watchdog ca…
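Not the reporter's code, but the two knobs usually tried first for an NCCL watchdog timeout are more verbose NCCL logging and a longer collective timeout; a minimal sketch:

```python
import os
from datetime import timedelta

import torch.distributed as dist

# More verbose NCCL logging, to see which collective stalls.
os.environ["NCCL_DEBUG"] = "INFO"

# Raise the collective timeout past the 30-minute default, in case a
# rank is merely slow (e.g. long data loading) rather than dead.
dist.init_process_group(backend="nccl", timeout=timedelta(hours=2))
```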