-
### Describe the bug
I followed the examples/dreambooth/README_flux.md guide for setup and training and got a CUDA OOM on a 3090 Ti with 24GB of VRAM.
### Reproduction
PC has 256GB RAM
3090Ti VRAM 24GB
torch 2…
-
I am trying to do data analysis on 9900 parquet files that total 100GB in size.
After 70K garbage collections warning:
`distributed.utils_perf - WARNING - full garbage collections …
-
### Is your feature request related to a problem? Please describe the problem.
In the runtime repo, we have included a bunch of built-in, in-memory `RateLimiter` implementations like `ConcurrencyLi…
-
Hi
I have multiple large-scale datasets and I need to write a dataloader for them with a distributed sampler so they can be handled on TPUs and used with PyTorch XLA. Could you guide me to any existin…
-
Per the log, it uses a ResNet101 model with a batch-size of 128 (per GPU).
This causes out-of-memory on at least two flavors of GPU drivers (ROCm and CUDA) w/16GB GPU memory.
`RuntimeError: CU…
-
### 🐛 Describe the bug
The function destroy_process_group(group) is not working. As the code below shows, memory consumption does not decrease after this function is executed. By executing "del group"…
-
### 🐛 Describe the bug
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 0 (pid: 514946) of binary:
[E ProcessGroupNCCL.cpp:821] [Rank 0] Watchdog ca…
-
We need to verify that the DomainDB and UrlDB states are checkpointed/savepointed properly.
For checkpointing, we need a test that enables checkpointing (in memory), causes the job to fail, and the…
-
(Some context of this is in https://github.com/dask/distributed/issues/2602)
## Summary
Workers should start taking memory generation into account in local scheduling policies. This affects both task prio…
-
Hello,
I am encountering an issue when running vg autoindex to construct a graph from an HG002 reference FASTA and VCF file. The command I am using is as follows:
vg autoindex --workflow map --thre…