-
RuntimeError: Expected to mark a variable ready only once. This error is caused by one of the following reasons: 1) Use of a module parameter outside the `forward` function. Please make sure model par…
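This DDP error typically fires when a parameter participates in autograd more than once per reducer cycle, e.g. two forward passes before a single `backward`, or touching a module parameter outside `forward`. A minimal CPU sketch of the safe one-forward-one-backward pattern (the single-process `gloo` group and the `Linear` model are illustrative assumptions, not from the original report):

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Single-process process group, just so DDP can initialize on CPU.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

model = DDP(torch.nn.Linear(4, 4))
x = torch.randn(2, 4)

# Problematic pattern (commented out): two forwards, then one backward.
# DDP's reducer prepares each parameter once per iteration, so a second
# autograd pass over the same parameters can mark a variable ready twice.
# loss = model(x).sum() + model(x).sum(); loss.backward()  # may raise

# Safe pattern: exactly one forward/backward per iteration. If the graph
# legitimately varies, consider find_unused_parameters=True or static_graph.
loss = model(x).sum()
loss.backward()
print("backward ok:", all(p.grad is not None for p in model.parameters()))

dist.destroy_process_group()
```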
-
I noticed in your reply that training on an A100 consumed 39 GB. I trained with four 4090 GPUs but still got an out-of-memory error. I wonder if you could provide a version for multi-GPU…

-
So far [train_second.py](https://github.com/yl4579/StyleTTS2/blob/main/train_second.py) only works with DataParallel (DP) but not DistributedDataParallel (DDP). One major problem with this is if we si…
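The practical difference between the two wrappers: DP is a one-line wrap driven by a single process, while DDP runs one process per GPU (launched with e.g. `torchrun --nproc_per_node=N`) and each rank initializes a process group around its own replica. A minimal CPU sketch of both (the single-process `gloo` group is only for illustration; it is not how the StyleTTS2 script is launched):

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DataParallel, DistributedDataParallel

def build_model():
    return torch.nn.Linear(8, 2)

# DataParallel: one process drives all GPUs; simple, but it re-replicates
# the model every step and is bottlenecked by the main GPU.
dp_model = DataParallel(build_model())

# DistributedDataParallel: each rank joins a process group and wraps its
# own replica; gradients are all-reduced during backward.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29501")
dist.init_process_group("gloo", rank=0, world_size=1)  # CPU demo group
ddp_model = DistributedDataParallel(build_model())

x = torch.randn(4, 8)
dp_out = dp_model(x)
ddp_out = ddp_model(x)
print("dp out:", tuple(dp_out.shape), "ddp out:", tuple(ddp_out.shape))
dist.destroy_process_group()
```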
-
### 🐛 Describe the bug
I am training the project on different machines
https://github.com/ultralytics/yolov5
machine 1
```
docker run -it --gpus all --rm -v $(pwd):/mnt --network=host nvcr.io/nvidia…
```
-
AttributeError: 'HandleControlledSequence' object has no attribute 'L'. How can I fix it? I'm looking forward to your reply.
-
When I try to run a hyperparameter search with Optuna on 8 GPUs using the DDP strategy,
the sweeper starts 8 groups of different hyperparameters, so the parameter shapes don't match across the GPUs.
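One common workaround for this class of problem is to let only rank 0 talk to the sweeper and broadcast its sampled values to the other ranks, so every process builds an identically shaped model. A hedged sketch (the `sample_params` function stands in for an Optuna `trial.suggest_*` call and is hypothetical; the single-process `gloo` group is only for illustration):

```python
import os
import torch.distributed as dist

os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29502")
dist.init_process_group("gloo", rank=0, world_size=1)
rank = dist.get_rank()

def sample_params():
    # Hypothetical stand-in for the sweeper's suggestion on rank 0.
    return {"hidden_size": 128, "lr": 3e-4}

# Rank 0 samples; every other rank receives the same dict, so all ranks
# construct models with matching parameter shapes.
payload = [sample_params() if rank == 0 else None]
dist.broadcast_object_list(payload, src=0)
params = payload[0]
print("rank", rank, "uses", params)

dist.destroy_process_group()
```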
-
### 🐛 Describe the bug
The process works correctly with DDP world size 1, but with world size > 1 it hangs, with GPU 0 at 0% and GPU 1 pinned at max occupancy. I've replicated this bot…
-
Hello, I am trying to train a network using DDP. The network consists of two sub-networks (a, b), and depending on the input either only a, only b, or both a and b get …
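When only a subset of sub-networks runs on a given step, DDP's default reducer waits for gradients that never arrive. Passing `find_unused_parameters=True` tells the reducer to handle parameters skipped in that iteration. A minimal sketch with a toy two-branch module (`TwoBranch` is hypothetical; the single-process `gloo` group is only so the example runs on CPU):

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29503")
dist.init_process_group("gloo", rank=0, world_size=1)

class TwoBranch(torch.nn.Module):
    """Toy stand-in for a model with sub-networks a and b."""
    def __init__(self):
        super().__init__()
        self.a = torch.nn.Linear(4, 4)
        self.b = torch.nn.Linear(4, 4)

    def forward(self, x, use_a=True):
        # Only one branch participates in the graph on this step.
        return self.a(x) if use_a else self.b(x)

# find_unused_parameters=True lets the reducer finish the iteration even
# though branch b received no gradient this step.
model = DDP(TwoBranch(), find_unused_parameters=True)
loss = model(torch.randn(2, 4), use_a=True).sum()
loss.backward()
print("grad on used branch a:", model.module.a.weight.grad is not None)

dist.destroy_process_group()
```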
-
### Bug description
Training freezes when using `ddp` on a SLURM cluster (`dp` runs as expected). The dataset is loaded via torchdata from an S3 bucket. Similar behaviour also arises when using webda…
-
### General
- [x] Prepare scaling plots by the end of February. Y-axis: the speedup when running one epoch through the model on 2, 4, 6, 8, and 10 GPUs
- [x] Find out how many samples we have in the …