-
I'm also running into a problem. After downgrading to 2.4.1, the ETA for fitting the model is around 100 hours, even on an NVIDIA T4. Even running the project directly…
-
@kohya-ss Hi, I trained a LoRA for a dress on FLUX and it gives me blurry results. I am using it at weights 1 and 1.3.
![0727-hcb_white_dress_a_woman_posing_in_a_gar-flux1-dev-1149062164](https://gith…
-
Hi All,
I get NaN in gradParameters when training with multiple GPUs.
I have tried both CUDA 7.5 (two K80) and CUDA 8.0 (two 1080P), and got a similar error.
Any suggestion will be greatly appreciat…
-
My system
1. 2x Titan X
2. CUDA 7.5
3. cuDNN v3
With 1x Titan X, 20 iterations of training on ImageNet-1000 with CaffeNet take about 6.5ms.
`I1029 15:17:08.761509 20493 solver.cpp:236] Iteration 40,…
-
### What happened + What you expected to happen
The example script **self_play_league_based_with_open_spiel.py** found [**here**](https://github.com/ray-project/ray/blob/master/rllib/examples/multi_a…
-
Hello, I would like to ask whether multi-GPU training is supported and, if so, how the code should be modified.
-
**Describe the bug**
Distributed training either gets stuck in the testing phase after loading the saved model or throws `EOFError: Ran out of input` when running the following command from source
…
-
TorchMetrics support is pretty reliable nowadays and makes distributed training less annoying (no more world sizes, yay!). It also syncs well with Wandb logging and allows monitoring of training batch…
-
### Please ask your question
Hi,
First of all, thank you for all your work.
1) I have a small question regarding multi-GPU training. I see that the GPU memory usage on the master node …
-
# Setup
- A multi-GPU rig with top-of-the-line GPUs:
  - Several 3090 GPUs;
  - Or several A100 GPUs;
- A `pytorch:1.7.0-cuda11.0-cudnn8-devel` container derivative;
- Latest `docker`, `nvid…