-
How can I run your code with distributed training? I tried setting "use_distributed: True" in your configuration file, but it does not work; the code seems to support only single-GPU mode.
-
### Description
I use two 4090 hosts for data-parallel distributed training via jax.distributed, like this:
```
jax.distributed.initialize(coordinator_address="[ip]:[port]",
                           num_processes=2,
                           process_id=…
```
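For reference, a minimal two-process data-parallel setup with jax.distributed usually looks something like the sketch below (the coordinator address, the PROCESS_ID environment variable, and the toy pmapped step are placeholders, not the reporter's actual script):

```
import os
import jax
import jax.numpy as jnp

# Run one copy of this script on each host, with process_id 0 on one host
# and 1 on the other (PROCESS_ID here is a hypothetical environment variable).
jax.distributed.initialize(
    coordinator_address="10.0.0.1:1234",        # placeholder coordinator host:port
    num_processes=2,
    process_id=int(os.environ.get("PROCESS_ID", "0")),
)

print(jax.process_index(), jax.local_device_count(), jax.device_count())

# Data parallelism: each process feeds only its local shard of the batch;
# pmap replicates the computation across the local devices of every process.
@jax.pmap
def step(x):
    return jnp.mean(x ** 2)

local_batch = jnp.ones((jax.local_device_count(), 8))
print(step(local_batch))   # one value per local device
```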
-
### Description
Multi-node, multi-*PU training. This is required to really scale our use of the data pipeline for big predictions, and given the construction of the pipeline as it exists, we just …
-
Machines:
- dual 4090 Ada
- dual A4500
- single A6000
- single A4000
- single 3500 Ada
Concentrate on the A6000 and A4000 with 10 Gbps networking.
- https://www.tensorflow.org/guide/distributed_trai…
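On the TensorFlow side, the guide linked above centers on tf.distribute strategies; a minimal multi-worker sketch for the A6000/A4000 pair, assuming each machine exports an appropriate TF_CONFIG, might look like this (hostnames, ports, and the toy model are placeholders):

```
import tensorflow as tf

# Each machine would export TF_CONFIG before launching, e.g. on the A6000 box:
#   {"cluster": {"worker": ["a6000-host:12345", "a4000-host:12345"]},
#    "task": {"type": "worker", "index": 0}}
# (hostnames/ports are illustrative, not the actual machines' addresses)
strategy = tf.distribute.MultiWorkerMirroredStrategy()

with strategy.scope():
    # Variables created inside the scope are mirrored across all workers.
    model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
    model.compile(optimizer="adam", loss="mse")

# Each worker trains on its shard of the data; gradients are all-reduced
# over the 10 Gbps link between the two machines.
# model.fit(sharded_dataset, epochs=...)
```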
-
Looking for a way to train alignn in a distributed fashion, I stumbled upon this package.
It looks really nice, but I could not get distributed training to work on Slurm.
One issue was that the t…
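Since the issue text is truncated, here is only a generic sketch of how a training process can pick up its rank from Slurm-provided environment variables and initialize torch.distributed; it is not this package's actual launcher, and the master address/port are placeholders:

```
import os
import torch
import torch.distributed as dist

# Generic pattern: one task per GPU, ranks taken from Slurm's environment.
rank = int(os.environ["SLURM_PROCID"])        # global rank of this task
world_size = int(os.environ["SLURM_NTASKS"])  # total number of tasks
local_rank = int(os.environ["SLURM_LOCALID"]) # rank within this node

# MASTER_ADDR/PORT are placeholders; typically exported in the sbatch script
# from the first entry of `scontrol show hostnames $SLURM_JOB_NODELIST`.
os.environ.setdefault("MASTER_ADDR", "node001")
os.environ.setdefault("MASTER_PORT", "29500")

dist.init_process_group("nccl", rank=rank, world_size=world_size)
torch.cuda.set_device(local_rank)
```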
-
When running the command ```sh scripts/dist_train.sh 4 --cfg_file ...```, I get the following error:
```
further instructions
warnings.warn(
[2024-04-22 13:22:53,028] torch.distributed.run: [W…
```
-
Hi, I appreciate your repos. I've been using the clip-iqa model from your repo for study purposes.
It worked well in a single-GPU setting when I followed your simple training scripts.
I want to use distri…
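For context, a bare-bones PyTorch DDP skeleton launched with torchrun would look roughly like the sketch below; it is not the repo's actual training script, and the tiny linear model is only a stand-in for clip-iqa:

```
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Intended to be launched with: torchrun --nproc_per_node=<num_gpus> train.py
# torchrun sets RANK, WORLD_SIZE, and LOCAL_RANK for every process it spawns.
dist.init_process_group("nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(512, 1).cuda(local_rank)   # placeholder for the real model
model = DDP(model, device_ids=[local_rank])

# Each process loads a different shard of the data (e.g. via DistributedSampler);
# DDP averages gradients across GPUs during backward().
```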
-
Hi, just wondering if distributed training works the way I think it does, where GPU VRAM is shared between all available GPUs, enabling larger batch sizes / higher-resolution training images, etc. I am …