-
Hi,
Thanks for your code release.
I have a question about the multi-GPU training command.
Is it possible to train with multiple GPUs (8) inside Docker?
Like: `python -m torch.distributed.launch --nproc…`
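What I have in mind is roughly the sketch below (the `train.py` entry point is a hypothetical name; the container would also need GPU access, e.g. `docker run --gpus all`, and a larger shared-memory size). I'm assuming the script follows the usual `torch.distributed.launch` convention of one process per GPU:

```python
# Rough sketch only. Inside the container, something like:
#   python -m torch.distributed.launch --nproc_per_node=8 train.py
# which expects the script to accept --local_rank and join a process group.
import argparse

import torch
import torch.distributed as dist

parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=0)  # filled in by the launcher
args = parser.parse_args()

dist.init_process_group(backend="nccl")  # reads RANK/WORLD_SIZE/MASTER_ADDR from env
torch.cuda.set_device(args.local_rank)   # pin this process to its own GPU
```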
-
## Description
Whenever I run this code, the Dask job crashes, all the workers get lost, and then the task just hangs forever, whereas the same code works fine if I provide small files.
-
Examples for (distributed) training:
* Hugging Face dataset (see the sketch after this list)
* TF dataset
* Kaggle
* WebDataset / PyTorch
* JAX example
* Keras example
Examples for (distributed) inference:
* CLIP batch
* effi…
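For the Hugging Face dataset item, this is roughly the shape of example I'd expect (dataset name and batch size are placeholders, and it assumes the process group is already initialized):

```python
import torch
from datasets import load_dataset  # pip install datasets
from torch.utils.data import DataLoader, DistributedSampler

# Placeholder dataset; any Hugging Face dataset with a "train" split works the same way.
dataset = load_dataset("imdb", split="train").with_format("torch")

# Each distributed rank reads a disjoint shard of the data.
sampler = DistributedSampler(dataset)  # requires torch.distributed to be initialized
loader = DataLoader(dataset, batch_size=32, sampler=sampler)

for epoch in range(3):
    sampler.set_epoch(epoch)  # reshuffle shards between epochs
    for batch in loader:
        ...  # forward/backward as usual
```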
-
### System Info
- `transformers` version: 4.41.1
- Platform: Linux-5.15.0-107-generic-x86_64-with-glibc2.35
- Python version: 3.10.12
- Huggingface_hub version: 0.23.1
- Safetensors version: 0.…
-
## 🚀 Description
Pipeline parallelism is a technique used in deep learning model training to improve efficiency and reduce the training time of large neural networks. Here we propose a pipeline paral…
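To make the idea concrete, here is a deliberately naive two-stage sketch (layer sizes, device names, and micro-batch count are arbitrary; this is an illustration of the concept, not the proposed implementation):

```python
import torch
import torch.nn as nn

# Toy model split into two stages placed on different GPUs.
stage0 = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU()).to("cuda:0")
stage1 = nn.Sequential(nn.Linear(4096, 10)).to("cuda:1")

def pipelined_forward(x, num_microbatches=4):
    # Split the batch into micro-batches so the two stages can overlap work
    # (asynchronous CUDA execution lets stage0 of one micro-batch run while
    # stage1 processes the previous one).
    outputs = []
    for mb in x.chunk(num_microbatches):
        h = stage0(mb.to("cuda:0"))
        outputs.append(stage1(h.to("cuda:1")))
    return torch.cat(outputs)

out = pipelined_forward(torch.randn(64, 1024))
```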
-
As we all know, TensorFlow v0.8 supports built-in distributed training, but I can't find any work on combining the two. So, could you share your plan for that?
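For context, the distributed primitives that shipped around TF v0.8 look roughly like the sketch below (host addresses and job layout are placeholders; this is the between-graph replication pattern from that era, not a recommendation for this project):

```python
import tensorflow as tf

# Placeholder cluster: one parameter server and two workers.
cluster = tf.train.ClusterSpec({
    "ps": ["ps0.example.com:2222"],
    "worker": ["worker0.example.com:2222", "worker1.example.com:2222"],
})

# Each process starts a server for its own job/task.
server = tf.train.Server(cluster, job_name="worker", task_index=0)

# Variables are pinned to the parameter server, ops run on the local worker.
with tf.device(tf.train.replica_device_setter(ps_tasks=1)):
    w = tf.Variable(tf.zeros([10]))
    loss = tf.reduce_sum(w)
```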
-
Hi @dbolya,
I'm training ResNet-50 on 4 GPUs. The GPU utilization is very low. However, when I train it on 1 GPU, the GPU utilization can be up to 60%. I'm planning to do distributed data parallel to…
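A minimal sketch of the DDP setup I'm considering (assuming one process per GPU launched with `torchrun`, which sets the `LOCAL_RANK` environment variable):

```python
import os

import torch
import torch.distributed as dist
import torchvision
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl")          # one process per GPU
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = torchvision.models.resnet50().cuda(local_rank)
model = DDP(model, device_ids=[local_rank])      # gradients sync across the 4 processes
```

The data loader would also need a `DistributedSampler` so each of the 4 processes reads a different shard of the dataset.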
-
Hi,
The training time is taking too long. I have a 40k annotated dataset of ironrod created using NViSII. With 64 GB RAM and a single NVIDIA RTX 3060 6 GB GPU, it took around 6 hours to generate 2 e…
-
For example, I have two GPU-enabled servers, each with 4 GPUs. How can my training be distributed over these two servers?
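A hedged sketch of what a two-node, 4-GPUs-per-node setup could look like with `torchrun` (the master address and the `train.py` script name are placeholders):

```python
# On server 1 (node rank 0), something like:
#   torchrun --nnodes=2 --nproc_per_node=4 --node_rank=0 \
#            --master_addr=SERVER1_IP --master_port=29500 train.py
# On server 2, the same command with --node_rank=1.
#
# Inside train.py, each of the 8 processes joins a single process group:
import os

import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")                # world size 8 across both servers
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))   # 0-3 on each server
print("global rank", dist.get_rank(), "of", dist.get_world_size())
```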
-
**What would you like to be added**:
Currently, the `status` of a jobset object consists only of two arrays: `conditions` and `replicatedJobsStatus`. This is great for providing detailed status of in…