-
Links to tracker issues for components planning updates in this release. Please update your manifests before **June 21st**.
Release Date: **June 24th**
- [ ] ODH Operator
- [x] ODH Dashboard
- […
-
### 📚 Describe the documentation issue
Currently, [training_benchmark_xpu.py](https://github.com/pyg-team/pytorch_geometric/blob/master/benchmark/multi_gpu/training/training_benchmark_xpu.py) only su…
-
Hello, and thanks for sharing this great code. Is it possible to use this trainer on multiple GPUs? I see that it is based on DeepSpeed, but I can't find any configuration files for distributed train…
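For reference, a minimal sketch of what a DeepSpeed configuration for multi-GPU data-parallel training might look like. All values here are illustrative assumptions, not settings taken from this repository:

```python
# Hypothetical DeepSpeed configuration for multi-GPU data-parallel training.
# The specific values are placeholders; the repository may need different ones.
ds_config = {
    "train_micro_batch_size_per_gpu": 8,   # per-GPU batch size
    "gradient_accumulation_steps": 4,      # steps accumulated per optimizer update
    "fp16": {"enabled": True},             # mixed-precision training
    "zero_optimization": {"stage": 2},     # ZeRO stage-2 optimizer-state sharding
}

# Written out as ds_config.json, this would typically be launched with:
#   deepspeed --num_gpus=4 train.py --deepspeed_config ds_config.json
```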
-
- [ ] Modify `pctrain` by adding a `--extract-features .opcfeat.bin` parameter. When set, execution should stop at https://github.com/uav4geo/OpenPointClass/blob/main/randomforest.cpp#L30 and https:/…
-
If we use the VILADistributedSampler (https://github.com/Efficient-Large-Model/VILA/blob/main/llava/train/llava_trainer.py#L274-L281) for Distributed Training, should the `gradient_accumulation_steps`…
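In typical data-parallel setups the effective global batch size is the per-device batch times the gradient-accumulation steps times the world size, so scaling from one GPU to several usually means reducing `gradient_accumulation_steps` proportionally to keep it constant. A small sketch of that arithmetic (the helper name is my own, not from VILA):

```python
def effective_batch_size(per_device_batch: int,
                         grad_accum_steps: int,
                         world_size: int) -> int:
    """Effective global batch size in a standard data-parallel setup."""
    return per_device_batch * grad_accum_steps * world_size

# Single GPU: batch 8, accumulation 16 -> effective batch of 128.
single = effective_batch_size(8, 16, 1)

# Eight GPUs: dividing grad_accum_steps by the world size keeps it at 128.
multi = effective_batch_size(8, 16 // 8, 8)

print(single, multi)  # 128 128
```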
-
Opening this issue to start a discussion about whether it would be worth investing in making it easy to run TensorFlow agents on K8s.
For some inspiration you can look at [TfJob CRD](https://github.com/…
-
Hi, are there any instructions on multi-node, multi-GPU distributed training with hydra train?
https://fairseq.readthedocs.io/en/latest/getting_started.html#distributed-training
The fairseq doc…
-
@MohamedAfham I have successfully integrated the PyTorch DistributedDataParallel mechanism into your codebase, which accelerates the training procedure remarkably and achieves a similar performance with …
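For other readers, the usual DistributedDataParallel wiring looks roughly like the sketch below. This is a generic illustration with a toy model and dataset, not the actual integration in this codebase; it also falls back to a single CPU process when not launched via `torchrun`:

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

def train(num_epochs: int = 2) -> float:
    # torchrun sets these for every spawned process; the defaults below let the
    # sketch also run as a plain single process for debugging.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    os.environ.setdefault("RANK", "0")
    os.environ.setdefault("WORLD_SIZE", "1")

    use_cuda = torch.cuda.is_available()
    dist.init_process_group(backend="nccl" if use_cuda else "gloo")
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    device = torch.device(f"cuda:{local_rank}" if use_cuda else "cpu")
    if use_cuda:
        torch.cuda.set_device(local_rank)

    # Toy stand-ins for the real model and dataset.
    dataset = TensorDataset(torch.randn(256, 16), torch.randn(256, 1))
    sampler = DistributedSampler(dataset)          # shards the data across ranks
    loader = DataLoader(dataset, batch_size=32, sampler=sampler)

    model = DDP(torch.nn.Linear(16, 1).to(device))
    opt = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = torch.nn.MSELoss()

    last_loss = 0.0
    for epoch in range(num_epochs):
        sampler.set_epoch(epoch)                   # new shuffle each epoch
        for x, y in loader:
            opt.zero_grad()
            loss = loss_fn(model(x.to(device)), y.to(device))
            loss.backward()                        # DDP all-reduces gradients here
            opt.step()
            last_loss = loss.item()

    dist.destroy_process_group()
    return last_loss

if __name__ == "__main__":
    # Multi-GPU launch: torchrun --nproc_per_node=<num_gpus> this_script.py
    train()
```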
-
We are using the Slurm Workload Manager, but when compiling custom operators, a bug occurs:
```shell
OSError: CUDA_HOME environment variable is not set. Please set it to your CUDA install root.
```
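On Slurm clusters the CUDA toolkit is often provided through environment modules, so `CUDA_HOME` may be unset inside batch jobs. One possible workaround, assuming `nvcc` is on `PATH` (e.g. after a `module load cuda`), is to derive it before the extension build runs:

```python
import os
import shutil

# If CUDA_HOME is unset, derive it from the location of nvcc
# (e.g. /usr/local/cuda/bin/nvcc -> /usr/local/cuda). Run this before
# importing torch.utils.cpp_extension / building the custom operators.
if "CUDA_HOME" not in os.environ:
    nvcc = shutil.which("nvcc")
    if nvcc:
        os.environ["CUDA_HOME"] = os.path.dirname(os.path.dirname(nvcc))
    else:
        print("nvcc not on PATH; try `module load cuda` in the job script first")
```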
…
-
Hi,
We have a distributed training example in Python (resnet50_trainer.py) but not a C++ version.
Do we have a similar example in C++, or could someone give a quick idea or hint for the di…