-
### Problem Description
Llama3 8B FP8 OOMs at the same batch size as BF16; I need to decrease the batch size to `2` to avoid the OOM. At batch size 2, TE FP8 is **21% slower** than torch compile B…
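A minimal measurement sketch, assuming a standard PyTorch training loop (the `model`/`make_batch` arguments and `profile_step` helper are placeholders, not the actual repro), that records peak GPU memory and average step time so the BF16 and TE FP8 runs can be compared at the same batch size:

```python
import time
import torch

def profile_step(model, make_batch, batch_size, steps=10):
    """Measure peak GPU memory (GiB) and average step time for a training step.

    `model` and `make_batch` stand in for the actual Llama-3 8B setup
    (BF16 vs. TE FP8); this only shows the measurement harness.
    """
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
    torch.cuda.reset_peak_memory_stats()
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(steps):
        batch = make_batch(batch_size)
        loss = model(**batch).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad(set_to_none=True)
    torch.cuda.synchronize()
    step_time = (time.perf_counter() - start) / steps
    peak_gib = torch.cuda.max_memory_allocated() / 2**30
    return peak_gib, step_time
```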
-
General improvements
- [ ] Add some hand-crafted trajectories into the mix via https://arxiv.org/pdf/2401.09241 (e.g. drive to goal, stop, slow, fast, perhaps even other algorithms, etc., with sampl…
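As a rough, hypothetical illustration of what such a hand-crafted trajectory could look like (the waypoint format and function below are invented for this sketch and are not taken from the linked paper or the existing pipeline):

```python
import numpy as np

def drive_to_goal(start, goal, speed=1.0, dt=0.1):
    """Hypothetical hand-crafted trajectory: a straight line from start to goal
    at a constant speed, emitted as (x, y) waypoints every `dt` seconds."""
    start, goal = np.asarray(start, float), np.asarray(goal, float)
    dist = np.linalg.norm(goal - start)
    n_steps = max(int(dist / (speed * dt)), 1)
    alphas = np.linspace(0.0, 1.0, n_steps + 1)[:, None]
    return start + alphas * (goal - start)

# Variants for the other cases mentioned above (stop, slow, fast).
stop = np.repeat([[0.0, 0.0]], 20, axis=0)            # hold position
slow = drive_to_goal((0, 0), (10, 0), speed=0.5)
fast = drive_to_goal((0, 0), (10, 0), speed=3.0)
```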
-
Hello,
I would like to ask for tips on setting up and launching Optuna in a RAPIDS environment for multi-GPU hyper-parameter optimization.
Currently, I am using [OptunaSearchCV]…
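One common pattern is to run one Optuna worker process per GPU against a shared RDB storage, so the trials are split across devices. The sketch below assumes a hypothetical `train_and_score` helper in place of the actual RAPIDS/cuML training code:

```python
# Launch once per GPU, e.g.:
#   CUDA_VISIBLE_DEVICES=0 python tune.py &
#   CUDA_VISIBLE_DEVICES=1 python tune.py &
import optuna

def objective(trial):
    # Placeholder objective: swap in the actual cuML/RAPIDS estimator here.
    lr = trial.suggest_float("lr", 1e-5, 1e-1, log=True)
    n_estimators = trial.suggest_int("n_estimators", 50, 500)
    return train_and_score(lr=lr, n_estimators=n_estimators)  # hypothetical helper

if __name__ == "__main__":
    # All worker processes share the same study through the storage backend,
    # so Optuna coordinates which trials each GPU-bound process runs.
    study = optuna.create_study(
        study_name="rapids-multi-gpu",
        storage="sqlite:///optuna.db",
        direction="maximize",
        load_if_exists=True,
    )
    study.optimize(objective, n_trials=50)
```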
-
Hi,
Is there an example of using the GPU from C++? Is it enough to add the lines below to use the GPU?
```
OrtCUDAProviderOptions cudaOptions;
cudaOptions.device_id = 0;
sessionOptions.AppendExec…
```
-
The Matplotlib graph often freezes when manipulated to change the zoom, axis tilt, or rotation. This was first noticed when generating 3D models for the tract regions. This affected the utility of the pro…
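For context, a minimal sketch of the kind of interactive 3D plot that tends to lag (random points stand in for the tract data; downsampling before plotting is a common mitigation):

```python
import numpy as np
import matplotlib.pyplot as plt

# Large point clouds or meshes make interactive zoom/rotate sluggish, since
# every redraw re-renders all artists; downsample before plotting.
pts = np.random.rand(200_000, 3)
pts = pts[::10]  # keep every 10th point to lighten interactive redraws

fig = plt.figure()
ax = fig.add_subplot(projection="3d")
ax.scatter(pts[:, 0], pts[:, 1], pts[:, 2], s=1)
plt.show()
```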
-
(First priority) For comparison: Llama-2 7B, T5-base
(Second priority) For being current: Llama-3.2 70B, T5-??? (try w/ HF first, then vLLM; see the sketch below)
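For the "try w/ HF first, then vLLM" note, a minimal sketch using the Llama-2 7B checkpoint named above (model IDs are illustrative, and the Llama weights are gated on the Hub):

```python
# Hugging Face transformers first...
from transformers import pipeline

hf_gen = pipeline("text-generation", model="meta-llama/Llama-2-7b-hf", device_map="auto")
print(hf_gen("The capital of France is", max_new_tokens=20)[0]["generated_text"])

# ...then the same checkpoint through vLLM for comparison.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-2-7b-hf")
out = llm.generate(["The capital of France is"], SamplingParams(max_tokens=20))
print(out[0].outputs[0].text)
```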
-
## Description
While reading the LightGBM source code and testing it with a binary classifier, I observed that GPU training performance is notably lower than th…
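For reference, a minimal timing sketch on synthetic data (not the original test setup); whether the GPU tree learner wins depends heavily on dataset size, feature count, and `max_bin`, and small datasets often favor the CPU:

```python
import time
import lightgbm as lgb
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=200_000, n_features=100, random_state=0)

for device in ("cpu", "gpu"):
    params = {
        "objective": "binary",
        "device_type": device,   # "gpu" requires a GPU-enabled LightGBM build
        "max_bin": 63,           # smaller bin counts usually help the GPU tree learner
        "verbose": -1,
    }
    train_set = lgb.Dataset(X, label=y)
    start = time.perf_counter()
    lgb.train(params, train_set, num_boost_round=100)
    print(device, f"{time.perf_counter() - start:.1f}s")
```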
-
**Describe the bug**
When running SubCenter-ArcFace-r100-gpu, the container starts but does not function, reporting the following error (see the logs for the full errors).
I have not seen any worker / job when…
-
### Proposal to improve performance
We used the same GPU on two machines but different CPUs, and drew the following experimental conclusions.
Experimental results: the GPU is a 3090, and the CPU w…
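As background for why the host CPU matters even with an identical GPU, a generic PyTorch sketch (not tied to this codebase) comparing many tiny kernel launches, which are dominated by CPU-side launch overhead, with one large GPU-bound op:

```python
import time
import torch

def timed(fn, warmup=3, iters=10):
    # Average wall-clock time of fn(), with CUDA sync so GPU work is included.
    for _ in range(warmup):
        fn()
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters

small = [torch.randn(64, 64, device="cuda") for _ in range(2000)]
big = torch.randn(4096, 4096, device="cuda")

# Many tiny matmuls: throughput is mostly CPU-side launch overhead, so it
# varies with the host CPU even though the GPU is identical.
print("tiny ops :", timed(lambda: [x @ x for x in small]))
# One large matmul: bound by the GPU itself, roughly CPU-independent.
print("large op :", timed(lambda: big @ big))
```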
-
**Description**
![output_image](https://github.com/user-attachments/assets/bed4e808-a3e0-4225-96c4-04ae69c65a15)
**Triton Information**
…