-
When I execute this command:
./run_local.sh pytorch dlrm terabyte gpu --scenario Server --max-ind-range=40000000 --samples-to-aggregate-quantile-file=./tools/dist_quantile.txt
then:
Using 8 GPU(…
-
**Description**
While running Triton Inference Server using the `k8s-onprem` example, I am getting the error below:
`PermissionError: [Errno 13] Permission denied: '/home/triton-server'`
This is com…
-
Hi and thank you for this amazing plugin. I work at a university with some dedicated GPU nodes, but my laptop doesn't have an NVIDIA GPU. I can run small areas locally, but I was curious if you had a …
-
### 🐛 Describe the bug
I run my server with this:
python3 ./ColossalAI/applications/Chat/inference/server.py /home/ubuntu/modelpath/llama-7b/llama-7b/ --quant 8bit --http_host 0.0.0.0 --http_port 8…
-
**Is your feature request related to a problem? Please describe.**
How can I see the total batch size that dynamic batching creates in the logs?
I can see how many of the requests are grouped by …
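As a workaround, I currently estimate the average batch size from the Prometheus metrics Triton already exposes (on port 8002 by default), rather than from the logs. This is just a sketch, not an official logging feature; `my_model` is a placeholder. `nv_inference_count` counts a batch of n as n inferences, while `nv_inference_exec_count` counts one per (batched) model execution, so their ratio is the average batch size:

```python
import urllib.request

METRICS_URL = "http://localhost:8002/metrics"  # Triton's default metrics port
MODEL = "my_model"                              # placeholder model name

text = urllib.request.urlopen(METRICS_URL).read().decode()

def scrape(metric):
    # Sum the metric across all series (versions/instances) for our model.
    total = 0.0
    for line in text.splitlines():
        if line.startswith(metric) and f'model="{MODEL}"' in line:
            total += float(line.rsplit(" ", 1)[1])
    return total

inferences = scrape("nv_inference_count")       # a batch of n counts as n
executions = scrape("nv_inference_exec_count")  # one per batched execution

if executions:
    print(f"average dynamic batch size: {inferences / executions:.2f}")
```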
-
In OpenPATH, a daily background analysis task fires off that re-runs a clustering model. For each user, the entire history of recorded labeled/unlabeled data is collected. A clustering model is train…
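For illustration, here is a rough sketch of the task's shape, with a stubbed data loader standing in for OpenPATH's actual storage layer and DBSCAN standing in for whatever clusterer is actually used; it shows why the cost of each run grows with the length of the recorded history:

```python
import numpy as np
from sklearn.cluster import DBSCAN

def load_user_trips(user_id):
    # Stub for the storage layer: each trip contributes start/end coordinates.
    # In the real task this returns the user's *entire* recorded history.
    rng = np.random.default_rng(0)
    return rng.normal(size=(200, 4))  # [start_lat, start_lon, end_lat, end_lon]

def retrain_for_user(user_id):
    # The full history is re-collected and the model re-fit from scratch on
    # every daily run, so training cost grows as more data is recorded.
    coords = load_user_trips(user_id)
    return DBSCAN(eps=0.1, min_samples=2).fit(coords)

model = retrain_for_user("user-123")
n_clusters = len(set(model.labels_)) - (1 if -1 in model.labels_ else 0)
print(f"clusters found: {n_clusters}")
```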
-
Is there a way?
-
CausalLM 14B is a SOTA 14B chat model (take benchmarks with a grain of salt), fully compatible with LLaMA 2.
- GGML HF: https://huggingface.co/TheBloke/CausalLM-14B-GGUF
- HF: https://huggingface.…
-
**Describe the bug**
The PyTorch SageMaker endpoint CloudWatch log level is INFO only, which cannot be changed without creating a BYO container.
Hence all the access logs, including /ping besides the /i…
-
**Is your feature request related to a problem? Please describe.**
Currently, the TensorFlow and ONNX backends in Triton support thread controls ([here](https://github.com/triton-inference-server/tens…
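For context, these are the kinds of thread controls being referred to, shown here with standalone ONNX Runtime's `SessionOptions` fields (`model.onnx` is a placeholder path); the request is about exposing equivalent knobs through the Triton backend configuration:

```python
import onnxruntime as ort

opts = ort.SessionOptions()
opts.intra_op_num_threads = 4   # threads used within a single operator
opts.inter_op_num_threads = 2   # threads used across independent operators
opts.execution_mode = ort.ExecutionMode.ORT_PARALLEL

session = ort.InferenceSession("model.onnx", sess_options=opts)
```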