-
**Description**
Triton does not clear or release GPU memory when there is a pause in inference. The attached diagrams show the same model in use; it is served via ONNX.
![image (1)](https:…
-
**Problem**
I need to generate a large number of small JSON documents with an LLM. To do so I started with [Jsonformer](https://github.com/1rgs/jsonformer). However, since this is no longer maintained and my colleagu…
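For context on the approach, here is a minimal pure-Python sketch of the core idea behind schema-constrained generation as Jsonformer popularized it: walk the JSON schema and ask the model only for individual values, so the output is valid JSON by construction. The `generate_value` stub below stands in for a real LLM call; its names and placeholder return values are assumptions for illustration, not Jsonformer's actual API.

```python
import json

def generate_value(field_name: str, field_type: str) -> object:
    # Placeholder "model": a real implementation would prompt the LLM with
    # the partial JSON built so far and constrain decoding to this type.
    samples = {"string": f"<{field_name}>", "number": 0, "boolean": True}
    return samples[field_type]

def fill_schema(schema: dict) -> dict:
    # Walk a (simplified) JSON Schema and build the object field by field,
    # recursing into nested objects. Because we assemble the structure
    # ourselves, the result is always syntactically valid JSON.
    result = {}
    for name, spec in schema["properties"].items():
        if spec["type"] == "object":
            result[name] = fill_schema(spec)
        else:
            result[name] = generate_value(name, spec["type"])
    return result

schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "number"},
        "active": {"type": "boolean"},
    },
}

obj = fill_schema(schema)
print(json.dumps(obj))
```

The point of the design is that the model never emits braces, quotes, or commas itself, so malformed output is impossible even for small models.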
-
Thank you very much for your open-source contribution; it is very helpful for my current work!
However, I encountered some problems. In version 1.2, when running inference with the 93x480p version on an A800 80G …
-
-
I found that fastpath inference does not seem to be supported, although it is an optimization used in
https://github.com/mlfoundations/open_clip/blob/main/src/open_clip/transformer.py#L224.
May I know what is the rea…
-
```julia
module Foo
using Base.Experimental: @opaque
some_method(x) = 2x
make_oc() = @opaque (x::Int)->some_method(x)
precompile(make_oc, ())
end # module Foo
```
When using this, I …
-
Hi,
I tried to run the 7B INT4 LLM model on the NPU, but the performance was not very good, only about 2 tokens/s. One possible reason may be that the NPU has been loading …
-
Our company
Company: US-Based 💵 Annual Compensation: $100k - $140k USD (Approx. R$550k - R$750k)
Job description
🔍 Responsibilities:
Build tools to monitor inference infrastructure performan…
-
**Describe the bug**
Unable to optimize a model with device `cpu` and precision `int8`; the run ends with a `KeyError: 'input_model'`.
**To Reproduce**
Start with this example: https://github.com/micr…
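A `KeyError: 'input_model'` typically means the tool could not find a top-level `input_model` section in the configuration JSON (missing, misspelled, or nested in the wrong place). As a minimal sketch of the expected shape, assuming a PyTorch source model, the specific type names and paths below are illustrative assumptions, not a verified configuration:

```json
{
  "input_model": {
    "type": "PyTorchModel",
    "config": { "model_path": "model.pt" }
  },
  "passes": {
    "quantize": { "type": "OnnxQuantization" }
  }
}
```

Checking that the config file actually loaded (correct path, valid JSON) and that the `input_model` key sits at the top level is a reasonable first step before digging further.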
-
In my current tests, inference is very slow. I am using an A100 for inference, and generating 1 minute of audio can sometimes take close to 1 minute of inference time. Are there any ways to optimize this?