-
I have an OP6 and would like to flash images from packages named like `enchilada_22_O.15_180810.ops`, making sure that all images from the `.ops` are written to both the device's A and B partition slots…
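As a sketch of the "both slots" part: on A/B devices each partition exists as `_a` and `_b`, so every image from the package has to be flashed twice. The helper below is hypothetical (the `.ops` unpacking step and the assumption that partition names match the image basenames are mine, not from the package format) and only builds the `fastboot flash` command strings:

```python
# Hypothetical helper: given image files unpacked from the .ops payload,
# emit the fastboot commands that write each image to both the _a and _b
# slot partitions. Assumes partition names match the image basenames.
from pathlib import Path

def slot_flash_commands(image_paths):
    cmds = []
    for p in map(Path, image_paths):
        for slot in ("a", "b"):
            # e.g. "fastboot flash boot_a boot.img"
            cmds.append(f"fastboot flash {p.stem}_{slot} {p}")
    return cmds

print(slot_flash_commands(["boot.img", "system.img"]))
```

Whether every image in the package is actually slot-suffixed on the device is something to verify against `fastboot getvar all` first.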
-
Edge doesn't contain "Server", and the metric should be latency, not Queries/s. Could you correct this?
![image](https://github.com/mlcommons/inference/assets/6924448/02c88f64-2073-495a-bcc2-f6d37b…
-
This feature proposal aims to improve the accuracy of task classification in our project by leveraging GPT-J and ChatGPT, together with a cache. By using GPT-J, a small and o…
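A minimal sketch of the caching idea (the function name and the trivial stand-in classifier are my assumptions, not part of the proposal): identical task descriptions hit the cache instead of triggering another model call.

```python
# Sketch of the proposed cache layer: memoize classification results so
# repeated task descriptions never re-invoke the model.
from functools import lru_cache

@lru_cache(maxsize=1024)
def classify_task(description: str) -> str:
    # Stand-in for a GPT-J / ChatGPT call; here a trivial keyword rule
    # just so the example is self-contained and runnable.
    return "coding" if "code" in description.lower() else "other"

classify_task("write code to sort a list")   # computed
classify_task("write code to sort a list")   # served from cache
```

In practice the cache key would need normalization (case, whitespace) so near-duplicate descriptions also hit.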
-
Hi 👋🏻
Coming from [this](https://github.com/ggerganov/ggml/blob/master/examples/gpt-j/convert-h5-to-ggml.py) GGML conversion script and the issue you commented on in https://github.com/ggerganov/…
-
Dear author,
How can I run T5 the way the gpt-2 or gpt-j examples run?
Thanks
-
**Describe the bug**
Using DeepSpeed Inference (using `deepspeed.init_inference`) gives weird outputs when using batch size > 1 and padding the inputs.
I'll first state the problem with more detai…
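One detail that commonly matters here, shown as a pure-Python sketch with no DeepSpeed involved (`left_pad` is a hypothetical helper of mine): decoder-only generation generally wants *left* padding, so that the last position of every row in the batch holds a real token rather than a pad.

```python
# Sketch: left-pad a batch of token-id sequences to a common length, so
# the final position of each row is a real token (relevant for batched
# decoder-only generation).
def left_pad(batch, pad_id=0):
    width = max(len(row) for row in batch)
    return [[pad_id] * (width - len(row)) + row for row in batch]

print(left_pad([[1, 2], [3]]))
```

With right padding instead, the shorter rows end in pad tokens, which is one known source of garbage outputs at batch size > 1.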
-
The llama.cpp project already has an option to build with `-pg` via `LLAMA_GPROF=1`.
But `llama-cli` crashes when traced with uftrace, as follows.
```
$ git clone https://github.com/gg…
-
### Feature request
Flash Attention 2 is a library that provides attention operation kernels for faster and more memory-efficient inference and training: https://github.com/Dao-AILab/flash-attentio…
-
Hi! I'm working on reproducing your [Argo workflow for fine-tuning GPT-J](https://github.com/coreweave/kubernetes-cloud/tree/master/finetuner-workflow).
I'm able to create a PVC, download the da…
-
```
py", line 70, in set_module_8bit_tensor_to_device
    new_value = bnb.nn.Int8Params(new_value, requires_grad=False, has_fp16_weights=has_fp16_weights).to(device)
  File "/opt/conda/lib/python3.10/si…
```