Closed NikolayTV closed 1 year ago
Hi! Are we talking about CPU, GPU, or Metal inference speed, and on Windows, Linux, or macOS? It depends on many things: memory bandwidth, number of threads, OpenBLAS vs. GPU offload, CUDA driver versions, etc. Write a test script, try the various options on your specific system, and compare the execution times.
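To make those comparisons fair, it helps to warm up first (the first call often pays one-time costs like CUDA kernel compilation) and then take the median of several runs. Here is a minimal, generic timing sketch; `fake_generate` is a hypothetical stand-in that you would replace with your actual `model.generate(...)` call:

```python
import time
from statistics import median

def benchmark(fn, *, warmup=1, runs=5):
    """Run `fn` a few times and return the median wall-clock duration in seconds.

    Warm-up iterations are excluded so one-time setup costs
    (kernel compilation, cache population) don't skew the result.
    """
    for _ in range(warmup):
        fn()
    times = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        times.append(time.perf_counter() - start)
    return median(times)

# Hypothetical stand-in for a real inference call such as model.generate().
def fake_generate():
    sum(i * i for i in range(100_000))

print(f"median run: {benchmark(fake_generate):.4f}s")
```

Run it once per configuration (thread count, offload setting, etc.) and compare the printed medians.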
Also, there are several dedicated inference frameworks, such as https://github.com/vllm-project/vllm and https://github.com/ggerganov/llama.cpp. They should be faster than plain HF Transformers.
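With llama.cpp, the knobs mentioned above map directly onto command-line flags. A sketch of an invocation (the model path is a placeholder; adjust flag values to your hardware):

```shell
# -t: CPU threads, -ngl: layers offloaded to the GPU, -n: tokens to generate
./llama-cli -m ./models/your-model.gguf \
    -p "Hello" \
    -t 8 \
    -ngl 32 \
    -n 128
```

llama.cpp prints per-stage timings (prompt eval and generation tokens/s) at the end of each run, which makes comparing `-t` and `-ngl` settings straightforward.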
Hi. I'm wondering if there is any parameter or trick to increase speed? What does it depend on? Maybe max_token_size or something else?