Contact Details
rpchastain@protonmail.com
What happened?
I'm running Mixtral-8x7B-Instruct-v0.1-llamafile in server mode on an AWS g6.12xlarge EC2 instance with 4 NVIDIA L4 GPUs, with full GPU offloading (-ngl 999). The EC2 instance is running Amazon Linux 3.
I'm using the OpenAI Python client to call the API in a for loop with rather large prompts. After a few iterations (usually about 6, but sometimes as few as 1 or as many as 8), the API stops returning responses and appears to hang.
I've tried tweaking various parameters, but nothing seems to help.
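For reference, the failing loop looks roughly like the sketch below. This is an illustration, not the reporter's actual script: the host, port, model name, and prompts are assumptions, and it uses the standard library's urllib against llamafile's OpenAI-compatible /v1/chat/completions endpoint rather than the openai package, though the HTTP traffic is equivalent.

```python
# Hypothetical repro of the loop described above. Host, port, model name,
# and prompts are placeholder assumptions, not taken from the report.
import json
import urllib.request

BASE_URL = "http://localhost:8080/v1"  # llamafile server's default port


def build_request(prompt, model="Mixtral-8x7B-Instruct-v0.1"):
    """Build the chat-completion POST request for one prompt."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode()
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer no-key",  # llamafile ignores the key
        },
    )


def run_batch(prompts, timeout=120):
    """Send prompts one at a time, as in the failing for loop.

    A client-side timeout turns the observed hang into an exception
    instead of blocking forever.
    """
    replies = []
    for prompt in prompts:
        with urllib.request.urlopen(build_request(prompt), timeout=timeout) as r:
            data = json.load(r)
        replies.append(data["choices"][0]["message"]["content"])
    return replies
```

Setting an explicit timeout like this at least distinguishes a genuine server-side hang from very slow generation on large prompts.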
The service is started with this command:
Version
./Mixtral-8x7B-Instruct-v0.1-llamafile --version

gcc --version
gcc (GCC) 11.4.1 20230605 (Red Hat 11.4.1-2)
Copyright (C) 2021 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Thu_Jun__6_02:18:23_PDT_2024
Cuda compilation tools, release 12.5, V12.5.82
Build cuda_12.5.r12.5/compiler.34385749_0
What operating system are you seeing the problem on?
Linux
Relevant log output