567-labs / fastllm

A collection of LLM services you can self-host via Docker or Modal Labs to support your application's development
MIT License

Ran some additional benchmarks and updated the article #31

Closed by ivanleomk 10 months ago

ivanleomk commented 10 months ago

Ran a benchmark with a non-async embed function and got this result:

{
  "downscale": 0.001,
  "batch_size": 512,
  "n_gpu": 10,
  "duration": 26.808036148999996,
  "batches_per_second": 10333,
  "extrapolated_duration": 26808,
  "extrapolated_duration_fmt": "7:26:48",
  "extrapolated_duration_cps_fmt": "7:26:49.929353"
}

The key point is that we can now cut the time taken by ~5.5 hours from the 7.5-hour run here, which is roughly a 73% decrease. I'd love to try a benchmark with 50 or 100 GPUs to see if we can bring the time taken down to around 30 minutes with 100 GPUs, async code and a batch size of 512 * 100.
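
For context, here is a quick sketch of how the extrapolated figures in the JSON above follow from the measured run. It assumes that downscale is the fraction of the dataset actually embedded in the timed run, which is consistent with extrapolated_duration being duration / downscale:

from datetime import timedelta

downscale = 0.001                 # fraction of the dataset embedded in the timed run
duration = 26.808036148999996     # seconds measured for that downscaled run

# Linear extrapolation to the full dataset.
extrapolated_seconds = duration / downscale          # ~26808 s
print(timedelta(seconds=int(extrapolated_seconds)))  # 7:26:48

# The claimed saving: shaving ~5.5 h off the 7.5 h baseline is roughly a 73% reduction.
print(f"{5.5 / 7.5:.0%}")  # 73%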

ivanleomk commented 10 months ago

Here is the non-async embed function, for reference:

import numpy as np
from modal import method

# GPU_CONFIG, N_GPU, tei_image, spawn_server and stub are defined elsewhere in the script.
@stub.cls(
    gpu=GPU_CONFIG,
    image=tei_image,
    # Use up to 10 GPU containers at once.
    concurrency_limit=N_GPU,
    retries=3,
)
class TextEmbeddingsInference:
    def __enter__(self):
        # If the process runs for a long time, the client does not seem to close
        # its connections, which results in a pool timeout.
        from httpx import AsyncClient

        self.process = spawn_server()
        self.client = AsyncClient(base_url="http://127.0.0.1:8000", timeout=30)

    def __exit__(self, _exc_type, _exc_value, _traceback):
        self.process.terminate()

    @method()
    async def embed(self, chunks):
        """Embeds a list of chunks, where each chunk is an (id, url, title, text) tuple."""

        texts = [chunk[3] for chunk in chunks]
        res = await self.client.post("/embed", json={"inputs": texts})
        embeddings = res.json()
        return chunks, np.array(embeddings)
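
For completeness, here is a rough sketch of how a driver could fan 512-chunk batches out to this class. It assumes the stub-era Modal API (a @method()-decorated function invoked with .map() from a stub.local_entrypoint()); generate_batches and load_chunks are illustrative helpers, not code from this repo:

# Illustrative only: group (id, url, title, text) chunks into 512-element batches.
def generate_batches(chunks, batch_size=512):
    batch = []
    for chunk in chunks:
        batch.append(chunk)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch


@stub.local_entrypoint()
def main(downscale: float = 0.001, batch_size: int = 512):
    # load_chunks is a placeholder for however the downscaled dataset sample is produced.
    batches = generate_batches(load_chunks(downscale), batch_size)
    model = TextEmbeddingsInference()
    # Each mapped batch runs on one of the N_GPU containers configured above.
    for chunks, embeddings in model.embed.map(batches):
        pass  # persist the embeddings here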