567-labs / fastllm

A collection of LLM services you can self-host via Docker or Modal Labs to support your application's development
MIT License

Ran some additional benchmarks and updated the article #31

Closed by ivanleomk 10 months ago

ivanleomk commented 10 months ago

Ran a benchmark with a non-async embed function and got this result:

{
  "downscale": 0.001,
  "batch_size": 512,
  "n_gpu": 10,
  "duration": 26.808036148999996,
  "batches_per_second": 10333,
  "extrapolated_duration": 26808,
  "extrapolated_duration_fmt": "7:26:48",
  "extrapolated_duration_cps_fmt": "7:26:49.929353"
}

The key point is that we can now cut the time taken by ~5.5 hours from the 7.5-hour run here, which is roughly a 73% decrease. I'd love to try a benchmark with 50 or 100 GPUs to see if we can bring the time taken down to around 30 minutes with 100 GPUs, async code and a batch size of 512 * 100.
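
For context, here is a quick sketch of how the extrapolated figures in the JSON above follow from the measured run. It assumes that downscale is the fraction of the dataset actually embedded in the timed run, which is consistent with extrapolated_duration being duration / downscale:

from datetime import timedelta

downscale = 0.001                 # fraction of the dataset embedded in the timed run
duration = 26.808036148999996     # seconds measured for that downscaled run

# Linear extrapolation to the full dataset.
extrapolated_seconds = duration / downscale          # ~26808 s
print(timedelta(seconds=int(extrapolated_seconds)))  # 7:26:48

# The claimed saving: shaving ~5.5 h off the 7.5 h baseline is roughly a 73% reduction.
print(f"{5.5 / 7.5:.0%}")  # 73%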

ivanleomk commented 10 months ago

Here is the non-async embed function, for reference:

import numpy as np
from modal import method

# GPU_CONFIG, N_GPU, tei_image, spawn_server and stub are defined elsewhere in the script.
@stub.cls(
    gpu=GPU_CONFIG,
    image=tei_image,
    # Use up to 10 GPU containers at once.
    concurrency_limit=N_GPU,
    retries=3,
)
class TextEmbeddingsInference:
    def __enter__(self):
        # If the process runs for a long time, the client does not seem to close
        # its connections, which results in a pool timeout.
        from httpx import AsyncClient

        self.process = spawn_server()
        self.client = AsyncClient(base_url="http://127.0.0.1:8000", timeout=30)

    def __exit__(self, _exc_type, _exc_value, _traceback):
        self.process.terminate()

    @method()
    async def embed(self, chunks):
        """Embeds a list of chunks, where each chunk is an (id, url, title, text) tuple."""

        texts = [chunk[3] for chunk in chunks]
        res = await self.client.post("/embed", json={"inputs": texts})
        embeddings = res.json()
        return chunks, np.array(embeddings)
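
For completeness, here is a rough sketch of how a driver could fan 512-chunk batches out to this class. It assumes the stub-era Modal API (a @method()-decorated function invoked with .map() from a stub.local_entrypoint()); generate_batches and load_chunks are illustrative helpers, not code from this repo:

# Illustrative only: group (id, url, title, text) chunks into 512-element batches.
def generate_batches(chunks, batch_size=512):
    batch = []
    for chunk in chunks:
        batch.append(chunk)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch


@stub.local_entrypoint()
def main(downscale: float = 0.001, batch_size: int = 512):
    # load_chunks is a placeholder for however the downscaled dataset sample is produced.
    batches = generate_batches(load_chunks(downscale), batch_size)
    model = TextEmbeddingsInference()
    # Each mapped batch runs on one of the N_GPU containers configured above.
    for chunks, embeddings in model.embed.map(batches):
        pass  # persist the embeddings here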