Here is the non-async embed function for reference:
```python
import numpy as np
from modal import method

# `stub`, `GPU_CONFIG`, `tei_image`, `N_GPU`, and `spawn_server` are
# defined elsewhere in the script.


@stub.cls(
    gpu=GPU_CONFIG,
    image=tei_image,
    # Use up to 10 GPU containers at once.
    concurrency_limit=N_GPU,
    retries=3,
)
class TextEmbeddingsInference:
    def __enter__(self):
        # If the process runs for a long time, the client does not seem
        # to close its connections, which results in a pool timeout.
        from httpx import Client

        self.process = spawn_server()
        self.client = Client(base_url="http://127.0.0.1:8000", timeout=30)

    def __exit__(self, _exc_type, _exc_value, _traceback):
        self.process.terminate()

    @method()
    def embed(self, chunks):
        """Embeds a list of texts. id, url, title, text = chunks[0]"""
        texts = [chunk[3] for chunk in chunks]
        res = self.client.post("/embed", json={"inputs": texts})
        embeddings = res.json()
        return chunks, np.array(embeddings)
```
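The snippet above assumes a `spawn_server` helper that launches the text-embeddings-inference server inside the container and blocks until it accepts connections. A minimal sketch of what that helper might look like (the launch command, flags, and port here are assumptions, not necessarily what the original script uses):

```python
import socket
import subprocess


def spawn_server() -> subprocess.Popen:
    """Start the TEI server and block until it accepts connections."""
    # Illustrative launch command; the real script presumably passes its
    # own model id, port, and batching flags.
    process = subprocess.Popen(["text-embeddings-router", "--port", "8000"])

    # Poll until the server accepts TCP connections on the expected port.
    while True:
        try:
            socket.create_connection(("127.0.0.1", 8000), timeout=1).close()
            return process
        except (socket.timeout, ConnectionRefusedError):
            # Fail fast if the server process died during startup.
            retcode = process.poll()
            if retcode is not None:
                raise RuntimeError(f"server exited with code {retcode}")
```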
Ran a benchmark with the non-async embed function and got this result: the time taken drops by ~5.5 hours, from 7.5 hours down to about 2 hours, which is roughly a 73% decrease. I'd love to run a benchmark with 50 or 100 GPUs to see if we can bring the time down to around 30 minutes using 100 GPUs, async code, and a batch size of 512 * 100.
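For context on what that scaled-up run might look like, here is a hedged sketch of a driver, assuming the same stub-era Modal API as the class above; `generate_batches`, `BATCH_SIZE`, and the `N_GPU` value are illustrative assumptions, not the original script's code:

```python
BATCH_SIZE = 512  # chunks sent per embed() call (assumed)
N_GPU = 100       # containers to fan out across (assumed)


@stub.local_entrypoint()
def main():
    # `generate_batches` is a hypothetical helper that yields lists of
    # (id, url, title, text) tuples, BATCH_SIZE chunks at a time.
    batches = generate_batches(batch_size=BATCH_SIZE)

    # .map() fans batches out across up to N_GPU containers, so roughly
    # BATCH_SIZE * N_GPU chunks are in flight at once.
    model = TextEmbeddingsInference()
    for chunks, embeddings in model.embed.map(batches, order_outputs=False):
        print(f"embedded {len(chunks)} chunks -> shape {embeddings.shape}")
```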