albertisfu opened this issue 1 month ago
Interesting research, thanks Alberto. I didn't realize that we can use CPUs for the queries. That's great news.
Do we have an idea of how often we want to keep opinion text embeddings updated from the database?
As quickly as possible. I was hoping to do it in real time, like we do with keyword search.
Are we considering embedding versioning on S3? Should we preserve old versions for some time before deletion, or only keep the latest version of each embedding?
I think we can throw away old embeddings, BUT if we switch embedding models we should probably keep the embeddings for prior models. I think this just means we put the model name into the S3 path.
Do we have an idea of what the ideal batch size would be to get the most out of GPU resources?
I certainly do not, but I'd love to see the APIs in sentence transformers to see how they work. How do batches work in this world? Is it just that we send lots of things very quickly to the GPU, or do you need to send them in some sort of batch?
My understanding of how this works is pretty different than yours, Alberto, but I'm just going on intuition, so I'm probably wrong. But my assumption was that GPUs are fast because they can process individual requests in parallel, not because they process multiple requests in parallel. Interesting.
Do you know if running this model on CPUs would have any drawbacks compared to GPUs, besides speed?
I don't, no.
Web framework...
Yeah, I usually have a strong preference for Django, but FastAPI has gotten very popular very fast, and it is a leaner, meaner tool for this kind of thing. I'm open to it, but I'm half-convinced that we'd be better off doing what we know.
Generally, I think your architecture looks good, but I think my hope would be to do it all synchronously:
Could we have a microservice that uses a CPU sometimes and a GPU other times?
Right now, we'd start it on a machine with a GPU, and we'd use that to do all our batch work efficiently. Once that's done, we'd move the pod to a machine that uses CPUs instead. If we're lucky a couple CPU pods can run the models just fine on a day-to-day basis.
Could we do away with storing the text on S3? I'm not sure I understand what we gain by doing that.
Can we design the celery tasks so that sending the embeddings to Elastic is optional?
My thought is that we can have one task. Right now, it just saves the embeddings to S3, and we have a separate django command to pull the embeddings and put them in Elastic. Later, once the batch work is done, the celery tasks save to S3 and push to Elastic.
Thanks Alberto. What I did in the past was split the decisions into 350-word chunks and send the chunks for embedding in batches of 50. I can share that code if you need it. That worked just fine on my Tesla V100 GPU (16 GB). In my experience, embedding with a CPU is much slower, but we can run an experiment on our instance. Please let me know if you need anything from me.
Can you shed any light on how GPUs perform batching, @legaltextai?
as in code wise? speed wise? or processor wise? i think the batching, as in chunking texts + combining + sending to the api, is done by the cpu, but i am not super knowledgeable about the default division of tasks between cpu and gpu.
I'm trying to understand how the GPU performs. Alberto said that it needs batches to use SIMD, so I'm trying to understand how that works.
As quickly as possible. I was hoping to do it in real time, like we do with keyword search.
Got it. In that case, I think we’ll need to generate a large initial embedding and index it into Elasticsearch, which will take some time (this can be done via a command as described above). After that, we can handle new opinions and updates by triggering them through signals, just as we do for regular indexing.
I think we can throw away old embeddings, BUT if we switch embedding models we should probably keep the embeddings for prior models. I think this just means we put the model name into the S3 path.
Sounds good. We can just override embeddings on updates.
I certainly do not, but I'd love to see the APIs in sentence transformers to see how they work. How do batches work in this world? Is it just that we send lots of things very quickly to the GPU, or do you need to send them in some sort of batch?
Well, in terms of the Sentence Transformers library, the `encode` method has a `batch_size` parameter that defaults to 32:

```python
embeddings = model.encode(chunked_texts, batch_size=32)
```

You send a list of chunks, and the `encode` method takes care of sending them to the GPU in batches of 32 chunks. As I understand it, if the `chunked_texts` list is longer than the `batch_size`, the batches for all the chunks are processed sequentially.
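For a fuller picture, here is a self-contained sketch of the same call (the model name is just illustrative, not a final choice):

```python
# Minimal, self-contained sketch of the Sentence Transformers API.
# The model name is illustrative only.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-mpnet-base-v2")  # ~109M-parameter model, as an example

chunked_texts = ["first 350-word chunk ...", "second 350-word chunk ..."]

# encode() internally splits the list into batches of `batch_size` and runs them
# sequentially on whatever device the model was loaded on (GPU if available).
embeddings = model.encode(chunked_texts, batch_size=32)
print(embeddings.shape)  # (number_of_chunks, embedding_dimension)
```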
My understanding of how this works is pretty different than yours, Alberto, but I'm just going on intuition, so I'm probably wrong. But my assumption was that GPUs are fast because they can process individual requests in parallel, not because they process multiple requests in parallel. Interesting.
Yeah, I think that's right. A single operation can be parallelized, allowing it to finish faster. However, it depends on the size of the task you send. This blog post is interesting and includes some benchmarks.
It mentions using a book of 1,000 pages split into chunks of 900 characters, resulting in around 5,000 chunks.
Then, it assesses the performance and memory usage across various models:
They concluded that, depending on the model size, there is an ideal batch size where performance peaks and memory usage is balanced.
Processing with a batch size of 1 takes the most time to complete the whole task because the GPU resources are underutilized. For this task and these models, the optimal batch size is around 5; increasing it further doesn't boost performance and sometimes makes it worse, while memory usage increases.
So selecting a batch size is crucial, and it depends on the model and hardware used. Sergei mentioned that chunks of 350 words and batches of 50 worked well on his hardware, so that can be a starting point, and we can perform some additional benchmarks to adapt the batch size to the type of compute instances we're going to use.
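To adapt the batch size to our hardware, a rough sweep along these lines could work (the model name and the synthetic chunks are placeholders):

```python
# Rough batch-size sweep to find where throughput peaks on a given machine.
# The model name and the sample texts are placeholders, not final choices.
import time

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-mpnet-base-v2")  # placeholder model
chunks = ["some 350-word chunk of opinion text ..."] * 5000  # synthetic workload

for batch_size in (1, 5, 16, 32, 50, 64, 128):
    start = time.perf_counter()
    model.encode(chunks, batch_size=batch_size, show_progress_bar=False)
    elapsed = time.perf_counter() - start
    print(f"batch_size={batch_size}: {elapsed:.1f}s total, {len(chunks) / elapsed:.0f} chunks/s")
```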
Generally, I think your architecture looks good, but I think my hope would be to do it all synchronously:
Do you mean that we may avoid using Celery if we can? Does that also include the initial batching work?
I think using Celery for processing the initial embeddings makes sense only if we plan to use an EC2 instance with multiple GPUs, each running the model. In that way, workers can divide the work across the different available GPUs reliably.
Could we have a microservice that uses a CPU sometimes and a GPU other times?
Right now, we'd start it on a machine with a GPU, and we'd use that to do all our batch work efficiently. Once that's done, we'd move the pod to a machine that uses CPUs instead. If we're lucky a couple CPU pods can run the models just fine on a day-to-day basis.
Yeah, I think that's possible. We’d just need to get the right settings to deploy models either on a GPU or CPU. Also, the pattern for processing work might change, using batches for GPU processing and sending concurrent requests for the CPU.
I think the microservice can have two endpoints: one for batch processing on the GPU and one for single requests on the CPU. This setup can work well for embedding search queries and opinion texts, but some benchmarking would be required to assess whether the CPU would be fast enough for processing large texts in a reasonable time.
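For reference, selecting the device in the library looks straightforward; a minimal sketch, assuming the PyTorch backend and a placeholder model name:

```python
# Sketch: load the same model on a GPU when one is available, otherwise on the CPU.
import torch
from sentence_transformers import SentenceTransformer

device = "cuda" if torch.cuda.is_available() else "cpu"
model = SentenceTransformer("all-mpnet-base-v2", device=device)  # placeholder model name
```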
Could we do away with storing the text on S3? I'm not sure I understand what we gain by doing that.
Yeah, I was thinking of storing texts on S3 because I initially thought the ideal batch size for efficient use of GPU resources would be a large number, which could mean a batch of texts adding up to gigabytes of data. That would result in a huge HTTP request. However, now that it seems the ideal batch size can be lower than 50, I don't see a problem with just sending texts via HTTP directly to the microservice.
Can we design the celery tasks so that sending the embeddings to Elastic is optional?
My thought is that we can have one task. Right now, it just saves the embeddings to S3, and we have a separate django command to pull the embeddings and put them in Elastic. Later, once the batch work is done, the celery tasks save to S3 and push to Elastic.
Yeah, sure. This makes sense to me. For batch processing, we use a Django command to index all the initial embeddings into ES, and then for new embeddings and updates, the same service can take care of indexing them into elasticsearch as they are generated.
i agree with Alberto. let's run the test on CPU first for queries. will that instance be always on, with the model loaded into memory?
will that instance be always on, with the model loaded into memory?
Yes, the idea is that if the CPU instance with the model loaded into memory is fast enough to embed queries at search time, that instance will always be on. If it is also fast enough to embed opinion texts, we can use it to generate embeddings on a daily basis and no longer need the GPU instance after the initial batch work is completed.
would you help us run the benchmarks on a CPU?
of course. do you want me to run the test on my server (with cpu / gpu) or will i need access to an aws instance?
Unless Mike has a different opinion, I think it's okay to run the benchmark on your server, considering you can load the model into memory and execute the computations on the CPU. I believe that will give us an idea of how it performs on the CPU. If you share the resources you used on your server, we can then select something similar on EC2.
The idea behind this test is to measure the model's throughput for query embedding generation on the CPU. We can consider an average query size. Currently, we don't have a defined average size for queries since we are not logging them yet, but a couple hundred characters is probably a good starting point. I assume a query will fit in a single chunk with a `batch_size` of 1.
It would also be great if you could measure the model's throughput on the CPU using a large opinion text, testing with the chunk size you used on the GPU and experimenting with different batch sizes. However, I'm not sure the `batch_size` is useful for CPU testing; given the way the CPU handles embedding computations, varying it might not be significant.
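Something like this minimal timing sketch is what I have in mind for the query side (the query text and number of repetitions are arbitrary):

```python
# Sketch of the query-latency measurement on CPU. The model name, query text,
# and number of repetitions are arbitrary choices for illustration.
import time

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-mpnet-base-v2", device="cpu")  # placeholder model
query = "previous conviction for a crime of violence sentencing enhancement appeal"

runs = 10
start = time.perf_counter()
for _ in range(runs):
    model.encode([query], batch_size=1)
total = time.perf_counter() - start
print(f"total: {total:.4f}s, avg per query: {total / runs:.4f}s")
```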
Here are my results for CPU vs GPU:
| Task | CPU (time / avg) | GPU (time / avg) |
|---|---|---|
| Queries | 0.0772s / 0.0077s | 0.0223s / 0.0022s |
| Paragraphs | 0.6064s / 0.0606s | 0.0458s / 0.0046s |
Here are the specs for my CPU and GPU:
Processor: x86_64
Physical cores: 28
Total cores: 56
CPU Usage: 8.7%
Max Frequency: 3600.00 MHz
Current Frequency: 1362.89 MHz
RAM Information:
Total: 125.76 GB
Available: 27.67 GB
Used: 48.13 GB
Percentage: 78.0%
GPU Information:
GPU Available: Yes
Number of GPUs: 1
GPU 0: Tesla V100-PCIE-16GB
GPU 0 Memory: 15.77 GB
Here is the notebook, if you would like to replicate it on your instance.
Generally, I think your architecture looks good, but I think my hope would be to do it all synchronously:
Do you mean that we may avoid using Celery if we can?
Oh, no, I just mean that we should take out the step of putting the text on S3 before processing it. I think Celery is probably a good tool for parallelizing things.
Does that also include the initial batching work?
I imagine Celery would be a good tool for this like usual, but the goal is to pull objects from the DB, send them to the microservice, and keep it saturated. The simplest way to do that is the goal, I think.
I think the microservice can have two endpoints: one for batch processing on the GPU and one for single requests on the CPU.
My hope was that we can just use the CPU after we've done the initial embeddings, so I'm hoping that one endpoint will work. It could take a list of chunked_texts, and return a list of embedding objects. If it has a GPU available, it uses that. If not, then it uses the CPU.
Do developers need to choose the CPU or GPU when making their call to the microservice?
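Roughly what I'm picturing, just as a sketch (assuming FastAPI and Sentence Transformers; the model and field names are made up):

```python
# Sketch of a single embedding endpoint that uses the GPU when present, else the CPU.
# Framework (FastAPI), model name, and field names are assumptions, not decisions.
import torch
from fastapi import FastAPI
from pydantic import BaseModel
from sentence_transformers import SentenceTransformer

app = FastAPI()
device = "cuda" if torch.cuda.is_available() else "cpu"
model = SentenceTransformer("all-mpnet-base-v2", device=device)  # placeholder model


class EmbeddingRequest(BaseModel):
    chunked_texts: list[str]


@app.post("/embed")
def embed(request: EmbeddingRequest) -> dict:
    # The same endpoint serves a single query (a one-element list) or a batch of chunks;
    # callers never need to know which device is in use.
    vectors = model.encode(request.chunked_texts, batch_size=32)
    return {"embeddings": [v.tolist() for v in vectors]}
```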
Yeah, sure. This makes sense to me. For batch processing, we use a Django command to index all the initial embeddings into ES, and then for new embeddings and updates, the same service can take care of indexing them into elasticsearch as they are generated.
👍🏻
Yes, the idea is that if the CPU instance with the model loaded into memory is fast enough to embed queries at search time, that instance will always be on. If it is also fast enough to embed opinion texts, we can use it to generate embeddings on a daily basis and no longer need the GPU instance after the initial batch work is completed.
Exactly. 🎯
Here are my results for CPU vs GPU:
These are great, thanks! I think my takeaway is that we can definitely do queries on a CPU in real time, no problem. (Average of 0.007s is great!)
The per-paragraph speed seems to be 0.06s on CPU and 0.005s on GPU. So if a doc has 100 paragraphs, that's six seconds on the CPU and half a second on the GPU. I think that's fine for ongoing updates, and that we'll want the GPU for the initial indexing (no surprise).
What do you guys think?
I am totally OK with GPU for embedding texts in batches and CPU for queries, as long as we have a CPU with specs similar to mine. If not, we'll need to run the tests again. I presume the ES search phase should be constant regardless of how the embeddings were generated.
Our CPUs should be fine, yep! Great.
Alberto, I think this means we've got our architecture in good shape? What else is on your mind?
Great, the CPU looks quite promising. I'll refine the architecture diagrams according to your latest comments and come back so we can agree on them and start discussing a plan for implementation.
Based on your comments and suggestions, we can have an embedding microservice that works synchronously, performing only two tasks: splitting texts into chunks and generating the embeddings.
The goal is for the sentence-transformer model to run on either a GPU or a CPU. The GPU version can be used for the initial text embedding batch work, while the CPU version can handle daily work embedding queries and texts (if it can keep up with our workload).
The microservice can accept either one text to embed (for queries) or multiple texts for embedding opinions, in order to reduce the number of HTTP requests during batch work. The response will be either a single embedding (for a query) or multiple opinion text chunks and their corresponding embeddings.
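For the chunking part, a naive fixed-size split (350 words, as Sergei used) could be the starting point; a sketch:

```python
# Naive 350-word chunker as a starting point; the chunk size and whether we want
# overlap between chunks are still open questions.
def split_into_chunks(text: str, words_per_chunk: int = 350) -> list[str]:
    words = text.split()
    return [
        " ".join(words[i : i + words_per_chunk])
        for i in range(0, len(words), words_per_chunk)
    ]
```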
The general idea is that this microservice can be scaled horizontally according to available resources. However, I have some comments/questions to ensure the microservice scales properly and operates without bottlenecks:
For the GPU processing, it seems we can set `OMP_NUM_THREADS=1`, which will fix the locking issue, but this configuration probably won't take full advantage of multiple GPUs in the pod. This would require some testing.
Given this, probably the simplest solution to start with is to set 1 uvicorn worker per microservice instance with one GPU allocated, and let the microservice scale horizontally with a load balancer in front of the instances.
For the CPU embedding processing, similar questions arise. It seems that we'll have the same issue of each worker trying to load the model into memory. If so, we can apply the same configuration described above of 1 worker per microservice and scale it horizontally. This will require allocating the right amount of RAM to load the model and do the processing work. We'd also need to decide the number of vCPUs to assign to the pod.
The alternative seems to be doing something to make the model shared in memory so it's available for all workers, and multiple CPUs can do embedding work using the same model in memory.
For the initial batch work, the embedding generation/indexing architecture will look like this:
We'll have a Django command that pulls Opinion texts from the database within a Celery task. It can retrieve multiple Opinion texts at once, allowing us to request embeddings for many Opinion texts in a single request, thus saving on HTTP requests. We'll need to determine an optimal number of texts per request.
Considering our microservice instances will have a single uvicorn worker, we'll need to ensure we don't send them more tasks than they can handle. We can solve this by setting up an equivalent number of Celery workers and using throttling based on the queue size.
The microservice will return a JSON response containing the `opinion_id`, text chunks, and the chunk embeddings for each Opinion. This JSON will be stored in S3 by the same Celery task or a different one.
We can then have a separate Django command that pulls the Opinion chunks and embeddings from S3 and indexes them into ES. Having this in a separate command and task will help us take advantage of ES bulk updates, allowing us to index many opinion embeddings in a single request. This can involve a different number of opinions than requested for embeddings, according to ES load, so it can be throttled at a different rate.
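As a rough sketch of that indexing command (the bucket, key layout, index, and field names are assumptions; the real version would go through our ES document classes):

```python
# Sketch: pull embedding JSON files from S3 and bulk-index them into Elasticsearch.
# Bucket name, key layout, index name, and field names are all assumptions.
import json

import boto3
from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk

s3 = boto3.client("s3")
es = Elasticsearch("http://localhost:9200")


def actions_for(bucket: str, keys: list[str]):
    for key in keys:
        payload = json.loads(s3.get_object(Bucket=bucket, Key=key)["Body"].read())
        yield {
            "_op_type": "update",
            "_index": "opinions",
            "_id": payload["opinion_id"],
            "doc": {"embeddings": payload["embeddings"]},
        }


# One bulk call can cover many opinions; the number per call can be throttled
# independently of the embedding requests.
bulk(es, actions_for("opinion-embeddings", ["opinion/1.json", "opinion/2.json"]))
```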
Day to day work
After the initial work, for day-to-day operations the idea is to integrate Opinion text embedding generation before the ES indexing work on the ES Signal processor. So when there's a new opinion where the text is not empty, or the text field changes, the embedding generation task will be included in the chain. The microservice will return the chunks and embeddings, which will be stored in S3 and indexed into ES in the following chained task.
For text query embeddings, on every case law search request, the text query will be sent to the microservice for its embedding. The returned embedding will then be used to generate the ES semantic search request.
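In code, the search-time step might look roughly like this (the service URL, request/response shape, and knn parameters are assumptions):

```python
# Sketch of the search-time flow: embed the query via the microservice, then use
# the vector in an ES knn query. URL, index, and field names are assumptions.
import requests


def semantic_search(es, query_text: str) -> dict:
    response = requests.post(
        "http://embedding-service/embed",
        json={"type": "single_text", "content": query_text},
        timeout=2,
    )
    query_vector = response.json()["embedding"]
    return es.search(
        index="opinions",
        knn={
            "field": "embeddings.embedding",
            "query_vector": query_vector,
            "k": 20,
            "num_candidates": 100,
        },
    )
```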
One thing to note is that we will need to prioritize text query embeddings over opinion text embeddings, as text queries will be used at search time. To accomplish this, we could have a namespace in Kubernetes where we run Opinion text embedding pods equivalent to the number of Celery workers. We could then have a different namespace with pods specific for text query embeddings. So they're prioritized over Opinion text embedding work which can take longer.
Let me know what you think.
probably the simplest solution to start with is to set 1 uvicorn worker per microservice instance with one GPU allocated and let the microservice scale horizontally while using a load balancer in front of instances.
Yes, sounds right to me.
The alternative seems to be doing something to make the model shared in memory so it's available for all workers, and multiple CPUs can do embedding work using the same model in memory.
Memory tends to be the thing we run out of and that we pay more for, so it's probably worth seeing how hard this is. But even if we find a way to have, say, 4 or 8 CPUs configured for a pod while only loading the model once, I don't think there's a way to auto-scale that except by adding another 4 or 8 CPUs at a time, so that doesn't really work unless we're tuning the CPU allocation by hand.
@legaltextai, do you know how much memory this model uses? I don't entirely understand your stats on that above?
We'll need to determine an optimal number of texts per request.
I'd guess this is less about how many opinions to do at once and more about how long those opinions are.
One thing to note is that we will need to prioritize text query embeddings over opinion text embeddings, as text queries will be used at search time.
I think once we're using the CPUs for this, k8s will scale things nicely for us. We just have to maintain enough overhead in our k8s configuration for the deployment such that when opinions are scraped, user queries still have responsive pods — I think!
Overall, I think we've got a plan here though, thanks. Alberto, do you want to write out the steps that we'd want to take for this?
as i understand, the rough calculation for memory requirements goes smth like this: 109M (model size in our case) x 64 bit parameters (8 bytes) (for our model) x 1.5 x some overhead (20%?) -> ~ 2.7gb?
Memory tends to be the thing we run out of and that we pay more for, so it's probably worth seeing how hard this is. But even if we find a way to have, say, 4 or 8 CPUs configured for a pod while only loading the model once, I don't think there's a way to auto-scale that except by adding another 4 or 8 CPUs at a time, so that doesn't really work unless we're tuning the CPU allocation by hand.
I see, yeah, scaling a pod with multiple CPUs might not be as efficient as we want. If we figure out how to share the model in memory across workers, maybe we can start with pods with a small number of CPUs? Let's say 3: one for regular work and two for embedding. At least it'll be better than having a pod that can only process one request at a time with the whole model loaded into memory.
Overall, I think we've got a plan here though, thanks. Alberto, do you want to write out the steps that we'd want to take for this?
Of course, I'll describe the steps/parts that we need to build so you can decide how they should be prioritized and assigned. We can also create independent issues for them.
This can be divided into 3 tasks.
Consider:
Validate the request body and handle it according to the request type. I'd suggest:
Query embeddings request body:

```json
{
    "type": "single_text",
    "content": "This is a query"
}
```

Opinion lists request:

```json
{
    "type": "opinions_list",
    "content": [
        {
            "opinion_id": 1,
            "text": "Lorem"
        },
        ...
    ]
}
```

Query embeddings response:

```json
{
    "type": "single_text",
    "embedding": [12123, 23232, 43545]
}
```

Opinion lists response:

```json
{
    "type": "opinions_list",
    "embeddings": [
        {
            "opinion_id": 1,
            "chunks": [
                {
                    "chunk": "Lorem",
                    "embedding": [12445, ...]
                }
            ]
        },
        ...
    ]
}
```
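A sketch of how those bodies could be validated, assuming we go with FastAPI/Pydantic (the class names are just suggestions):

```python
# Sketch of request models matching the proposed bodies (FastAPI/Pydantic assumed).
from typing import Literal, Union

from pydantic import BaseModel


class QueryRequest(BaseModel):
    type: Literal["single_text"]
    content: str


class OpinionText(BaseModel):
    opinion_id: int
    text: str


class OpinionsListRequest(BaseModel):
    type: Literal["opinions_list"]
    content: list[OpinionText]


# The endpoint can accept Union[QueryRequest, OpinionsListRequest] as its body type
# and branch on the `type` field.
EmbeddingRequest = Union[QueryRequest, OpinionsListRequest]
```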
Error Handling
Review and handle the error types the embedding endpoint can return so we can differentiate between transient errors (e.g., `ConnectionError`) and bad requests, using appropriate HTTP status codes, so we can decide on the client side whether to retry the request or not. For example:

- 400 Bad Request
- 422 Unprocessable Content
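On the client side, that distinction could be used roughly like this (the status-code split follows the suggestion above; the retry mechanics are illustrative):

```python
# Sketch of the client-side retry decision: retry transient failures, not bad requests.
import requests


class RetryableError(Exception):
    """Raised for failures that are safe to retry (e.g., from a Celery task)."""


def request_embeddings(url: str, payload: dict) -> dict:
    try:
        response = requests.post(url, json=payload, timeout=30)
    except requests.ConnectionError as exc:
        raise RetryableError("transient network failure") from exc
    if response.status_code in (400, 422):
        # The payload is malformed; retrying the same request won't help.
        raise ValueError(response.text)
    response.raise_for_status()  # other errors (e.g., 5xx) bubble up and can be retried
    return response.json()
```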
Authentication
Determine if the microservice requires authentication. If it's for internal use only, authentication can be omitted (similar to Doctor).
This task consists of implementing the Django command to perform the initial batch work. It basically has two parts:
This is related to your comment:
I'd guess this is less about how many opinions to do at once and more about how long those opinions are.
I'm thinking the command can simply iterate over all the Opinion pks in the DB, so we only need to send a chunk of pks to the Celery task that will do the work:
That way, the task only needs to hold opinion pks. However, the disadvantage is that some requests will be just a few KB while others can be many MB.
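A rough shape for that loop (the model import, task name, and chunk size are placeholders):

```python
# Sketch of the batch command: iterate over all Opinion pks and fan out chunks of
# pks to a Celery task. The model import, task name, and chunk size are placeholders.
from cl.search.models import Opinion  # assumed location of the Opinion model
from cl.embeddings.tasks import embed_opinions  # hypothetical Celery task


def handle(chunk_size: int = 100) -> None:
    chunk: list[int] = []
    pks = Opinion.objects.values_list("pk", flat=True).order_by("pk").iterator()
    for pk in pks:
        chunk.append(pk)
        if len(chunk) == chunk_size:
            # Each task pulls the texts for its pks, calls the embedding
            # microservice, and stores the resulting JSON in S3.
            embed_opinions.delay(chunk)
            chunk = []
    if chunk:
        embed_opinions.delay(chunk)
```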
This can be done after the previous command exists, since it can reuse the same Celery task.
The last piece is to create a method that can be called within the Case Law semantic search as a step before sending the query to ES. It'll be as simple as calling the embedding microservice synchronously and using the response to build the semantic ES query. This method should be called from both the frontend and the Search API if we're also considering making semantic search available via the API.
Let me know what you think.
This sounds great to me. @legaltextai, do you think you can split this off into smaller issues and start tackling each of these with Alberto's help?
i can share the script that takes opinion_id + opinion_text from our postgres, splits decisions into 350-word chunks, embeds them, and sends them to s3 storage as opinion_id_number_of_chunk_model_name. i don't have access to our s3, so i'll leave those destination fields blank. thinking out loud here, do we really need an api for this task?
Well, the API for doing query vectorizing will be almost identical to the one doing text vectorization, and we'll need that, so it seems like making it is the right thing to do.
This sounds great to me. @legaltextai, do you think you can split this off into smaller issues and start tackling each of these with Alberto's help?
In terms of splitting into tasks, these are my ideas:
@mlissner @legaltextai As we agreed, here we can discuss the architecture for the microservice to generate the embeddings required for semantic search.
From my understanding, we'd require two services for processing embeddings:

- Synchronous: to generate query embeddings so they can be used at search time.
- Asynchronous: to generate opinion text embeddings.
Correct me if I'm wrong, but based on my reading about embedding generation, GPUs are fast and efficient for processing large batches of embeddings because they can leverage SIMD processing. However, processing small batches or a single embedding at a time may not be an efficient use of GPU resources.
If that’s correct, we need to determine the ideal batch size to efficiently utilize GPU resources. Some benchmarking may be required for this.
Here are some numbers for the model we're going to use: https://www.sbert.net/docs/sentence_transformer/pretrained_models.html
According to this, the throughput on a V100 GPU is 4,000 queries per second, while on a CPU, it's 170 queries per second.
I assume the GPU throughput refers to batch processing, while the CPU throughput refers to single-thread processing. If that's the case, then unless we expect a high volume of search queries (enough to accumulate batches of queries, depending on the batching threshold we set, while still processing them in real time), using a GPU for real-time search could be inefficient.
Therefore, an alternative could be to run the synchronous embedding service on CPUs instead of GPUs. This may be fast enough for generating query embeddings by utilizing multi-core CPU instances.
The general architecture for this service would look like this:
One advantage is that this service can be scaled horizontally.
Asynchronous service: to generate opinion text embeddings.
For processing opinion text embeddings, we can take advantage of GPU batch processing. We'll still need to determine the ideal batch size, but considering the large volume, holding the opinion texts in memory and sending them over the network to the async embedding generation service might be inefficient.
An alternative approach could be to have a Django command that extracts the texts from the database that need embedding generation and stores them in a batch JSON file on S3. Initially, we can retrieve all opinion texts for the first generation; afterwards, based on the frequency at which we want to keep the embeddings in sync with the database, we can use the `date_created` field to extract only new opinion texts. We can also use `pg-history` tables to identify opinions where the text has changed and send those for embedding updates.
Then, the batch file IDs can be sent to the service, which will store them in a Redis queue for processing control. When the Celery queue is small enough, it can pull out a batch file ID, download the texts from S3, split them into chunks, hold them in memory, and send them in batches to the GPU for embedding generation. Afterward, the embeddings can be stored in a separate S3 bucket, where they can be retrieved by another command for indexing into Elasticsearch.
The architecture for this service would look like this:
It'll use a Celery task or chain of tasks to:
I saw that GPU EC2 instances can have multiple GPUs. The idea is to have an equal number of Celery workers, each processing batches in parallel, according to the number of available GPUs.
Some additional questions from my side:
Let me know your thoughts, and if you have any additional questions or suggestions.