Achieved around a 10x performance improvement with these optimizations:
Upgraded to the vLLM engine with the FlashAttention backend for batch inference (initial batch size = 100; larger batches may be faster but risk OOM and yield diminishing returns). See the batching sketch below.
Optimized the regex preprocessing so known entities are detected faster (see the precompiled-pattern sketch below).
Moved post-processing to the GPU instead of the CPU. I initially avoided loading two models onto the GPU, but all-MiniLM-L6-v2, which handles entity alignment, is small enough not to hurt performance noticeably, and running both on the GPU greatly improves speed (see the alignment sketch below).
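A minimal sketch of the batched vLLM setup, assuming a CUDA GPU; the model name is a hypothetical placeholder (substitute the one actually used), and the VLLM_ATTENTION_BACKEND override is an assumption about the environment (vLLM picks a supported backend on its own if it is unset):

```python
import os

# Assumption: FlashAttention is supported for this GPU/model; must be set
# before vllm is imported, and vLLM falls back to another backend otherwise.
os.environ["VLLM_ATTENTION_BACKEND"] = "FLASH_ATTN"

from vllm import LLM, SamplingParams

# Hypothetical model; substitute the model actually used in the pipeline.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
params = SamplingParams(temperature=0.0, max_tokens=256)

BATCH_SIZE = 100  # higher may be faster but risks OOM and diminishing returns

def run_batched(prompts):
    outputs = []
    # Feed prompts in chunks; vLLM schedules each chunk with continuous batching.
    for i in range(0, len(prompts), BATCH_SIZE):
        batch = prompts[i:i + BATCH_SIZE]
        for out in llm.generate(batch, params):
            outputs.append(out.outputs[0].text)
    return outputs
```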
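For the regex step, the main win is compiling one combined alternation pattern up front instead of scanning the text once per entity. A sketch with a hypothetical entity list (in practice it would be loaded from the known-entity source):

```python
import re

# Hypothetical entity list; load the real known-entity set here.
KNOWN_ENTITIES = ["New York", "OpenAI", "vLLM"]

# Compile once at startup. Sorting longest-first makes overlapping names
# (e.g. "New York City" vs. "New York") match the longer candidate.
ENTITY_RE = re.compile(
    r"\b(?:"
    + "|".join(re.escape(e) for e in sorted(KNOWN_ENTITIES, key=len, reverse=True))
    + r")\b",
    re.IGNORECASE,
)

def find_known_entities(text: str) -> list[str]:
    # One pass over the text instead of len(KNOWN_ENTITIES) passes.
    return ENTITY_RE.findall(text)
```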
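And a sketch of running the all-MiniLM-L6-v2 step on the GPU via sentence-transformers, under the assumption that entity alignment reduces to nearest-neighbor matching over embeddings (the actual alignment logic may differ):

```python
from sentence_transformers import SentenceTransformer, util

# Load the alignment model directly onto the GPU next to the main model;
# all-MiniLM-L6-v2 is small (~22M parameters), so the extra VRAM cost is minor.
aligner = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2", device="cuda")

def align_entities(extracted, candidates):
    # convert_to_tensor=True keeps embeddings on the GPU, avoiding CPU round-trips.
    a = aligner.encode(extracted, convert_to_tensor=True)
    b = aligner.encode(candidates, convert_to_tensor=True)
    sims = util.cos_sim(a, b)   # (len(extracted), len(candidates)) similarities
    best = sims.argmax(dim=1)   # best candidate per extracted entity
    return [(e, candidates[j]) for e, j in zip(extracted, best.tolist())]
```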
New ETA for completing all processing: May 10.