fredajiang opened this issue 2 months ago
How can the execution speed of the OCR, Grounding DINO, and GPT-4o models be improved so that Mobile-Agent can move from laboratory research to engineering use?

Hello. As you said, both the OCR model and Grounding DINO can be loaded on the GPU; for the OCR model, you need to install the corresponding version of tensorflow-gpu. There is currently no good way to reduce the call latency of the VLM itself. However, when deploying projects, we often use an agent-parallel approach: for example, the reflection agent can run concurrently with the planning agent of the next stage. If the reflection agent concludes that the last operation was correct, the latency of the planning call is hidden. And if you can accept some decline in model performance, gpt-4o-mini is a good choice.
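The agent-parallel idea described above can be sketched as speculative execution: launch the next-stage planning call while reflection is still running, and keep its result only if reflection approves. A minimal sketch with `concurrent.futures` follows; `reflection_agent`, `planning_agent`, and the dictionary shapes are hypothetical stand-ins for the real VLM calls, not the project's actual API.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for the real VLM-backed agents (names and
# argument shapes are assumptions for illustration only).
def reflection_agent(last_action):
    # Would call the VLM to judge whether the last operation succeeded.
    return last_action["expected"] == last_action["observed"]

def planning_agent(state):
    # Would call the VLM to plan the next operation.
    return {"action": "tap", "target": state["next_target"]}

def step(state, last_action):
    with ThreadPoolExecutor(max_workers=2) as pool:
        # Run reflection and next-stage planning concurrently, so the
        # planning call's latency is hidden behind the reflection call.
        reflect_future = pool.submit(reflection_agent, last_action)
        plan_future = pool.submit(planning_agent, state)
        if reflect_future.result():
            # Reflection approved the last operation: the speculatively
            # computed plan can be used immediately, saving one round trip.
            return plan_future.result()
        # Otherwise discard the speculative plan and re-plan after recovery.
        plan_future.cancel()
        return None
```

The trade-off is one wasted planning call whenever reflection rejects the operation; since reflection usually approves, the expected latency per step still drops.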