X-PLUG / MobileAgent

Mobile-Agent: The Powerful Mobile Device Operation Assistant Family
https://arxiv.org/abs/2406.01014
MIT License
2.68k stars 245 forks source link

How to improve the execution speed of OCR, grounding-dino, and chatgpt-4o models to transition mobile-agent from laboratory research to engineering use? #50

Open fredajiang opened 2 weeks ago

fredajiang commented 2 weeks ago

How to improve the execution speed of OCR, grounding-dino, and chatgpt-4o models to transition mobile-agent from laboratory research to engineering use?

  1. I replaced the original grounding-dino model with a GPU-supported version, reducing the time required from about 7 seconds to just 0.2 seconds. For more details on the GPU version of grounding-dino, please refer to the link: https://github.com/IDEA-Research/GroundingDINO
  2. For the OCR model, is there a similarly faster GPU-supported version? Currently, each OCR operation takes approximately 3 seconds.
  3. For calling chatgpt-4o, do you have any suggestions for improving its execution speed? At present, each call to chatgpt-4o takes approximately 6-7 seconds. Looking forward to your response.
junyangwang0410 commented 2 weeks ago

Hello. As you said, both the OCR model and GroundingDino can be loaded via GPU. For the OCR model, you need to install the corresponding version of tensorflow-gpu. There is currently no suitable method to reduce the call latency for VLLM. However, when deploying projects, we often use the agent parallel solution. For example, the reflection agent can run concurrently with the planning agent in the next stage. If the reflection agent believes that the operation is correct, the time of calling the planning agent can be reduced. Of course, if you can accept the decline in model performance, gpt-4o-mini is a good choice.