abrichr opened this issue 11 months ago
@FFFiend thoughts? 🙏 😄
I took a look at the `client.py` code as well as the docs, so from my understanding:
Instead of using a Gradio app URL or Hugging Face Space, we would like to create an entry point to the EC2 SoM instance we have available, and return a marked screenshot as the output of `predict`.
Bit confused on how a marked screenshot is defined though 😄
@FFFiend thanks for your quick response!
> Instead of using a gradio app url or HuggingFace space, we would like to create an entry point to the EC2 SoM instance we have available and return a marked screenshot as the output of predict
Exactly right! We would need to integrate `deploy.py` and a variation of `client.py`, which would both be called from elsewhere in OpenAdapt (e.g. `visualize.py`, `replay.py`).
> Bit confused on how a marked screenshot is defined though 😄
No worries! You can see the marked screenshot in the PR description, reproduced here:
The original screenshot is on the left, the marked screenshot is on the right.
awesome, so for `client.py` I'm envisioning the client to work as follows:

- Use `start` and `stop` from the `Deploy` class for instantiating and then closing the instance.
- Use either paramiko (https://www.paramiko.org/) or pexpect (https://pexpect.readthedocs.io/en/stable/) to have a runner for functions within the instance.
- Write up or reuse SoM logic from one of the existing demos (`demo_som.py` for example) into a function and then plug that into the runner above, and inference is done.

The original repo doesn't have any architecture or heavy ML code laid out, so I'm guessing the meat n potatoes is within the demo files, but I could be wrong.
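The start/stop lifecycle in the first step could be wrapped in a context manager so the instance is always shut down, even if inference fails. This is only a sketch: the `Deploy` class here is a stub standing in for the real one in `deploy.py`, whose actual interface may differ.

```python
from contextlib import contextmanager


class Deploy:
    """Stub standing in for the real Deploy class in deploy.py,
    which would manage the EC2 SoM instance."""

    def __init__(self) -> None:
        self.running = False

    def start(self) -> str:
        # Real implementation: launch the EC2 instance and return
        # the URL where the SoM app is served (URL is illustrative).
        self.running = True
        return "http://som-instance.example:6092"

    def stop(self) -> None:
        # Real implementation: terminate the EC2 instance.
        self.running = False


@contextmanager
def som_instance():
    """Start the instance, yield its URL, and guarantee cleanup."""
    deploy = Deploy()
    url = deploy.start()
    try:
        yield url
    finally:
        deploy.stop()
```

Callers would then do `with som_instance() as url: ...` and run inference against `url`, without having to remember to call `stop` themselves.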
@FFFiend thanks for your patience! Just saw this 😅
> - Use start and stop from the Deploy class for instantiating and then closing the instance.
Agreed.
> - Use either paramiko https://www.paramiko.org/ or https://pexpect.readthedocs.io/en/stable/ to have a runner for functions within the instance.
This may be unnecessary. https://github.com/microsoft/SoM includes a `client.py` which uses the Gradio API -- our `client.py` should look similar.
> - Write up or reuse SOM logic from one of the existing demos (demo_som.py for example) into a function and then plug that into the runner above, and inference is done.
Bingo! This should all go in `client.py`.
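As a rough illustration of the Gradio-based approach mentioned above, the client could use the `gradio_client` package to call the remote app. The `api_name` and server URL here are assumptions for illustration, not values taken from the actual repo; check what the deployed app exposes.

```python
def mark_screenshot(server_url: str, image_path: str) -> str:
    """Send a screenshot to a remote SoM Gradio app and return the
    marked screenshot it produces.

    server_url would point at the EC2 instance rather than a
    Hugging Face Space; api_name below is a placeholder.
    """
    # Deferred import so the sketch reads standalone
    # (pip install gradio_client).
    from gradio_client import Client

    client = Client(server_url)
    # predict() blocks until inference completes and returns the
    # output declared by the remote Gradio interface.
    return client.predict(image_path, api_name="/predict")
```

Usage would be something like `mark_screenshot("http://<ec2-host>:6092", "screenshot.png")`, with the host filled in from whatever `deploy.py` reports.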
Feature request
To support https://github.com/openadaptai/SoM we need to implement a `client.py` with https://www.gradio.app/docs/client.
Motivation
https://github.com/openadaptai/SoM is state-of-the-art for visual understanding, but it only runs on Linux / CUDA.
Refer to system diagram:
Inference (SoM/SAM) must be done remotely.
We wish to implement:
- `openadapt/adapters/som/client.py`: a modified version of the `client.py` in https://github.com/microsoft/SoM/pull/19, to support getting marked screenshots during analysis (visualization) and replay
- `openadapt/adapters/som/server`, which can be a git submodule containing https://github.com/OpenAdaptAI/SoM/
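If the submodule route is taken, the setup would presumably look like this (the path is taken from the proposal above):

```shell
# Register the fork as a submodule at the proposed server path
git submodule add https://github.com/OpenAdaptAI/SoM/ openadapt/adapters/som/server

# Contributors cloning the parent repo then populate it with:
git submodule update --init --recursive
```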