OthersideAI / self-operating-computer

A framework to enable multimodal models to operate a computer.
https://www.hyperwriteai.com/self-operating-computer
MIT License
8.86k stars, 1.19k forks

Proposal: Transitioning from Chrome-Exclusive to Universal Browser Compatibility #60

Open centopw opened 11 months ago

centopw commented 11 months ago

Problem

Currently, the application is prompted to use Google Chrome by default, which limits accessibility and degrades the experience for people who use other browsers. This Chrome-only approach excludes a significant user base and makes it harder for the platform to adapt to diverse browser environments.

Proposal

This issue advocates for a transition from Chrome-centric development to a more inclusive approach that supports a broader range of web browsers. The goal is to enhance accessibility, improve user experience, and adhere to web standards that promote compatibility across different platforms.

Proposed Changes

While testing, I realized that on macOS you can open your default browser just by typing the following into the search bar:

browser

So instead of searching for Google Chrome, you can search for "browser" and press Enter, which opens the default browser without requiring the user to have Google Chrome. Since most browsers have the search bar in the same location, the default setting for it can still be used.

michaelhhogue commented 11 months ago

@centopw Thanks for this proposed change. It's interesting to see that you can just open the default browser by searching for "browser" in Mac OS. Do you have any ideas on how the default browser could be opened on Windows and Linux? I've tested just searching for "browser" on my Linux distro and it doesn't find the default.

centopw commented 11 months ago

Issue Description

When searching for browsers on different Linux distros, the current behavior is as follows:

Ubuntu 22.04.3

Kali Linux 2023.3

Proposed Changes

Two potential solutions have been considered:

  1. Script Improvement (PR #19): Enhance the existing scripts to prompt the user for their default browser choice and update the main.py with the selected browser.

  2. Update main.py: Modify main.py to prompt the user to select the default browser every time it runs.

Pros & Cons

Both options offer improved accuracy. Each has trade-offs:

  1. Option 1:

    • Pros: Users can set their preferred default browser with the updated scripts.
    • Cons: Users must run the additional script (#19) for installation; otherwise, it defaults to Google Chrome.
  2. Option 2:

    • Pros: User flexibility in selecting the default browser each time.
    • Cons: Users are required to input their default browser choice with every run.
centopw commented 11 months ago

With this proposal in mind, I have drafted a simple update for main.py, shown below:

    # Ask the user for their default browser
    default_browser = prompt(
        "Please enter your default browser (e.g., Chrome, Firefox): "
    )

    # Adjust the behavior based on the user's default browser
    if default_browser.lower() == "chrome":
        browser_prompt = "Google Chrome"
        browser_address_bar = {"x": "50%", "y": "9%"}
    elif default_browser.lower() == "firefox":
        browser_prompt = "Mozilla Firefox"
        browser_address_bar = {"x": "50%", "y": "10%"}
    else:
        # Default to Chrome behavior if the input is unknown
        browser_prompt = "Google Chrome"
        browser_address_bar = {"x": "50%", "y": "9%"}

    message_dialog(
        title="Self-Operating Computer",
        text=f"Ask a computer to do anything. Default browser set to {browser_prompt}.",
        style=style,
    ).run()

    print("SYSTEM", platform.system())

    # Update the prompts based on the chosen/default browser
    VISION_PROMPT = f"""
    You are a Self-Operating Computer. You use {browser_prompt} as your default browser.

    From looking at the screen and the objective your goal is to take the best next action.

    To operate the computer you have the four options below.

    1. CLICK - Move mouse and click
    2. TYPE - Type on the keyboard
    3. SEARCH - Search for a program on {browser_prompt} and open it
    4. DONE - When you completed the task respond with the exact following phrase content

    Here are the response formats below.

    1. CLICK
    Response: CLICK {{ "x": "percent", "y": "percent", "description": "~description here~", "reason": "~reason here~" }}

    2. TYPE
    Response: TYPE "value you want to type"

    3. SEARCH
    Response: SEARCH "app you want to search for on {browser_prompt}"

    4. DONE
    Response: DONE

    Here are examples of how to respond.
    ...
    """
centopw commented 11 months ago

Also, instead of asking the user to type the name out, we could add a menu that lets the user pick from a predefined selection of browsers.
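
A minimal sketch of such a menu, using only the standard library. The browser list, the `select_browser` name, and the injectable `read` parameter are assumptions made for illustration; the project could just as well use a dialog from its existing UI library.

```python
from typing import Callable, Sequence

# Hypothetical predefined selection; not taken from the project.
BROWSER_CHOICES = ("Google Chrome", "Mozilla Firefox", "Safari", "Microsoft Edge")


def select_browser(
    choices: Sequence[str] = BROWSER_CHOICES,
    read: Callable[[str], str] = input,
    default: str = "Google Chrome",
) -> str:
    """Show a numbered menu and return the chosen browser name.

    Falls back to `default` when the answer is empty or not a valid number.
    """
    for index, name in enumerate(choices, start=1):
        print(f"{index}. {name}")
    answer = read("Select your default browser by number: ").strip()
    if answer.isdecimal() and 1 <= int(answer) <= len(choices):
        return choices[int(answer) - 1]
    return default
```

Passing `read` as a parameter keeps the function easy to exercise without a real keyboard, for example `select_browser(read=lambda _: "2")`.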

michaelhhogue commented 11 months ago

@centopw Interesting. I think the ideal solution would be to just automatically detect the default browser if possible. On Windows, I'm pretty sure this can just be read from the registry using OpenKey. For Linux, this would probably be found in xdg-settings. I'm not sure about Mac OS. It would probably require some special permissions to access that system setting. If no default browser was found, it could just default to searching for "browser" or something. What do you think about this approach?
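
For the Windows side, a rough sketch of the registry lookup might look like the following. The key path, the `ProgId` value name, and the ProgId prefixes in the table are assumptions about a typical Windows setup, and the function names are made up for illustration; none of this has been checked against the project.

```python
import platform
from typing import Optional

# Assumed registry location of the per-user handler for https links.
USER_CHOICE_KEY = (
    r"Software\Microsoft\Windows\Shell\Associations"
    r"\UrlAssociations\https\UserChoice"
)

# Illustrative ProgId prefixes; real systems may report other values.
PROG_ID_NAMES = {
    "ChromeHTML": "Google Chrome",
    "FirefoxURL": "Mozilla Firefox",
    "MSEdgeHTM": "Microsoft Edge",
}


def prog_id_to_name(prog_id: str) -> Optional[str]:
    """Map a ProgId such as 'FirefoxURL-308046B0AF4A39CB' to a display name."""
    for prefix, name in PROG_ID_NAMES.items():
        if prog_id.startswith(prefix):
            return name
    return None


def get_default_browser_windows() -> Optional[str]:
    """Return the default browser's display name, or None if it is unknown."""
    if platform.system() != "Windows":
        return None
    import winreg  # only available on Windows

    try:
        with winreg.OpenKey(winreg.HKEY_CURRENT_USER, USER_CHOICE_KEY) as key:
            prog_id, _ = winreg.QueryValueEx(key, "ProgId")
    except OSError:
        return None
    return prog_id_to_name(prog_id)
```

Returning `None` for anything unrecognized leaves room for the fallback of searching for "browser".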

centopw commented 11 months ago

If you want to go with a terminal approach, we could simply open any website from the terminal, for example:

Running this command in the terminal automatically opens the site in the default browser on each system. One more benefit I see is that, since it always opens the google.com website, we can define where the search box is and avoid misclicks even more.

Screenshot 2023-12-02 at 5 08 57 PM
michaelhhogue commented 11 months ago

@centopw That's an interesting approach. However, the project is aiming more towards only giving the model control over the OS via mouse movements, mouse clicks, key-presses, and search operations (from key-presses). Running xdg-open, start, or open from the code itself would violate that vision (restricting the model to only have the same inputs to the OS as a human: mouse and keyboard).

So, having the model open a terminal and run xdg-open using only the cursor and key-presses would be a valid operation (although not very practical). Running xdg-open from the python code itself wouldn't be valid. Hope that makes sense.

The program should probably follow this order of operations:

Get name of the user's default browser (either manually or automatically) -> Give default browser name to model in prompt -> Model references default browser name to be included in the search action.
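
That order of operations could be sketched roughly as below. The function names, the detector list, and the exact wording of the instruction are hypothetical; the fallback term "browser" is the one suggested earlier in this thread.

```python
from typing import Callable, Iterable, Optional

FALLBACK_SEARCH_TERM = "browser"  # generic term suggested earlier in the thread


def resolve_browser_name(
    detectors: Iterable[Callable[[], Optional[str]]],
    fallback: str = FALLBACK_SEARCH_TERM,
) -> str:
    """Return the first non-empty name reported by a detector, else the fallback."""
    for detect in detectors:
        try:
            name = detect()
        except Exception:
            continue  # a failing detector should not stop the program
        if name:
            return name
    return fallback


def browser_instruction(browser_name: str) -> str:
    """Build the sentence that tells the model which app to SEARCH for."""
    return f'To browse the internet, respond with SEARCH "{browser_name}".'
```

The Python code only reads a name and places it in the prompt; opening the browser is still left to the model's own search action.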

michaelhhogue commented 11 months ago

@centopw I am going to try out your install script in #19 and see how it works.

centopw commented 11 months ago

@michaelhhogue Then how about this? I don't work with Windows much, so this draft only works on Mac (using webbrowser) and Linux (using xdg-settings):

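import subprocess
import webbrowser

def get_default_browser_macos():
    # Name of the controller webbrowser selects; this may be a generic label rather than the browser's real name
    return webbrowser.get().name

def get_default_browser_linux():
    # Ask xdg-settings for the default browser's desktop entry, e.g. "firefox.desktop"
    result = subprocess.run(
        ["xdg-settings", "get", "default-web-browser"],
        stdout=subprocess.PIPE, text=True
    )
    return result.stdout.strip()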
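
One thing to keep in mind is that xdg-settings reports a desktop entry ID such as "firefox.desktop" rather than a name someone would type into a launcher. A small sketch of a translation step, where the mapping table and the `desktop_id_to_name` helper are assumptions for illustration and deliberately incomplete:

```python
# Illustrative mapping from desktop entry IDs to display names; incomplete by design.
DESKTOP_ID_NAMES = {
    "firefox": "Mozilla Firefox",
    "google-chrome": "Google Chrome",
    "chromium": "Chromium",
    "chromium-browser": "Chromium",
    "brave-browser": "Brave",
}


def desktop_id_to_name(desktop_id: str) -> str:
    """Turn 'firefox.desktop' into a name a human would type into a launcher."""
    stem = desktop_id.strip()
    if stem.endswith(".desktop"):
        stem = stem[: -len(".desktop")]
    if stem in DESKTOP_ID_NAMES:
        return DESKTOP_ID_NAMES[stem]
    # Unknown ID: fall back to a readable guess, e.g. 'my-browser' -> 'My Browser'.
    return stem.replace("-", " ").replace("_", " ").title()
```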
michaelhhogue commented 11 months ago

@centopw I'll test this out as well and get back with you.

Kreijstal commented 11 months ago

What if browser is already open?

centopw commented 11 months ago

@Kreijstal For now, I don't think the browser already being open affects anything. But that is an interesting idea; I will play around with it and let you know.

michaelhhogue commented 11 months ago

@centopw Just noting here that I haven't yet tested any default browser checking. Want to first see what happens with #19.

joshbickett commented 11 months ago

I originally hacked in Google Chrome as the default, but I agree we've outgrown this. Chrome is 70% of the market, if I understand correctly, though. Would it make sense to "check for Chrome" and, if it isn't found, search for "browser" as shown above?

- Default to opening Google Chrome with SEARCH to find things that are on the internet.
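
A best-effort sketch of that "check for Chrome, otherwise search for browser" idea. The install path, the executable names, and both function names are assumptions, and Windows is left out of this sketch entirely.

```python
import os
import platform
import shutil

# Assumed install location and executable names; adjust per system.
MACOS_CHROME_APP = "/Applications/Google Chrome.app"
LINUX_CHROME_COMMANDS = ("google-chrome", "google-chrome-stable")


def chrome_is_installed() -> bool:
    """Best-effort check for Google Chrome on macOS and Linux."""
    system = platform.system()
    if system == "Darwin":
        return os.path.exists(MACOS_CHROME_APP)
    if system == "Linux":
        return any(shutil.which(cmd) for cmd in LINUX_CHROME_COMMANDS)
    return False  # other platforms are not covered by this sketch


def search_term(chrome_found: bool) -> str:
    """Pick what the model should SEARCH for to open a browser."""
    return "Google Chrome" if chrome_found else "browser"
```

The result of `search_term(chrome_is_installed())` could then be substituted into the prompt line above in place of the hard-coded browser name.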
joshbickett commented 11 months ago

I lean away from asking the user additional questions if possible, but I'm curious what the community thinks.