EtiennePerot / safe-code-execution

Code execution utilities for Open WebUI & Ollama
Apache License 2.0

Tool fails to run Python code after package installation and creates excessive files #15

Open · smuotoe opened this issue 1 month ago

smuotoe commented 1 month ago

Description

When attempting to plot stock prices using the code execution tool, it only runs the bash script for package installation but fails to execute the subsequent Python code. Additionally, it creates an excessive number of files (over 1000) during this process.

Steps to Reproduce

  1. Use the Open WebUI with the code execution function enabled.
  2. Submit the following prompt to the LLM: "plot the stock price of Apple, Tesla, Nvidia and Microsoft YTD"
  3. The LLM typically responds with two code blocks: (a) a bash script to install the required packages, and (b) Python code to create the stock price plot.
  4. Attempt to run the provided code using the "Run code" button at the end of the response.

Expected Behavior

The tool should:

  1. Execute the bash script to install required packages
  2. Run the Python code to create and display the stock price plot
  3. Create only necessary files related to the plot or temporary data

Actual Behavior

The tool:

  1. Executes the bash script for package installation
  2. Creates over 1000 files in the process. (Full disclosure: I increased the max file limit to 2048; otherwise it results in an error.)
  3. Stops execution after the bash script, failing to run the Python code
  4. Does not produce the expected stock price plot.

(Screenshots attached.)

EtiennePerot commented 1 month ago

Thanks for the report. This is an interesting case. There are a few problems to solve before this can really work.

The first and main problem is that the "Run code" button only runs one code block from the LLM's message, never multiple. So in this case it is just running the pip install command, and the Python code never actually runs. If you generated a separate LLM message containing just the Python code, it would run, but it would do so within a different sandbox, which means all the pip packages installed by the earlier pip install command would be absent.

The second problem is the .cache/pip files. These are created by pip, and this is legitimately how pip caches the packages it downloads. It's not creating "excessive files" from its perspective; the sandbox limits just weren't set with pip install in mind, because I wasn't expecting it to run there.

Perhaps the fix for this second problem is to ignore files under .cache in the output file detection and not count them against the per-execution file limit, since they wouldn't be copied out of the sandbox.
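As a rough sketch of that idea (function and directory names here are illustrative, not the tool's actual API), the output-file scan could prune cache directories before counting anything:

```python
import os

# Directories whose contents should never count against the per-execution
# file limit, nor be copied out of the sandbox. ".cache" covers pip's
# download cache; this set is an assumption for illustration.
IGNORED_DIRS = {".cache"}

def collect_output_files(sandbox_root):
    """Yield paths of candidate output files, skipping cache directories."""
    for dirpath, dirnames, filenames in os.walk(sandbox_root):
        # Prune ignored directories in place so os.walk never descends
        # into them, and their files are never counted.
        dirnames[:] = [d for d in dirnames if d not in IGNORED_DIRS]
        for name in filenames:
            yield os.path.join(dirpath, name)
```

With this, a sandbox containing `.cache/pip/` full of wheels plus a single `plot.png` would report only the plot as output.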

This doesn't solve the first problem, though. Perhaps the function should be updated to run all code blocks of a message sequentially, rather than just one. The problem is that some LLMs generate other, unrelated code blocks, such as example output or an example of how to run the script after having written it. It would be bad if those blocks were to run. Let me know if you can think of a better way to select which blocks to run from a message.

(Then there's a third problem: even if all of the above worked well, there's no support for showing images yet (only text files), so the stock chart would still not show up. That one should be relatively easy to fix, though.)

smuotoe commented 1 month ago

Thanks for your detailed response.

Your suggestion to ignore .cache files is reasonable and should fix the "excess files" issue. Code execution is indeed a complex issue; a solution that comes to mind would be to have "Run code" buttons on each generated code block to allow the user to decide which code block to execute. In addition, every message within a thread should share a sandbox.

While the implementation may be non-trivial, I think these fixes are worth pursuing since most (Python) use cases would likely involve package installation.

updawg commented 1 month ago


Maybe an additional "run with pre-reqs" button, where the user can provide the pip packages to be installed in the sandbox before running the code?

EtiennePerot commented 1 month ago

A few more thoughts on this issue.

> Maybe an additional "run with pre-reqs" button, where the user can provide the pip packages to be installed in the sandbox before running the code?

This solution doesn't seem great to me, because Open WebUI (at least currently) has no good way of displaying multiple buttons with different appearances. If there were two buttons, they would just show up as two identical-looking sparkle buttons under each message, and the only way to tell them apart would be to hover over each one to find out what it does. Perhaps a future Open WebUI enhancement will solve this, but even so it doesn't seem like a great user experience.

> a solution that comes to mind would be to have "Run code" buttons on each generated code block to allow the user to decide which code block to execute. In addition, every message within a thread should share a sandbox.

The other annoying Open WebUI limitation is that there is no way for the user to specify which code blocks within a single message to run; that's simply not information Open WebUI transmits to functions. Actions only apply to whole messages.

I believe it should be possible to improve the user experience without adding UI complexity by improving the heuristics used to determine which code blocks to run. Specifically, it should be fairly easy to detect the case where there is one Bash (or shell-like) code block containing the substring "pip install", followed by one Python code block. I believe this is a common enough case that adding special support for it is worthwhile. In that case, the sandbox can run both code blocks in sequence.
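A minimal sketch of that heuristic, assuming the message text still contains its markdown fences (the regex, language tags, and return shape are all assumptions for illustration, not the function's real parsing code):

```python
import re

# Naive fence parser: captures the language tag and body of each
# triple-backtick block. Real markdown parsing is more involved.
FENCE_RE = re.compile(r"```(\w*)\n(.*?)```", re.DOTALL)

def plan_execution(message_text):
    """Return the list of (language, code) blocks to run, or None."""
    blocks = FENCE_RE.findall(message_text)
    if (
        len(blocks) == 2
        and blocks[0][0] in ("bash", "sh", "shell", "")
        and "pip install" in blocks[0][1]
        and blocks[1][0] in ("python", "py")
    ):
        return list(blocks)  # the special case: run both in one sandbox
    if len(blocks) == 1:
        return list(blocks)  # current behavior: run the single block
    return None  # ambiguous multi-block message; don't guess
```

The key design point is that the two-block case only triggers on the exact "shell install, then Python" shape, so unrelated example-output blocks in other messages never run.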

Another problem users may run into here is the time it takes to download and install pip packages, which can exceed the default code execution duration limit of 30 seconds. Given the prevalence of this use case, I also think a customizable valve containing a comma-separated list of Python packages to pre-install would be useful. Such packages could be pre-installed in a per-user Python environment (initialized only once, not once per message), and would install under a separate sandbox with a much longer timeout (~30 minutes?) before any message-specific code runs.
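For illustration, the valve parsing and the one-time install command might look like the following (the valve name, `--user` flag choice, and helper names are assumptions, not the tool's actual configuration):

```python
import shlex

def parse_preinstall_valve(valve_value):
    """Split a comma-separated package list into clean package names."""
    return [pkg.strip() for pkg in valve_value.split(",") if pkg.strip()]

def build_install_command(packages):
    """Build the pip command that would run once, under the long timeout."""
    # shlex.quote guards against shell-special characters in package specs.
    return "pip install --user " + " ".join(shlex.quote(p) for p in packages)
```

For example, a valve value of `"yfinance, matplotlib, pandas"` would yield a single `pip install --user yfinance matplotlib pandas` run in the per-user environment, before any per-message sandbox starts.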

My sense is that this solution would cover 80% of the "I want to run multiple code blocks" use cases without requiring any Open WebUI changes and without adding UI complexity. As the remaining 20% of cases emerge, we could probably add similar heuristics to cover them.

EtiennePerot commented 1 month ago

An update on this issue: two large commits (linked above) are now submitted. Neither of them fully solves this problem, but both are prerequisites for solving it.