Sinaptik-AI / pandas-ai

Chat with your database (SQL, CSV, pandas, polars, mongodb, noSQL, etc). PandasAI makes data analysis conversational using LLMs (GPT 3.5 / 4, Anthropic, VertexAI) and RAG.
https://pandas-ai.com

Disable to create folders for exports and caches #654

Closed deltamod3 closed 10 months ago

deltamod3 commented 10 months ago

System Info

OS version: Ubuntu 20.04
Python version: Python 3.9.1
Pandasai version: 1.3.3

🐛 Describe the bug

I tried to deploy a Python script that uses pandasai to an AWS Lambda function. Lambda does not allow creating folders such as exports and cache anywhere except under /tmp. So I tried disabling save_charts and enable_cache, but the folders are still created every time. I investigated the code and found this:

https://github.com/gventuri/pandas-ai/blob/main/pandasai/smart_datalake/__init__.py#L77

    def initialize(self):
        """Initialize the SmartDatalake"""

        # Create exports/charts folder if it doesn't exist
        try:
            charts_dir = os.path.join((find_project_root()), "exports", "charts")
        except ValueError:
            charts_dir = os.path.join(os.getcwd(), "exports", "charts")
        os.makedirs(charts_dir, mode=0o777, exist_ok=True)

        # Create /cache folder if it doesn't exist
        try:
            cache_dir = os.path.join((find_project_root()), "cache")
        except ValueError:
            cache_dir = os.path.join(os.getcwd(), "cache")
        os.makedirs(cache_dir, mode=0o777, exist_ok=True)

This always runs when a new SmartDatalake object is created, which makes no sense. If save_charts and enable_cache are disabled, it shouldn't create these folders. Am I right?

Thank you.

nautics889 commented 10 months ago

Hello, @deltamod3. Thank you for the investigation.


Since the initialize() method is called from SmartDatalake's __init__() on every instantiation without any condition, it looks like those directories will appear even when save_charts is False (disabled). I haven't checked yet, but I guess the same logic applies to caching as well. I don't like it, and I agree it looks redundant. I guess it hasn't been noticed so far because people mostly use PandasAI on a regular file system rather than on Lambda, for example.


Another important point. You mentioned that you are about to deploy pandasai on AWS Lambda, so perhaps you are building an application with PandasAI to be integrated into some service. I should warn you that there are some crucial points to assess before implementing this design, regarding the use of PandasAI in production. There are potential vulnerabilities that a client can exploit, generally in the form of "asking the AI to return malicious code". Please check out this issue: #550. The most comprehensive solution for such things (as it seems to me) is to use some virtualization or containerization mechanism. Docker, for example, is being considered as an appropriate tool to integrate into PandasAI, but it's not implemented yet. You can follow the corresponding discussion here, if you're interested. For me, this approach (PandasAI in a Docker container) would fit a classic AWS EC2 instance perfectly. However, it looks like there are such things as container images for Lambda. I don't know, I have no experience deploying one, unfortunately.

gventuri commented 10 months ago

@deltamod3 this should definitely be fixed for lambdas. The ideal solution in my opinion is not even disabling the cache and save chart, but a combination of:

Thanks for reporting, @nautics889, this is very important in production. However, I think the lambda shouldn't be affected either, as it's standalone and should be based on a lambda layer, which in the end should be very similar to a dockerized version.

deltamod3 commented 10 months ago

Hi @nautics889, @gventuri, thank you for your replies. Yes, I have already deployed it to Lambda using a container image. (Due to the Lambda size limit, I can only deploy it as a Docker image from ECR.)

Running it on a Lambda function is not a problem. My concern is why it always creates these folders. If the user disables the cache and chart export, it shouldn't create them. Am I clear?

Also, in Lambda it can't create folders in the root folder (the root folder in Lambda is /var/task); we can only create folders under /tmp.
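For illustration, the constraint above can be handled without hard-coding paths. This is a minimal sketch that relies only on the standard library; it is not PandasAI code, and `pick_writable_dir` is a hypothetical helper name.

```python
import os
import tempfile


def pick_writable_dir(preferred: str) -> str:
    """Return `preferred` if it is an existing, writable directory.

    Otherwise fall back to the system temp directory, which is
    typically /tmp on Linux.
    """
    if os.path.isdir(preferred) and os.access(preferred, os.W_OK | os.X_OK):
        return preferred
    return tempfile.gettempdir()


# Hypothetical usage: prefer the current directory, fall back to temp.
base_dir = pick_writable_dir(os.getcwd())
print("Using base dir:", base_dir)
```

Checking writability directly, rather than checking for a Lambda-specific environment variable, would also cover other read-only deployments, at the cost of being less explicit about the target platform.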

So I changed the root folder using os.chdir() and added an empty pandasai.json (because a folder isn't identified as the project root without that file), which finally solved it.

But I think it would be better not to create these folders at all in my case. Am I clear?

Thank you again.

deltamod3 commented 10 months ago

Here is my solution.

    # OpenAI settings
    llm = OpenAI(api_token=api_key, temperature=0.1)

    # Set working dir to /tmp, so we can prevent errors in lambda function
    if os.environ.get("AWS_LAMBDA_FUNCTION_NAME") is not None:
        tmp_dir = "/tmp"
    else:
        current_file_path = os.path.abspath(os.getcwd())
        tmp_dir = os.path.join(current_file_path, "tmp")
    # exist_ok=True already makes this a no-op when the folder exists
    os.makedirs(tmp_dir, mode=0o777, exist_ok=True)

    os.chdir(tmp_dir)
    print("Current work dir", os.getcwd())

    # Create an empty pandasai.json file so that pandasai identifies this folder as the root folder
    if not os.path.exists("pandasai.json"):
        with open("pandasai.json", "a"):
            pass

    # Create SmartDatalake agent
    agent = SmartDatalake(
        dfs=[df],
        config={
            "llm": llm, 
            "enable_cache": False, 
            "save_logs": False
        },
        memory=Memory(memory_size=4)
    )

    # Chat with the agent
    response = agent.chat(query)

HenriqueAJNB commented 10 months ago

@deltamod3, can you please open a PR with your solution?

nautics889 commented 10 months ago

@HenriqueAJNB I doubt the code above can be treated as a solution, since it looks like a client-side modification that works around the directory creation failure by changing the cwd, rather than a fix to the library's behaviour. The main goal should be that pandasai does not create the directories when save_charts is False.


There is a PR covering exactly the points I mentioned above. @deltamod3, you're welcome to test it if you're still interested. In short, if you pass "save_charts": False or "enable_cache": False, the directories (./exports/charts and ./cache respectively) won't be created.
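As an illustration of that behaviour, here is a minimal sketch. It is not the code from the PR: `create_output_dirs` and the plain-dict `config` are hypothetical, and only the option names and folder names are taken from this thread.

```python
import os
import tempfile


def create_output_dirs(base_dir: str, config: dict) -> list:
    """Create only the directories that the enabled features need.

    Hypothetical helper: an option that is missing from `config`
    is treated as disabled here, which is an arbitrary choice.
    """
    created = []
    if config.get("save_charts"):
        charts_dir = os.path.join(base_dir, "exports", "charts")
        os.makedirs(charts_dir, exist_ok=True)
        created.append(charts_dir)
    if config.get("enable_cache"):
        cache_dir = os.path.join(base_dir, "cache")
        os.makedirs(cache_dir, exist_ok=True)
        created.append(cache_dir)
    return created


# With both options disabled, nothing is written to disk.
with tempfile.TemporaryDirectory() as base:
    create_output_dirs(base, {"save_charts": False, "enable_cache": False})
    print(os.listdir(base))  # no directories were created
```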


CC: @gventuri

deltamod3 commented 10 months ago

Thank you, everyone!