approximatelabs / sketch

AI code-writing assistant that understands data content

Joining dataframes #16

Open dave-killough opened 1 year ago

dave-killough commented 1 year ago

Nice. I've been doing similar things with SQL sources. I'm curious why you attach the ask to a specific dataframe. Can multiple dataframes be considered in one request? Thanks

bluecoconut commented 1 year ago

Hey @dave-killough,

I attached to a specific dataframe because it was the easiest way to "gather" the local context. In a Python kernel, a bare sketch.ask(...) could inspect the locals() of the calling frame, but that risks picking up lots of unintended variables and too many dataframes. It's also somewhat rough in Jupyter-style kernel sessions, where navigating the call stack and finding other variables seemed to be difficult (though if you have a trick here, this might make it really easy).
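(For concreteness, the sort of locals() sweep I mean is roughly the following; this is just an illustration of the idea, not what sketch actually does, and the frame-walking is exactly the part that gets fragile in Jupyter:)

import inspect
import pandas as pd

def gather_local_dataframes():
    # Illustration only: collect every pandas DataFrame visible in the
    # caller's frame. In a notebook this happily scoops up scratch and
    # intermediate dataframes you never meant to include.
    caller = inspect.currentframe().f_back
    return {name: obj for name, obj in caller.f_locals.items()
            if isinstance(obj, pd.DataFrame)}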

In many Jupyter notebooks, people also often re-use the same dataframe (editing it in place), so for clarity I felt the answer should be contextualized to a single dataframe.

One option I considered for multi-dataframe joins was something like

df.sketch.howto("question", extra_contextual_dfs=[*dfs])

but that didn't feel "easy to type" / "native".

Do you have any ideas or preferences for what would make a good API?

dave-killough commented 1 year ago

Thanks for the explanation. Specifying the dataframes to include is important: a notebook often accumulates many working dataframes that should be excluded from end-result queries. It seems useful to specify the dataframes once and then have a higher-level ask function that can connect them. ChatGPT is quite good at modeling table relationships from schema alone, and we're getting effective SQL joins as a result. I expect it could work similarly with dataframe merges.

Maybe something like this:

sketch.dataset([client_df, purchase_df, contact_df])
sketch.ask("who bought the most last year?")

dnk8n commented 1 year ago

Asking questions across multiple dataframes would be very interesting. I was hoping to do something like this:

import pandas as pd
import sketch
from sqlalchemy import create_engine

# Define a class to store views
class Views:
    def __init__(self, schema_name, db_url):
        self.views = {}
        self.history = []
        self.engine = create_engine(db_url)
        self.schema_name = schema_name

        # Get all views in the schema
        views = pd.read_sql_query(f"SELECT table_name FROM information_schema.views WHERE table_schema = '{schema_name}'", con=self.engine)

        # Loop through each view and read it into a Pandas dataframe
        for view in views['table_name']:
            df = pd.read_sql_query(f"SELECT * FROM {schema_name}.{view}", con=self.engine)
            self.add_view(view, df)

    def add_view(self, name, df):
        self.views[name] = df

    def ask(self, question):
        # Wished-for API: a module-level sketch.ask that accepts several
        # named dataframes doesn't exist today.
        answer = sketch.ask(question, self.views.items())
        self.history.append((question, answer))
        return answer

# Connect to Postgres
db_url = 'postgresql://user:password@localhost:5432/mydatabase'
schema_name = 'my_schema'

# Create an instance of Views
my_views = Views(schema_name, db_url)

# Ask questions of all views and store history
my_views.ask('What are the top problematic projects with respect to planned vs actual time tracked?')
my_views.ask('Which activities appear to require upskilling?')

# Interrogate my_views.history for example
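
In the meantime, the closest approximation I can see with the current per-dataframe accessor is asking each view on its own, which loses the cross-view reasoning but keeps the loop-over-views shape:

# Rough workaround with today's API: ask each view independently
# (no joins across views, so cross-view questions won't really work).
for name, df in my_views.views.items():
    print(f"--- {name} ---")
    df.sketch.ask('Which activities appear to require upskilling?')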