crate / crate-python

Python DB API client library for CrateDB, using HTTP.
https://cratedb.com/docs/python/
Apache License 2.0
79 stars 30 forks source link

Improve reading and writing pandas DataFrames #376

Closed amotl closed 10 months ago

amotl commented 3 years ago

Hi there,

influxdb-python received support for reading and writing Pandas DataFrames through https://github.com/influxdata/influxdb-python/pull/86 by @timtroendle the other day. The usage of its influxdb.DataFrameClient is outlined within tutorial_pandas.py and is just a matter of invoking

client.write_points(df, "testdrive", protocol="line")

I believe it would be nice to also see respective support for CrateDB in order to unlock quite an amount of more scenarios this driver can be used for.

With kind regards, Andreas.

amotl commented 3 years ago

When looking at the CrateDB and Linear Regression Notebook, this might not even be hard to implement and just a matter of adding a small wrapper for improved convenience, so that we do not have to write this over and over again:

# Read from database into dataframe.
df = pd.read_sql(sql_query, 'crate://localhost:4200', index_col='id')

# Write dataframe to database.
df.to_sql(tablename, 'crate://localhost', if_exists='append', index=False)

cc @patrickFuerst, @mechanomi

patrickFuerst commented 3 years ago

@amotl maybe I'm missing something, but how would a wrapper make it more convenient than just calling these two lines for reading and writing? One thing I can remember that sometimes I had to make sure that the datatypes of the data frame columns where actually the right ones after reading from CrateDB.

amotl commented 3 years ago

Hi again,

I've investigated the situation more thoroughly with a demo program for exporting pandas data frames to SQL databases.

Observations

PostgreSQL compatibility

When using the PostgreSQL protocol,

pandas_to_sql(dburi="postgres://crate@localhost/dwd")

it obviously croaks with the venerable

sqlalchemy.exc.InternalError: (psycopg2.errors.InternalError_) line 1:1: mismatched input 'ROLLBACK' expecting {'SELECT', 'DEALLOCATE', 'CREATE', 'ALTER', 'KILL', 'BEGIN', 'COMMIT', 'ANALYZE', 'DISCARD', 'EXPLAIN', 'SHOW', 'OPTIMIZE', 'REFRESH', 'RESTORE', 'DROP', 'INSERT', 'VALUES', 'DELETE', 'UPDATE', 'SET', 'RESET', 'COPY', 'GRANT', 'DENY', 'REVOKE'}

Databases vs. schemas

However, the HTTP driver variant, when trying to address a specific database - here, dwd -

pandas_to_sql(dburi="crate://crate@localhost/dwd")

also croaks with:

  File "/Users/amo/dev/earthobservations/wetterdienst/.venv/lib/python3.8/site-packages/crate/client/sqlalchemy/dialect.py", line 198, in connect
    return self.dbapi.connect(servers=server, **kwargs)
TypeError: connect() got an unexpected keyword argument 'database'

Evaluation

I know that CrateDB supports neither the ROLLBACK statement nor the CREATE DATABASE statement. This fact might be added to the Unsupported features and functions section of the documentation [1] and - on the other hand - there might be room for improvements.

crate=> ROLLBACK;
ERROR:  line 1:1: mismatched input 'ROLLBACK' expecting {'SELECT', 'DEALLOCATE', 'CREATE', 'ALTER', 'KILL', 'BEGIN', 'COMMIT', 'ANALYZE', 'DISCARD', 'EXPLAIN', 'SHOW', 'OPTIMIZE', 'REFRESH', 'RESTORE', 'DROP', 'INSERT', 'VALUES', 'DELETE', 'UPDATE', 'SET', 'RESET', 'COPY', 'GRANT', 'DENY', 'REVOKE'}
crate=> CREATE DATABASE testdrive;
ERROR:  line 1:8: no viable alternative at input 'CREATE DATABASE'

Epilogue

While CrateDB fails to satisfy the CREATE DATABASE statement, pandas_to_sql(dburi="crate://crate@localhost") works perfectly well. So, it currently has advantages over both QuestDB and CockroachDB, which currently even fail a) to create the database schema and b) to connect at all due to other reasons.

With kind regards, Andreas.

[1] https://crate.io/docs/crate/reference/en/4.4/appendices/compatibility.html#unsupported-features-and-functions

inviridi commented 2 years ago

Hi @amotl,

I was inspired by your ideas but struggled more than I am willing to admit (mostly unhelpful error messages). Yet, I found a way that works for me: Just UNNEST a list of records into an object column.

def insert_df_into_crate(tn, df):
    """tn = table name, df = dataframe"""

    connection = crateclient.connect("localhost:4200")
    cursor = connection.cursor()

    cursor.execute(f"""create table if not exists {tn} (countid int) with (column_policy = 'dynamic');""")

    col_string = ""
    for col in df.columns:
        col_string = col_string + col + ", "
    col_string = col_string[:-2]

    qm_string = "".join(["?, "] * len(df.columns))[:-2]

    try:
        cursor.execute(f"INSERT INTO {tn} ({col_string}) (SELECT * FROM UNNEST({qm_string}));", 
                       (df.to_dict(orient="records"),))

    except Exception as e:
        print(f"encountered exception: {e}")
    cursor.execute(f"refresh table {tn};")
    cursor.close()
    connection.close()

df = load_json_telegraf_log("test2.json").reset_index() #load some data
insert_df_into_crate("crate_test_table", df)

I wanted to share this because others might find it helpful. Have not yet found an elegant way to get rid of the object and extract them to regular columns though.

Thanks for your input and warm regards, Frank

amotl commented 1 year ago

Dear @inviridi,

thanks a stack for sharing your code. I have to admit I did not evaluate it yet, but I am looking forward.

However, after working on different improvements about efficient data import with pandas and CrateDB recently, we just uploaded a corresponding code snippet at Demonstrate efficient batched inserts using CrateDB and pandas. It may be helpful to the community.

Specifically, the --mode=bulk option (it's the default anyway), effectively using the insert_bulk function, is more efficient than the other methods, because it uses CrateDB's bulk operations endpoint ^1, so it saves the client and server from needing to marshal and unmarshal huge SQL statements.

With kind regards, Andreas.

/cc @hlcianfagna, @hammerhead, @marijaselakovic

inviridi commented 1 year ago

Hi Andreas,

thanks a lot for your hard work on CrateDB and this specific feature! Although I use CrateDB less frequently than I'd like to, I very much enjoy seeing feature improvements, your professional interactions with the community and the helpful posts by your colleagues.

I'll check you the efficient batch inserts when the need arises and give feedback here.

All the best and happy belated Easter, Frank