GlareDB / glaredb

GlareDB: An analytics DBMS for distributed data
https://glaredb.com
GNU Affero General Public License v3.0

BUG: COPY TO a Delta table via the Python bindings errors when .to_pandas() is called #3067

Open talagluck opened 2 weeks ago

talagluck commented 2 weeks ago

Description

When COPYing TO a Delta table using the Python bindings, if .to_pandas() (or .show()) is appended to the statement, the COPY succeeds but then immediately errors with:

DataFusionErrorException: External error: External error: Generic error: A Delta Lake table already exists at that location.

Code to reproduce (with delta):

con.sql("""
copy (
select * from read_postgres('postgresql://postgres:postgres@localhost:5433/postgres','public','web_events')
) to 'file:///Users/talgluck/Documents/glaredb/videos/new_copy_to' format delta
""").to_pandas()
tychoish commented 2 weeks ago

typo

talagluck commented 2 weeks ago

I just reproduced the error again with COPY TO. Re-opening.

tychoish commented 2 weeks ago

The error message looks correct. It's not obvious what COPY TO should do when you re-execute it: for single-file output, you can overwrite an existing file (which can be atomic, even on object storage), but for multi-file formats like lance/delta, I think it's probably better to error if there are files there: we can't do multi-file transactions safely, and COPY TO is supposed to create a new thing anyway, so erroring rather than overwriting feels safe.
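
A quick sketch of that distinction using the Python bindings (a hedged illustration: the file names and the CSV overwrite path are assumptions; only the Delta error is confirmed in this issue):

```python
import glaredb

con = glaredb.connect()

# Single-file output: re-running overwrites the existing file,
# which can be atomic even on object storage.
con.sql("copy (select 1 as a) to 'out.csv'").execute()
con.sql("copy (select 1 as a) to 'out.csv'").execute()  # overwrites, no error

# Multi-file formats (delta/lance): the second run errors, because a
# Delta table already exists at that location.
con.sql("copy (select 1 as a) to 'out_delta' format delta").execute()
# con.sql("copy (select 1 as a) to 'out_delta' format delta").execute()  # errors
```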

talagluck commented 1 week ago

Yes, I agree that erroring on a re-run makes sense. The surprising thing here is that calling .to_pandas() seems to re-run the command.

tychoish commented 1 week ago

Right, to_pandas() (etc.) takes the object (which is just a wrapper around the query and the session), so all of the methods on this object end up re-running the query.

This is why/how you can reference a Python variable holding a query object as a table in another query.
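
For example, a minimal sketch of that pattern (the `events` query and its columns are made up):

```python
import glaredb

con = glaredb.connect()

# `events` holds a lazy query object; nothing has executed yet.
events = con.sql("select 1 as id, 'click' as kind")

# The variable name can be referenced as a table in a later query.
# Every resolution (.show(), .to_pandas(), ...) re-runs the underlying
# query, which is exactly the re-execution behavior described above.
con.sql("select count(*) from events").show()
```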

I'm not sure I remember what the output of COPY TO is, but I'm not sure what use it would be as a pandas DataFrame? I guess I'm not exactly sure what the desired behavior is here.

I think this would happen with any DDL operation.

talagluck commented 1 week ago

Yup - to be clear, this is user error. I'll often take the same line of con.sql("insert query").to_pandas() and copy it, then sometimes forget that the .show() or .to_pandas() is there when I reuse it for DDL operations. Is there any way to improve the error message here? Like catching the error differently when the Python bindings are used for DDL operations?

tychoish commented 1 week ago

at least in the current system, this is... possible but kind of awkward.

I think we can put something on this struct to make it clearer when we've already run something (at this very high level) and provide a better error message in this case.

We could also see about wrapping the execution plans for inserts/writes/etc., but that's more invasive and requires a lot of updates in a bunch of different places. (Also, you can imagine an INSERT INTO operation working multiple times, as designed.)

talagluck commented 1 week ago

I would definitely not want an INSERT INTO operation to run multiple times because I added a .to_pandas() call at the end. I can't think of any operations that I would want to run multiple times without realizing it.

tychoish commented 1 week ago

If you had an INSERT INTO operation that ran a query with some kind of dynamic operation (e.g. it called a function that accessed an external resource, or even something like now()), I think you could reasonably expect that it would run every time you executed it?

While to_pandas() is what runs the query here, it's nothing about to_pandas() in particular: the same holds true for .execute() and .show(), and I don't think "DDLs only execute once when called by a method which processes their output, but might execute more than once if called with .execute()" is a particularly useful rule.

DDL and DML statements also execute eagerly, even if you call them in a way (e.g. via .sql()) that would be lazy for a normal query.

There are some options, and I think all of them have some bad outcomes.

I think the current behavior is simple and defensible, but I agree it's somewhat confusing. There are other things we could implement (at some cost in design, complexity, and impact on the system as a whole), but I'm not sure any of them is specifically worth that cost; we could work through a number of solutions.

talagluck commented 1 week ago

I think we may be talking past each other, and it might be because I don't fully understand the internals here.

If I write something like

con.sql("select * from my_table")

from my perspective as a user, nothing really happens. My understanding is that this is because, for queries like this, GlareDB evaluates lazily when using con.sql() instead of con.execute().

When I then add .show() or .to_pandas() to the end of that, I see the output of the query, because GlareDB is actually executing. In this case, nothing is run twice, but it also wouldn't usually be a big deal if it were run twice because it's a select statement.

If I instead write a DDL operation like

con.sql("COPY (SELECT 1) TO file.csv")

My understanding is that even though this uses con.sql(), GlareDB evaluates it eagerly, because it's a DDL operation. But that means the exact same operation is run again if .show() or .to_pandas() is added.

What I would like is some kind of detection of whether con.sql() is being evaluated lazily or eagerly. If it is lazy, then .to_pandas() can do what it needs to do to produce the expected output. If it is eager, then it should raise an error or at least a warning. I think what I'm missing is what actually happens when .show() or .to_pandas() is called.
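
In other words, a minimal sketch of the two paths (the Delta statement mirrors the repro at the top; the output path is made up):

```python
import glaredb

con = glaredb.connect()

# Lazy path: a plain query does not execute on .sql(); it runs when resolved.
q = con.sql("select 1 as a")
df = q.to_pandas()  # executes here (and would execute again on each call)

# Eager path: COPY TO executes as part of .sql() itself...
op = con.sql("copy (select 1 as a) to 'out_delta' format delta")
# ...so resolving the returned object re-runs the statement, which is
# the source of the "table already exists" error in this issue:
# op.to_pandas()  # would run the COPY a second time and error
```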

tychoish commented 1 week ago

> If it is eager, then it should raise an error or at least a warning. I think what I'm missing is what actually happens when .show() or .to_pandas() is called.

Basically (at least on the Rust side), all methods on the connection object, including .sql(), return an Operation object, which doesn't execute until it's resolved (show()/to_<output>()) and can then execute multiple times. If it's a DDL operation, we run it first but then return the object.

Every Operation can be read as a table in later statements (referenced by the same variable name in a query), and so they need to be re-runnable, including operations created with execute(). This is all very high level and happens (in Rust) above the entire execution layer: we can annotate the operations that were DDLs passed in as SQL, but that might miss DDLs passed in via .execute(), so it gets a bit delicate.
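
For illustration only, a hypothetical sketch of that annotation idea, written in Python even though the real Operation lives on the Rust side; the class shape, the is_ddl flag, and the error message are all invented:

```python
class Operation:
    """Hypothetical wrapper around a query plus its session (names invented)."""

    def __init__(self, query: str, is_ddl: bool):
        self.query = query
        self.is_ddl = is_ddl
        if is_ddl:
            self._run()  # DDL/DML executes eagerly, once, at creation time

    def _run(self):
        ...  # execute the query against the session

    def to_pandas(self):
        # The annotation: instead of silently re-running an eager DDL,
        # surface a clearer error than the current storage-level one.
        if self.is_ddl:
            raise RuntimeError(
                f"{self.query!r} already executed when it was created; "
                "resolving it again would re-run the statement"
            )
        return self._run()
```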