Chore: Delta tables should not require an empty directory

talagluck commented 1 month ago

Description

Currently, when copying to a Delta table, a user needs to create an empty directory, and then pass that empty directory in to the COPY TO command, e.g.:

COPY (SELECT 1) TO 'file:///Users/path/to/your_empty_directory'

This will raise an error if the directory does not exist, and will also raise an error if the directory is not empty. Instead, we should create the directory for the user before copying the Delta table to it. This removes ambiguity for the function and is also more consistent with the behavior of other COPY TO formats.

When copying to a CSV or Parquet, the file is created directly. If a Delta table is comprised of a directory with some files and logs in it, it makes sense that we would create the entire Delta table, including the directory, instead of filling an existing directory with the contents of a Delta table.

Instead, when copying to a delta table, it should look like this:

COPY (SELECT 1) TO 'file:///Users/path/to/not_yet_created_directory'

tychoish commented 1 month ago

This is a pretty ambiguous case, though it doesn't seem it: on the surface, the thing you expect, should just work or be able to work, but if you dig in a bit further it becomes really hard to make the semantics come out reasonable.

There are a couple of high level facts and assumptions here:

the delta specification assumes that there can be multiple independent writers to a single table.
delta (and iceberg and lance) write data to more than one file as part of their MVCC transaction process
many outputs only write to a single file. (json, bson, etc.)
glaredb would like to expose consistent semantics for a given data source/format: this is confusing for users and leads to documentation that lists lots of exceptions to rules that confuse and frustrate users.
object storage and file systems have different underlying semantics
- directories in object storage don't exist: it's (basically) flat key-value system, with "paths" just being keys with specific prefixes.
- operations to a single key are atomic

There's already some degree of awkward semantics around COPY TO, depending if the target of the operation exists or not. If the target doesn't exist, then everything should create the resources. If the target data source exists, then the operation should overwrite existing data. This is a pretty reasonable high level target, but it's hard to actually implement this:

for single file outputs on object storage, this is easy: copy to will create or overwrite the file, and if there are multiple writers you will always get the output of one and only one of the writers.
for single file outputs on local file systems, you generally don't get atomic operations, (or at least) it's a touch harder to get this to work consistently given different filesystems. I'm not sure if our file system library would do an atomic write (e.g. you have to write to a temporary issue, and then use rename)
for multifile outputs remotely, the fact that directories don't work, are in our favor: two writes can race, but the write that updates the metadata (index of pointers to versions) "wins" and readers will only see one at a time. I don't think we support this at the time.
for multifile outputs locally, there's no way to "delete" the previous files from a previous write operation cleanly so that other readers wouldn't view the table as dropped. I think most of our applicable data sources just error if there is already data where the table "should `go".
native storage, in local operation is in theory susceptible to multiple-writes, but in practice isn't really (or at least can be made to do this correctly because we don't allow multiple glaredb processes to write to the same local storage.
for (some) database systems you can do this as part of a transaction, though you're probably still better to write it to a temporary table first and then do the rename and drop in a transaction.

None of these are exactly the case we have here. I think we might be able to relax the requirement for the empty directory and just create the directory, but if there is a directory that exists and is empty, should we write into it? What about data that isn't delta-related data? It's a lot harder to say...

talagluck commented 1 month ago

I hear you. In service of trying to keep this simpler, I'm trying to paint with a broad brush, while also realizing that we're not capturing some nuance here. Taking that approach, I would say:

In the case of, say, a local CSV, COPY TO will overwrite the file if it exists.
Given that a Delta table consists of a directory containing some files, if COPY TO is used, it would overwrite the directory and anything inside of it. I don't think it should write into an empty directory.

That said, I think we should consider an "overwrite" or "if not exists" argument with COPY TO (for all file types), and we may want to consider setting the default there to raising an error instead of overwriting.

tychoish commented 1 month ago

Given that a Delta table consists of a directory containing some files, if COPY TO is used, it would overwrite the directory and anything inside of it.

This isn't really possible as an atomic operation. (that's sort of the crux of the problem). It's not (particularly) possible in the cloud, but our tools aren't as clear there.

I don't think it should write into an empty directory.

This is fine, but "will write into a directory with files in it but not if the directory exists and is empty."

If, however, you have two threads or operations running at the same time (locally) what do you do?

The "most correct answer" is to not put tables in the file system in the addressable path that a user tells us to, and rather write the files to a directory with a UUID as a name and then map table names to tables at the catalog layer. This also means that "deletes" can be a metadata operation (delete from the catalog) with cleanup deletes happening later. (I think we do something similar for native storage, or certainly can), but when we're writing files out for people it's a different story.

The multi threaded experience is worse for writing (single) local files than it is for object store files: locally we'll end up corrupting the file (on most filesystems, though perhaps not (always) on windows, definitely a silver lining) whereas on object storage only one operation can succeed.

I am in favor of "if not exists" and "overwrite" being options.

talagluck commented 1 month ago

Could you explain a bit more about atomicity and multi-threading here? I guess, more specifically, how does this case differ from, say, two people trying to write the same large Parquet file at the same time?

"will write into a directory with files in it but not if the directory exists and is empty."

I don't think it should do this either. I would think that COPY TO would fail if the directory exists at all.

The "most correct answer" is to not put tables in the file system in the addressable path that a user tells us to, and rather write the files to a directory with a UUID as a name and then map table names to tables at the catalog layer. This also means that "deletes" can be a metadata operation (delete from the catalog) with cleanup deletes happening later. (I think we do something similar for native storage, or certainly can), but when we're writing files out for people it's a different story.

Could you say a bit more about this? It feels kind of like an additional layer of log file, on top of the log file that's already built in to a Delta table. Where would the mapping live? For clarity's sake, could you sketch out what this would look like for the initial write, and then for any subsequent operations?

I think you understand the internals here much better than I do, so ELI5 here would be really helpful to get aligned.

tychoish commented 1 month ago

Could you explain a bit more about atomicity and multi-threading here? I guess, more specifically, how does this case differ from, say, two people trying to write the same large Parquet file at the same time?

Delta saves data in multiple files (i.e. only the changes/"deltas") and has a single metadata file that maps where data are and tells the client how to assemble a single table from a bunch of parquet files. This means that individual write operations can be very small (write the data to the parquet file and then the metadata file as the commit) which means that multiple writers can write to a single delta table without coordination (because of the all-or-nothing promise of the metadata write to a single file,) you get full MVCC semantics without coordination, which is pretty nifty.

There might be edge cases about two writers trying to create a single table (haven't read the details around the spec). Filesystems don't (all) have the same kinds of atomic write promises that object storage has (in change for other promises,) which means "external local filesystem tables (which might have multiple writers)" and "native storage on local file systems with a cluster of worker nodes," are not (to my mind, at the moment) definitely safe (without further testing and investigation).

talagluck commented 1 month ago

I can see how this would apply to INSERTs (I imagine that when inserting into Delta, we're actually creating a new file with the changes), but does it apply to COPY TO as well (in a way that's different from the other COPY TO formats)?

Could you also explain why creating the directory changes the outcome here? Is it that it takes two steps instead of one (creating the directory and then creating the table)?

tychoish commented 1 month ago

Could you also explain why creating the directory changes the outcome here? Is it that it takes two steps instead of one (creating the directory and then creating the table)?

Exactly.

Earlier you said we should have the following semantics:

Given that a Delta table consists of a directory containing some files, if COPY TO is used, it would overwrite the directory and anything inside of it.

On object storage, you don't actually have to delete anything to write a new table.

and then

I would think that COPY TO would fail if the directory exists at all.

These can't both exist at the same time.

The first rule isn't really possible to implement locally because the second writer would start deleting things that the first writer. You can implement this in object storage, because I think we just don't delete things, until a later vacuum operation.

The second rule is possible to implement (locally and in object storage) but does mean that you'd have to go in and manually clean up after failed operations, which is dodgy, and I think isn't super useful.

To be clear, the current behavior is weird as hell, and I think we should change it. The problem is but to what. I think we try and make a directory if possible, and maybe just proceed with directories that have files in them already (there are unlikely to be name collisions, and this is essentially what happens in object storage.

talagluck commented 4 weeks ago

Ah, thanks for explaining. That all makes sense.

What you propose seems reasonable. I think that creating the directory if it doesn't exist, and otherwise writing to the existing directory seems OK. It would help if we log what we're doing (creating a new directory vs writing to an existing directory) to make it less surprising for people.

tychoish commented 4 weeks ago

Where do you think we should be logging or noting this (and do we have other logging like this that people might expected to look at?

talagluck commented 4 weeks ago

I was imagining this would happen in the same place where we output the number of rows written, or that the operation succeeded or failed. E.g. currently this:

create table abc as select 1

outputs Table created. I'm imagining this would in the same place. There may need to be a bit of logic to rephrase based on whether this is local (Directory created) vs Cloud storage (Prefix created).

tychoish commented 4 weeks ago

There may need to be a bit of logic to rephrase based on whether this is local (Directory created) vs Cloud storage (Prefix created).

We can't do this for cloud as prefixes aren't explicitly created (e.g. objects with a prefix are created.)

I was imagining this would happen in the same place where we output the number of rows written, or that the operation succeeded or failed.

This is only an artifact of the shell itself, and I believe not something that is happening lower in the database. Plumbing the message from the data source up to the shell is a pretty big lift, and it doesn't quite answer what we do in the bindings.

talagluck commented 4 weeks ago

We can't do this for cloud as prefixes aren't explicitly created (e.g. objects with a prefix are created.)

Could we check to see whether objects with the prefix exist, and message accordingly?

This is only an artifact of the shell itself, and I believe not something that is happening lower in the database. Plumbing the message from the data source up to the shell is a pretty big lift, and it doesn't quite answer what we do in the bindings.

I'm not quite following here. Could you explain with a bit more detail?

Do you have any suggestions for another approach?

tychoish commented 4 weeks ago

I'm not quite following here. Could you explain with a bit more detail?

From the perspective of user's there isn't really anything/anywhere to log during successful operations, or at least not consistently given the different ways glaredb is used/called, because conceptually it's still client/server and the results of a successful operation are a [conceptual] cursor of results.

Do you have any suggestions for another approach?

Nope. I don't think "logging on success" is a thing we can (or should) do. We can definitely log normally, user's will (mostly) not see this.

talagluck commented 4 weeks ago

Currently, when I call something like:

con.sql("copy (select 1) to 'hello.csv'").show()

I see a table with the message:

	result
0	Copy success	'

How does this differ from logging if a table is created?

tychoish commented 3 weeks ago

These results are rendered here and while we could have a different message, it'd be a bit of a bodge. Also, I think it would be good if this operation (like all operations that write data) return a message with the total number of rows written.

talagluck commented 3 weeks ago

I think that having generally useful information about the write would be good, including the number of rows and whether it was successful.

GlareDB / glaredb

Chore: Delta tables should not require an empty directory #3160

Description